Unofficial empeg BBS

#318069 - 13/01/2009 18:03 bash scripting (xml parsing) help... (mainly awk and sed)
hybrid8
carpal tunnel

Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
I wanted a solution to grab trailers from Apple so that I can feed them into my SageTV media center and browse them within its library. Since I haven't been able to find a ready-made solution I thought I'd need to cook something up myself.

But I managed to find a bash script written by someone who uses a different media center program. It works, but it does a few things to the trailer filenames that I don't like, so I've been modifying it.

I'm running the script from cygwin on a Windows box, so I don't have the full set of tools that might otherwise be installed on something like my web server (or I would likely have tried redoing the whole thing in PHP). ;)

The script is below. I've commented out some functional lines so I can test the values of some variables (the part I'm having issues with right now).

Here's also a link to the original:
http://forum.team-mediaportal.com/plugins-47/mytrailers-42622/index11.html#post291349

Basically you set a path to store a DB which keeps track of what's already been downloaded and another path to store the files. You set a parameter which tells the script whether to try and get 1080 versions. It reads in the XML for Apple's trailer RSS feeds and parses out the filenames and other fields.

I am trying to modify the original to create a folder for each trailer and name that folder and the trailer according to the name of the movie. That is, instead of the original filename which doesn't have any spaces and may contain extra characters like "tlra_640w" etc...

I don't know much about awk or sed, or about scripting with bash, which is why I'm posting. :)

The main issue at the moment is parsing the fields grabbed from the XML. They're collected in TRAILERS and used to have a semicolon between them. I've changed this to ";field" to see if I can distinguish the field separator from any legitimate use of a semicolon within the data.

But it's still failing to get the name "Angels &amp; Demons", which is the first movie name with a space and a semicolon in it. We'll get to the subject of decoding the &amp; and similar entities later.

This line specifically:

Code:
MOVIETITLE=`echo $MOVIE | awk  'BEGIN { FS = ";field" } ; { print $2 }'`


Is only bringing back the word "Angels" from the above. Probably failing because of the space, but I don't know how to make it bring back its results enclosed in quotes.

Code:
#!/bin/bash

GET1080p=0
GETPOSTER=1
SAVEPATH="v:/Movies/zzztrailertest/"
DLDBPATH="d:/AppleTrailers/"

FEEDS="http://www.apple.com/trailers/home/xml/current_720p.xml http://www.apple.com/trailers/home/xml/current.xml"


tail -5000 $DLDBPATH.downloaded.db > $DLDBPATH.downloaded.db.tmp
mv $DLDBPATH.downloaded.db.tmp $DLDBPATH.downloaded.db

for FEEDURL in $FEEDS; do

TRAILERS=`xml sel --net -D -T -t -m "/records/movieinfo"\
 -v "@id" -o ";field"\
 -v "info/title" -o ";field"\
 -v "info/postdate" -o ";field"\
 -v "preview/large" -o ";field"\
 -v "poster/xlarge"\
 -n $FEEDURL`

for MOVIE in $TRAILERS; do

MOVIEID=`echo $MOVIE | awk  'BEGIN { FS = ";field" } ; { print $1 }'`

MOVIETITLE=`echo $MOVIE | awk  'BEGIN { FS = ";field" } ; { print $2 }'`
	MOVIETITLEFILE=`echo $MOVIETITLE |sed 's/.*\///'`

#temporary output to show grabbed title

echo "=======##### Title: $MOVIETITLE -----------------"


POSTDATE=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $3 }'`

BEXTENSION="[Trailer].mov"

PREVIEW=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $4 }'`
	PREVIEWFILE=`echo $PREVIEW |sed 's/.*\///' |sed 's/\.mov$/.hdmov/g'`
	NEWPREVIEWNAME="$MOVIETITLE $BEXTENSION"

PREVIEW1080p=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $4 }' |sed 's/a720p\.mov$/h1080p.mov/g'`
	PREVIEWFILE1080p=`echo $PREVIEW1080p |sed 's/.*\///' |sed 's/\.mov$/.hdmov/g'`
	NEWPREVIEWNAME1080p="$MOVIETITLE $PREVIEWFILE1080p"

POSTER=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $5 }'`
	NEWPOSTERNAME="folder.jpg"

MOVIESAVEPATH="$SAVEPATH$MOVIETITLE/"

#if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
#	mkdir $MOVIESAVEPATH
#fi

if [ "$GET1080p" -eq "1" ]; then
 if `echo $FEEDURL | grep -q 720p`; then
	if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
#		wget -c -O "$MOVIESAVEPATH$NEWPREVIEWNAME1080p" $PREVIEW1080p; PREVIEWOUT1080p=$?
                if [ $PREVIEWOUT1080p -eq 0 ]; then
                        echo "###$MOVIEID.PREVIEW $NEWPREVIEWNAME1080p" >> $DLDBPATH.downloaded.db
                else
                        echo "##### ID:$MOVIEID URL:$PREVIEW1080p FAILED -- TRYING ORIGINAL 720p URL NEXT"
                fi
	fi
 fi 
fi

	if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
#		wget -c -O "$MOVIESAVEPATH$NEWPREVIEWNAME" $PREVIEW; PREVIEWOUT=$?
    		if [ $PREVIEWOUT -eq 0 ]; then   
       		 	echo "###$MOVIEID.PREVIEW $NEWPREVIEWNAME" >> $DLDBPATH.downloaded.db
    		else
			echo "##### ID:$MOVIEID URL:$PREVIEW FAILED -- RETRY NEXT RUN"
		fi
	else
		echo "##### ID:$MOVIEID NAME:$NEWPREVIEWNAME MARKED DONE  -- SKIPPING"
	fi

	if [ "$GETPOSTER" -eq "1" ]; then
	 if ! grep -q "###$MOVIEID.POSTER" $DLDBPATH.downloaded.db; then
#		wget -c -O "$MOVIESAVEPATH$NEWPOSTERNAME" $POSTER; POSTEROUT=$?
		if [ $POSTEROUT -eq 0 ]; then
			echo "###$MOVIEID.POSTER $NEWPOSTERNAME" >> $DLDBPATH.downloaded.db
		else
			echo "##### ID:$MOVIEID URL:$POSTER FAILED -- RETRY NEXT RUN"
		fi
	else
		echo "##### ID:$MOVIEID NAME:$NEWPOSTERNAME MARKED DONE -- SKIPPING"
	 fi
	fi

done

done


Sample XML file (this is what it's parsing when it hits Apple's feed):

http://mypocket.com/current_720p.xml.zip


As mentioned, I'll also need to clean up the results by decoding things like &amp; back to "&", and I have no idea how to do that from this script. In PHP I can decode those with a call to html_entity_decode().

One of the remaining things to look at is I don't know if the usage of sed that's specified when grabbing the name is sufficient. Because I'm using the movie name to create a folder and a file, I can't have things like colons, slashes or other invalid characters be used. I can't guess what will be coming up in future movie names, so it's possible that invalid characters may be present either originally as plain text or from decoding any html entities (such as greater-than or less-than).


_________________________
Bruno
Twisted Melon : Fine Mac OS Software

Top
#318070 - 13/01/2009 18:05 Re: bash scripting (xml parsing) help... [Re: hybrid8]
tfabris
carpal tunnel

Registered: 20/12/1999
Posts: 31571
Loc: Seattle, WA
Brilliant idea. I'd love to have the current apple trailers page listed in my DVR's menu. Hm, my new DVR is networkable, maybe I can set that up somehow too.
_________________________
Tony Fabris

Top
#318071 - 13/01/2009 18:08 Re: bash scripting (xml parsing) help... [Re: tfabris]
hybrid8
carpal tunnel

Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
While this script can be modified to make some type of textual listing with links to individual trailers, I'm actually trying to download them all. This will run daily (cron/schedule) to keep me up to date. I'll eventually come up with something to allow me to expire or remove trailers. Though SageTV does allow me to perform deletions right from its UI.
_________________________
Bruno
Twisted Melon : Fine Mac OS Software

Top
#318076 - 13/01/2009 18:24 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: hybrid8]
wfaulk
carpal tunnel

Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
That one line seems to work correctly for me. What does $MOVIE contain at that point?
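
For reference, a minimal sketch of that awk call run against a complete row. The sample row and its URLs are made up for illustration, and the variable is quoted so the shell hands the whole row to awk unchanged:

```shell
#!/bin/bash
# Hypothetical sample row in the same ";field"-separated layout the script builds.
MOVIE='1234;fieldAngels & Demons;field2009-01-13;fieldhttp://example.com/a.mov;fieldhttp://example.com/a.jpg'

# Quoting "$MOVIE" keeps the row intact on its way into the pipe.
MOVIETITLE=$(printf '%s\n' "$MOVIE" | awk 'BEGIN { FS = ";field" } { print $2 }')
POSTDATE=$(printf '%s\n' "$MOVIE" | awk 'BEGIN { FS = ";field" } { print $3 }')

echo "Title: $MOVIETITLE"
echo "Date:  $POSTDATE"
```

If this prints the full title, the awk side is fine and the truncation is happening before the row reaches awk.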
_________________________
Bitt Faulk

Top
#318078 - 13/01/2009 18:46 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: wfaulk]
hybrid8
carpal tunnel

Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
I'll have to dump out the value of $MOVIE, but $MOVIETITLE contains only 'Angels' whereas I'd like it to contain 'Angels & Demons'
_________________________
Bruno
Twisted Melon : Fine Mac OS Software

Top
#318079 - 13/01/2009 19:02 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: hybrid8]
hybrid8
carpal tunnel

Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
Ok, $MOVIE doesn't contain the necessary information, so the problem is before the awk call that sets $MOVIETITLE.

$MOVIE immediately after the beginning of the for loop contains only the ID and the movie name up to the first word Angels.

The next time it passes through the loop it continues from where it left off, producing invalid results where the $MOVIE var contains only an ampersand

$TRAILERS does contain everything as I expected it to be. The ID, full movie title that already appears to have the ampersand converted from an html entity, date and the rest of the info, for every movie in the XML file.

I'm at least as stuck as I was originally though, having no idea why $MOVIE doesn't contain what I'd expect it to (the full contents of a single "row").
_________________________
Bruno
Twisted Melon : Fine Mac OS Software

Top
#318080 - 13/01/2009 19:09 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: hybrid8]
hybrid8
carpal tunnel

Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
Hmmm.. I'm the one that added the movie title to the collection of fields with this line:

Code:
 -v "info/title" -o ";field"\


Then I obviously had to adjust the position of the fields extracted.

Before I made that change, none of the fields captured from the XML contained spaces.

Maybe I should encode the space so that it isn't actually a space? It seems like the space is what's causing the termination of the MOVIE variable in the FOR.


Edited by hybrid8 (13/01/2009 19:11)
_________________________
Bruno
Twisted Melon : Fine Mac OS Software

Top
#318081 - 13/01/2009 19:12 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: hybrid8]
peter
carpal tunnel

Registered: 13/07/2000
Posts: 4172
Loc: Cambridge, England
Originally Posted By: hybrid8
Maybe I should encode the space so that it isn't actually a space? It seems like the space is what's causing the termination of the MOVIE variable in the FOR.

Yes, by default "for" splits words at any whitespace (space, tab, newline). To stop it doing that -- to make it split words at newlines only -- set the shell variable IFS to "\n".

Code:
$ cat > hybrid8.txt
a b
c
$ for i in `cat hybrid8.txt` ; do echo $i ; done
a
b
c
$ IFS="\n"
$ for i in `cat hybrid8.txt` ; do echo $i ; done
a b
c
$


Peter

Top
#318082 - 13/01/2009 19:13 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: hybrid8]
wfaulk
carpal tunnel

Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
Oh. It probably has nothing to do with the ampersand and everything to do with the space.

"for ... in" splits on whitespace, not newline. Before I spend a lot of time going down this road, delete the "Angels & Demons" line and see if it breaks similarly on "Astro Boy".
_________________________
Bitt Faulk

Top
#318083 - 13/01/2009 19:14 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: peter]
wfaulk
carpal tunnel

Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
Originally Posted By: peter
IFS

Good call.
_________________________
Bitt Faulk

Top
#318085 - 13/01/2009 19:39 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: wfaulk]
hybrid8
carpal tunnel

Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
Can anyone offer a suggestion as to why setting IFS='\n' causes it to use the "n" character as the field separator?

I saw another suggestion elsewhere to use a WHILE loop instead of a FOR to avoid changing IFS default.
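
A rough sketch of that while-loop approach, assuming bash. TRAILERS here is a made-up two-row stand-in for the xml output, and TITLES is a hypothetical array used only to show what was read:

```shell
#!/bin/bash
# Hypothetical stand-in for the xml sel output: one row per line.
TRAILERS='1;fieldAngels & Demons;field2009-01-13
2;fieldAstro Boy;field2009-01-12'

TITLES=()
# "IFS=" applies to this read only and preserves leading/trailing whitespace;
# -r stops read from treating backslashes as escape characters.
while IFS= read -r MOVIE; do
    TITLE=$(printf '%s\n' "$MOVIE" | awk 'BEGIN { FS = ";field" } { print $2 }')
    TITLES+=("$TITLE")
    echo "Title: $TITLE"
done <<< "$TRAILERS"
```

Because read consumes one line at a time, spaces inside a title never split the row, and the global IFS is left alone.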


EDIT: Setting it like this

Code:
IFS=$'\n'


Seems to work.
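
A small sketch of why the two spellings differ, assuming bash: inside ordinary quotes "\n" is just a backslash followed by the letter n, while $'\n' is ANSI-C quoting and yields a real newline. ROWS and count_words are hypothetical names for illustration:

```shell
#!/bin/bash
# Hypothetical two-row sample; note the letter "n" in the first title.
ROWS='Angels & Demons
Astro Boy'

count_words() {
    # Report how many words an unquoted expansion of $ROWS produces
    # when IFS is set to the value passed as the first argument.
    local IFS=$1
    set -f            # disable globbing while the string is split
    set -- $ROWS
    set +f
    echo "$#"
}

WRONG=$(count_words '\n')     # separators are a backslash and the letter n
RIGHT=$(count_words $'\n')    # separator is an actual newline

echo "backslash-n string: $WRONG words"
echo "real newline:       $RIGHT words"
```

With the real newline the two rows come back as two words; with the backslash-n string the titles are cut at every letter n instead.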

Now I'm just doing some digging to best be able to clean the names before creating files or folders from them (no slashes, colons, gt, lt, etc..) Does anyone already have a suitable script or pointer to something somewhat universal for this?


Edited by hybrid8 (13/01/2009 20:45)
_________________________
Bruno
Twisted Melon : Fine Mac OS Software

Top
#318090 - 13/01/2009 21:06 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: hybrid8]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14482
Loc: Canada
Awk can do that rather trivially:

echo "anythingyouwant234234" | awk '{gsub("[^a-z]","");print}'

Just add your list of permitted characters/ranges into the "a-z" regular expression. It's much easier to list permitted stuff than to try and enumerate all of the forbidden characters.

Cheers

Top
#318091 - 13/01/2009 21:09 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: mlord]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14482
Loc: Canada
Originally Posted By: mlord
Awk can do that rather trivially:

echo "anythingyouwant234234" | awk '{gsub("[^a-z]","");print}'


So, for example, from a shell script one could use this sequence:
Code:
NASTYNAME="whatever.. "
GOODNAME=`echo "$NASTYNAME" | awk '{gsub("[^- a-zA-Z0-9_$+]*","");print}'`


EDIT: fixed some issues above now

That's probably a good start at it.


Edited by mlord (13/01/2009 21:12)

Top
#318092 - 13/01/2009 21:11 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: mlord]
hybrid8
carpal tunnel

Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
Thanks Mark.

How about having SOME forbidden characters so that I can replace them with specific alternatives? I suppose I could do that before passing the result to the awk example you posted (making sure to include whatever my alternatives are in the list of allowed characters).

In your example, anything not defined in the substitution is replaced with "" correct?

So if I create another command and I omit the ^ which is a NOT if I recall, I can include a list of characters to be replaced.
_________________________
Bruno
Twisted Melon : Fine Mac OS Software

Top
#318093 - 13/01/2009 21:13 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: hybrid8]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14482
Loc: Canada
Sure thing.. just start with the corrected version I just fixed.

Top
#318094 - 13/01/2009 21:15 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: mlord]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14482
Loc: Canada
You can add more gsub calls inside the single awk invocation. For example, this version replaces all spaces with underscores, and then does the regular vetting of the result:

Code:
NASTYNAME="whatever.. "
GOODNAME=`echo "$NASTYNAME" | awk '{gsub(" ","_"); gsub("[^- a-zA-Z0-9_$+]*","");print}'`


EDIT: also note that, if a literal dash - character is wanted inside the [] expression, it has to come FIRST (immediately after the negation ^ character if present) or LAST; anywhere else it can be read as part of a range.
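
A quick way to check the dash placement, as a sketch with a made-up input string. POSIX bracket expressions also accept the dash in the last position, so both forms are shown:

```shell
#!/bin/bash
INPUT='sci-fi_2009!'

# Dash placed first (right after the ^): treated as a literal dash.
FIRST=$(printf '%s\n' "$INPUT" | awk '{ gsub(/[^-a-z0-9]/, ""); print }')

# Dash placed last: also literal in a POSIX bracket expression.
LAST=$(printf '%s\n' "$INPUT" | awk '{ gsub(/[^a-z0-9-]/, ""); print }')

echo "dash first: $FIRST"
echo "dash last:  $LAST"
```

Both calls keep the dash and drop the underscore and the exclamation mark.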


Edited by mlord (13/01/2009 21:19)

Top
#318095 - 13/01/2009 21:18 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: mlord]
hybrid8
carpal tunnel

Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
Ok, great. I should be able to get what I need with this. Just have to figure out how to extend the ranges in the last sub to include more valid characters like brackets, etc...

If I want to include a quote can I escape it with a backslash? Like so ' \" '

And will I be able to easily include a single quote (actually a normal ascii apostrophe) with "'" ?

How about allowing square brackets? Need to be escaped I suppose?

Lastly, is the space after the dash you mentioned there to delimit that dash from the rest of the characters? So if I want to include a space character as allowed, can I add it anywhere within the [] ?



Edited by hybrid8 (13/01/2009 21:23)
_________________________
Bruno
Twisted Melon : Fine Mac OS Software

Top
#318096 - 13/01/2009 21:22 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: hybrid8]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14482
Loc: Canada
Originally Posted By: hybrid8
Ok, great. I should be able to get what I need with this.
Just have to figure out how to extend the ranges in the
last sub to include more valid characters like brackets, etc...

If I want to include a quote can I escape it with a backslash? Like so ' \" '

Yeah, except it can get very messy and confusing because the shell itself
may also try to interpret some things like that before passing
the strings to awk. So some double escapes might be needed.

It's because of that fuss, that I normally would just do
the whole script as an awk script rather than a bash script.
No double escaping needed, and a heck of a lot less confusing.

Awk is kinda nice for this stuff, with its C-like control structures
and free typing of things.
But difficult to remember if one only uses it once a year or less. :)

I use it weekly here -- my fav programming language!

Cheers


Edited by mlord (13/01/2009 21:34)

Top
#318097 - 13/01/2009 21:30 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: mlord]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14482
Loc: Canada
For example, one could separate out the filename sanity stuff
into its own script in a separate file, like this:

EDIT: added some html escapes for fun.
EDIT: fixed the ordering of a few things.


Code:
#!/usr/bin/gawk -f
{
        ## spaces to underscores:
        gsub(" ","_")

        ## some html escapes:
        gsub("&gt;",">")
        gsub("&lt;","<")
        gsub("&quot;","\"")
        gsub("&amp;","\\&")

        ## square brackets into round brackets:
        gsub("\\[","(")
        gsub("\\]",")")

        ## double-quotes into apostrophes:
        gsub("\"","'")

        ## sanitize the rest:
        gsub("[^- 'a-zA-Z0-9_$+&<>]*","")

        ## dump it to stdout
        print
}


Which could be saved as sanitize.awk, and then be used like this:

Code:
NASTYNAME="whatever.. "
GOODNAME=`echo "$NASTYNAME" | sanitize.awk`
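
Calling sanitize.awk by bare name only works if the file is executable and sits in a directory on the PATH. A sketch of an alternative that avoids both requirements by passing the file to awk with -f. The scratch directory and the cut-down two-rule script are assumptions for illustration, not the full sanitizer above:

```shell
#!/bin/bash
# Assumption: a scratch directory stands in for wherever sanitize.awk is kept.
WORKDIR=$(mktemp -d)
cat > "$WORKDIR/sanitize.awk" <<'EOF'
{
    gsub(" ", "_")                  # spaces to underscores
    gsub("[^-a-zA-Z0-9_]", "")      # drop everything not permitted
    print
}
EOF

NASTYNAME='What / ever?'
# awk -f needs no execute bit, no PATH entry and no matching #! line.
GOODNAME=$(printf '%s\n' "$NASTYNAME" | awk -f "$WORKDIR/sanitize.awk")
echo "$GOODNAME"
rm -rf "$WORKDIR"
```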



Edited by mlord (13/01/2009 21:46)

Top
#318098 - 13/01/2009 21:48 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: mlord]
hybrid8
carpal tunnel

Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
What's up with the double escaping on the square brackets?

I think this is the solution I'll use - the cleaner as its own file.

Some additional q's... Do I need to escape out smart quotes (singles and doubles) or should I instead specify them some other way? How about forward slash? I'm assuming backward slash just needs a single escape like so \\.

I'd like to convert the slashes to dashes, so I need to account for them, otherwise I wouldn't bother and just leave it to the last sub to drop them.

Damn, there's just a boatload of other characters I want to allow as well. :)

Can the following be specified plainly (without escaping)... period, comma, colon, semicolon, question mark and the curly brackets { - and how about the shifted characters above the numerals with the exception of asterisk and caret?

_________________________
Bruno
Twisted Melon : Fine Mac OS Software

Top
#318099 - 13/01/2009 21:57 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: hybrid8]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14482
Loc: Canada
Many of those have special meaning inside a regular expression, so I suggest you try them one at a time when in doubt.

The backslash will need to become a foursome (\\\\), and other known special characters include the dot, asterisk, dollar-sign, etc..

Here's part of the manpage:
Code:
   Regular Expressions
       Regular expressions are the extended kind found in egrep.
       They are composed of characters as follows:

       c          matches the non-metacharacter c.
       \c         matches the literal character c.
       .          matches any character including newline.
       ^          matches the beginning of a string.
       $          matches the end of a string.
       [abc...]   character list, matches any of the characters abc....
       [^abc...]  negated character list, matches any character except abc....
       r1|r2      alternation: matches either r1 or r2.
       r1r2       concatenation: matches r1, and then r2.
       r+         matches one or more r’s.
       r*         matches zero or more r’s.
       r?         matches zero or one r’s.
       (r)        grouping: matches r.
       r{n}
       r{n,}
       r{n,m}     One or two numbers inside braces denote an interval expression.  If there is one number
                  in  the  braces,  the preceding regular expression r is repeated n times.  If there are
                  two numbers separated by a comma, r is repeated n to m times.  If there is  one  number
                  followed by a comma, then r is repeated at least n times.
                  Interval expressions are only available if either --posix or --re-interval is specified
                  on the command line.

       \y         matches the empty string at either the beginning or the end of a word.

       \B         matches the empty string within a word.

       \<         matches the empty string at the beginning of a word.

       \>         matches the empty string at the end of a word.

       \w         matches any word-constituent character (letter, digit, or underscore).

       \W         matches any character that is not word-constituent.

       \`         matches the empty string at the beginning of a buffer (string).

       \'         matches the empty string at the end of a buffer.

       The escape sequences that are valid in string constants (see below)  are  also  valid  in  regular
       expressions.


Edited by mlord (13/01/2009 21:59)

Top
#318100 - 13/01/2009 22:06 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: mlord]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14482
Loc: Canada
Here are some more examples:
Code:
        ## backquotes become apostrophes:
        gsub("`","'")

        ## backslashes become dashes:
        gsub("\\\\","-")

        ## slashes, question marks, carets, dollar signs become underscores:
        gsub("[/?^$]","_")

That last one above shows one way to deal with characters you aren't sure about: enclose them inside square brackets and they are no longer special (except for backslashes, dashes, or a leading caret).

Top
#318103 - 13/01/2009 22:46 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: mlord]
hybrid8
carpal tunnel

Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
Ok, here's what I have so far (I have yet to test this):

Code:
#!/usr/bin/gawk -f
{
        ## some html escapes:
        gsub("&gt;",">")
        gsub("&lt;","<")
        gsub("&quot;","\"")
        gsub("&amp;","\\&")

        ## replace fancy "smart" quotes with straight equivalents
        gsub("’","'")
        gsub("‘","'")
        gsub("“","\"")
        gsub("”","\"")

        ## backquote to apostrophe
        gsub("`","'")

        ## double quote to apostrophe
        gsub("\"","'")

        ## selected illegal filename characters replaced by alternates (other illegal characters are dropped later)
        gsub(">",")")
        gsub("<","(")
        gsub("[:]"," - ")
        gsub("[/]","-")

        ## backslash to dash
        gsub("\\\\","-")
	
        ## double space to single space:
        gsub("  "," ")


        ## sanitize the rest:
        gsub("[^- 'a-zA-Z0-9 _$+&={}\\[\\]()%@!;,.]*","")

        ## dump it to stdout
        print
}


What's an easy way to strip leading and trailing whitespace? That's about all that's left to do (just in case, but strictly for beautifying).
_________________________
Bruno
Twisted Melon : Fine Mac OS Software

Top
#318104 - 13/01/2009 23:02 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: hybrid8]
hybrid8
carpal tunnel

Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
Ok, it seems to work except that the smart quotes stuff I have isn't actually matching what's in the source. In the cygwin shell the offending characters come up as "â?T" which is sort of meaningless. :)

Seems like a case of UTF characters...

If I pipe the output to a file and then open it in a UTF-capable text editor on my Mac then they come up as the normal smart characters.

If I save the awk file as UTF8 then it breaks when piped from the batch file.

Is there a proper way to be able to use UTF8 in bash and awk?

This of course also reminds me that I have to include accented characters as valid. I should have known this was going to get hairier... ;)
_________________________
Bruno
Twisted Melon : Fine Mac OS Software

Top
#318105 - 13/01/2009 23:03 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: hybrid8]
wfaulk
carpal tunnel

Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
gsub("^[[:blank:]]*", "")
gsub("[[:blank:]]*$", "")

I hate POSIX regexes.
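
If the trimming is easier to do on the bash side, a sketch using parameter expansion rather than awk. trim is a hypothetical helper name, and this swaps in a different technique from the gsub calls above:

```shell
#!/bin/bash
# trim is a hypothetical helper; it prints its argument without
# leading or trailing whitespace.
trim() {
    local s=$1
    # ${s%%[![:space:]]*} is the run of leading whitespace; strip it from the front.
    s="${s#"${s%%[![:space:]]*}"}"
    # ${s##*[![:space:]]} is the run of trailing whitespace; strip it from the end.
    s="${s%"${s##*[![:space:]]}"}"
    printf '%s\n' "$s"
}

CLEAN=$(trim "   Astro Boy   ")
echo "[$CLEAN]"
```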
_________________________
Bitt Faulk

Top
#318106 - 13/01/2009 23:09 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: hybrid8]
wfaulk
carpal tunnel

Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
Originally Posted By: hybrid8
Is there a proper way to be able to use UTF8 in bash and awk?

HAHAHAHAHAHA

Uh, maybe. If your version of gawk is recent enough and has the right support built in, and you can set your LC_ALL and/or LANG environment variables to "en_US.UTF-8" (or something similar to that; your politics might require you to use "en_CA.UTF-8"), you might get it to work.

Or you could just use perl (or Tcl or Ruby or Python or Forth or Haskell or whatever your pet language might be) instead.


Edited by wfaulk (13/01/2009 23:10)
Edit Reason: americentrism
_________________________
Bitt Faulk

Top
#318107 - 13/01/2009 23:10 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: hybrid8]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14482
Loc: Canada
Quote:
Is there a proper way to be able to use UTF8 in bash and awk?

Dunno. But if they're single-byte characters, then find their hexcodes and use:
Code:
   String Constants
       String constants in AWK are sequences of characters enclosed between double  quotes  (").   Within
       strings, certain escape sequences are recognized, as in C.  These are:
       \\   A literal backslash.
       \a   The “alert” character; usually the ASCII BEL character.
       \b   backspace.
       \f   form-feed.
       \n   newline.
       \r   carriage return.
       \t   horizontal tab.
       \v   vertical tab.
       \xhex digits
            The  character  represented by the string of hexadecimal digits following the \x.  As in ANSI
            C, all following hexadecimal digits are considered part of the escape sequence.   (This  feature
            should tell us something about language design by committee.)  E.g., "\x1B" is the ASCII
            ESC (escape) character.
       \ddd The character represented by the 1-, 2-, or 3-digit sequence of octal digits.   E.g.,  "\033"
            is the ASCII ESC (escape) character.
       \c   The literal character c.
       The  escape  sequences may also be used inside constant regular expressions (e.g., /[ \t\f\n\r\v]/
       matches whitespace characters).
       In compatibility mode, the characters represented by octal and hexadecimal  escape  sequences  are
       treated  literally  when  used  in  regular  expression constants.  Thus, /a\52b/ is equivalent to
       /a\*b/.
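
The smart quotes are multi-byte in UTF-8, so one escape per character is not enough, but the same idea can be sketched byte by byte. This assumes the feed is UTF-8 and forces the C locale so awk works on raw bytes. Octal escapes are used because they stop after three digits, unlike \x as described above:

```shell
#!/bin/bash
# Build a test title containing a UTF-8 right single quote (bytes E2 80 99).
TITLE=$(printf 'It\342\200\231s Complicated')

# LC_ALL=C makes awk treat the input as plain bytes. Each smart quote is
# spelled out as its three-byte UTF-8 sequence in octal.
CLEAN=$(printf '%s\n' "$TITLE" | LC_ALL=C awk '{
    gsub("\342\200\230", "\047")   # U+2018 left single quote   to apostrophe
    gsub("\342\200\231", "\047")   # U+2019 right single quote  to apostrophe
    gsub("\342\200\234", "\042")   # U+201C left double quote   to straight quote
    gsub("\342\200\235", "\042")   # U+201D right double quote  to straight quote
    print
}')
echo "$CLEAN"
```

Since the awk file then contains only ASCII, it does not need to be saved as UTF-8.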



Top
#318108 - 13/01/2009 23:12 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: wfaulk]
hybrid8
carpal tunnel

Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
Bitt, is your example supposed to say "blank" ?
_________________________
Bruno
Twisted Melon : Fine Mac OS Software

Top
#318109 - 13/01/2009 23:12 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: hybrid8]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14482
Loc: Canada
Quote:
gsub("[^- 'a-zA-Z0-9 _$+&={}\\[\\]()%@!;,.]*","")

Not sure about that one -- putting square brackets inside square brackets is ambiguous.
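
One way around the ambiguity, as a sketch: POSIX bracket expressions treat a "]" as literal when it is the first character in the list (right after the ^ if negated), and "[" can then go anywhere, so no backslashes are needed. The input string is made up:

```shell
#!/bin/bash
INPUT='Title [HD] {2009} <x>'

# "]" comes first in the negated list, so it is literal; "[" follows it.
KEPT=$(printf '%s\n' "$INPUT" | awk '{ gsub(/[^][ a-zA-Z0-9{}]/, ""); print }')
echo "$KEPT"
```

Here the square and curly brackets survive while the angle brackets are dropped.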

Cheers

Top
#318110 - 13/01/2009 23:13 Re: bash scripting (xml parsing) help... (mainly awk and sed) [Re: hybrid8]
mlord
carpal tunnel

Registered: 29/08/2000
Posts: 14482
Loc: Canada
Originally Posted By: hybrid8
Bitt, is your example supposed to say "blank" ?

Yes. But you could do it with real blanks, tabs, and newlines if you really wanted to.
Code:
       [:space:]  Space characters (such as space, tab, and formfeed, to name a few).


Edited by mlord (13/01/2009 23:14)
