#318069 - 13/01/2009 18:03
bash scripting (xml parsing) help... (mainly awk and sed)
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
I wanted a solution to grab trailers from Apple so that I can feed them into my SageTV media center and browse them within its library. Since I haven't been able to find a ready-made solution, I thought I'd need to cook something up myself. But I managed to find a bash script written by someone who uses a different media center program. It works, but it did a few things I didn't like to the filenames of the trailers, so I've been modifying it. I'm running the script from cygwin on a Windows box, so I don't have the full set of tools that might otherwise be installed on something like my web server (or I would likely have tried redoing the whole thing in PHP).
The script is below (I've commented out some functional lines to allow me to test the values of some variables, which is the part I'm having issues with right now). Here's also a link to the original: http://forum.team-mediaportal.com/plugins-47/mytrailers-42622/index11.html#post291349
Basically you set a path to store a DB which keeps track of what's already been downloaded and another path to store the files. You set a parameter which tells the script whether to try and get 1080 versions. It reads in the XML for Apple's trailer RSS feeds and parses out the filenames and other fields.
I am trying to modify the original to create a folder for each trailer and name that folder and the trailer according to the name of the movie. That is, instead of the original filename, which doesn't have any spaces and may contain extra characters like "tlra_640w" etc.
I don't know very much about how to use awk or sed, nor much about scripting with bash; those are the reasons I'm posting.
The main issue at the moment is being able to parse the fields grabbed from the XML. They're created in TRAILERS and used to have a semicolon between them. I've changed this to a ";field" to see if I can pick out this field separator from any other legitimate use of a semicolon within the data.
But it's still failing to get the name "Angels &amp; Demons", which is the first movie name with a space and a semicolon within it. We'll move to the subject of re-encoding the &amp; and similar later. This line specifically:
MOVIETITLE=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $2 }'`
Is only bringing back the word "Angels" from the above. Probably failing because of the space, but I don't know how to make it bring back its results enclosed in quotes.
#!/bin/bash
GET1080p=0
GETPOSTER=1
SAVEPATH="v:/Movies/zzztrailertest/"
DLDBPATH="d:/AppleTrailers/"
FEEDS="http://www.apple.com/trailers/home/xml/current_720p.xml http://www.apple.com/trailers/home/xml/current.xml"
tail -5000 $DLDBPATH.downloaded.db > $DLDBPATH.downloaded.db.tmp
mv $DLDBPATH.downloaded.db.tmp $DLDBPATH.downloaded.db
for FEEDURL in $FEEDS; do
    TRAILERS=`xml sel --net -D -T -t -m "/records/movieinfo"\
        -v "@id" -o ";field"\
        -v "info/title" -o ";field"\
        -v "info/postdate" -o ";field"\
        -v "preview/large" -o ";field"\
        -v "poster/xlarge"\
        -n $FEEDURL`
    for MOVIE in $TRAILERS; do
        MOVIEID=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $1 }'`
        MOVIETITLE=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $2 }'`
        MOVIETITLEFILE=`echo $MOVIETITLE |sed 's/.*\///'`
        #temporary output to show grabbed title
        echo "=======##### Title: $MOVIETITLE -----------------"
        POSTDATE=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $3 }'`
        BEXTENSION="[Trailer].mov"
        PREVIEW=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $4 }'`
        PREVIEWFILE=`echo $PREVIEW |sed 's/.*\///' |sed 's/\.mov$/.hdmov/g'`
        NEWPREVIEWNAME="$MOVIETITLE $BEXTENSION"
        PREVIEW1080p=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $4 }' |sed 's/a720p\.mov$/h1080p.mov/g'`
        PREVIEWFILE1080p=`echo $PREVIEW1080p |sed 's/.*\///' |sed 's/\.mov$/.hdmov/g'`
        NEWPREVIEWNAME1080p="$MOVIETITLE $PREVIEWFILE1080p"
        POSTER=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $5 }'`
        NEWPOSTERNAME="folder.jpg"
        MOVIESAVEPATH="$SAVEPATH$MOVIETITLE/"
        #if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
        # mkdir $MOVIESAVEPATH
        #fi
        if [ "$GET1080p" -eq "1" ]; then
            if `echo $FEEDURL | grep -q 720p`; then
                if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
                    # wget -c -O "$MOVIESAVEPATH$NEWPREVIEWNAME1080p" $PREVIEW1080p; PREVIEWOUT1080p=$?
                    if [ $PREVIEWOUT1080p -eq 0 ]; then
                        echo "###$MOVIEID.PREVIEW $NEWPREVIEWNAME1080p" >> $DLDBPATH.downloaded.db
                    else
                        echo "##### ID:$MOVIEID URL:$PREVIEW1080p FAILED -- TRYING ORIGINAL 720p URL NEXT"
                    fi
                fi
            fi
        fi
        if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
            # wget -c -O "$MOVIESAVEPATH$NEWPREVIEWNAME" $PREVIEW; PREVIEWOUT=$?
            if [ $PREVIEWOUT -eq 0 ]; then
                echo "###$MOVIEID.PREVIEW $NEWPREVIEWNAME" >> $DLDBPATH.downloaded.db
            else
                echo "##### ID:$MOVIEID URL:$PREVIEW FAILED -- RETRY NEXT RUN"
            fi
        else
            echo "##### ID:$MOVIEID NAME:$NEWPREVIEWNAME MARKED DONE -- SKIPPING"
        fi
        if [ "$GETPOSTER" -eq "1" ]; then
            if ! grep -q "###$MOVIEID.POSTER" $DLDBPATH.downloaded.db; then
                # wget -c -O "$MOVIESAVEPATH$NEWPOSTERNAME" $POSTER; POSTEROUT=$?
                if [ $POSTEROUT -eq 0 ]; then
                    echo "###$MOVIEID.POSTER $NEWPOSTERNAME" >> $DLDBPATH.downloaded.db
                else
                    echo "##### $ID:$MOVIEID URL:$POSTER FAILED -- RETRY NEXT RUN"
                fi
            else
                echo "##### ID:$MOVIEID NAME:$NEWPOSTERNAME MARKED DONE -- SKIPPING"
            fi
        fi
    done
done
Sample XML file (this is what it's parsing when it hits Apple's feed): http://mypocket.com/current_720p.xml.zip
As mentioned, I'll also need to clean up the results by re-encoding things like &amp; back to "&", and I have no idea how to do that from this script. In PHP I can decode those using a function call to html_entity_decode().
One of the remaining things to look at is that I don't know if the usage of sed that's specified when grabbing the name is sufficient. Because I'm using the movie name to create a folder and a file, I can't have things like colons, slashes or other invalid characters in it. I can't guess what will be coming up in future movie names, so it's possible that invalid characters may be present either originally as plain text or from decoding any html entities (such as greater-than or less-than).
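On the entity question, here is a minimal sketch in the same shell-plus-sed style as the script. The function name decode_entities is made up for illustration, and it only covers the five predefined XML entities; anything else passes through untouched. &amp; is decoded last so that input such as &amp;lt; is not decoded twice.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): decode the
# five predefined XML entities in whatever arrives on stdin.
decode_entities() {
    sed -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e "s/&apos;/'/g" \
        -e 's/&amp;/\&/g'    # last, so "&amp;lt;" ends up as "&lt;"
}

# Possible use inside the loop:
#   MOVIETITLE=$(printf '%s\n' "$MOVIETITLE" | decode_entities)
```

Decoding can introduce characters such as < and > that are not valid in Windows file names, so this step would need to run before any filename clean-up.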
#318070 - 13/01/2009 18:05
Re: bash scripting (xml parsing) help...
[Re: hybrid8]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31607
Loc: Seattle, WA
|
Brilliant idea. I'd love to have the current apple trailers page listed in my DVR's menu. Hm, my new DVR is networkable, maybe I can set that up somehow too.
#318071 - 13/01/2009 18:08
Re: bash scripting (xml parsing) help...
[Re: tfabris]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
While this script can be modified to make some type of textual listing with links to individual trailers, I'm actually trying to download them all. This will run daily (cron/schedule) to keep me up to date. I'll eventually come up with something to allow me to expire or remove trailers. Though SageTV does allow me to perform deletions right from its UI.
#318076 - 13/01/2009 18:24
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
That one line seems to work correctly for me. What does $MOVIE contain at that point?
_________________________
Bitt Faulk
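A small sketch of why that awk line can work in isolation: when the whole row reaches awk as one quoted string, the multi-character separator ";field" splits it as intended, spaces and ampersand included. The helper name get_field and the sample row in the comment are assumptions for illustration, not values from the real feed.

```shell
#!/bin/bash
# Hypothetical helper: print field number $2 of the row passed in $1,
# using the ";field" separator from the script. Quoting "$1" keeps the
# shell from splitting the row at its spaces before awk sees it.
get_field() {
    printf '%s\n' "$1" | awk -F ';field' -v n="$2" '{ print $n }'
}

# Made-up sample row:
#   ROW='123;fieldAngels & Demons;field2009-01-13;fieldhttp://example.com/a.mov;fieldhttp://example.com/p.jpg'
#   get_field "$ROW" 2
```

If this returns the full title for a hand-built row, the truncation has to happen before awk runs, which points at how $MOVIE is being filled.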
#318078 - 13/01/2009 18:46
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
I'll have to dump out the value of $MOVIE, but $MOVIETITLE contains only 'Angels' whereas I'd like it to contain 'Angels & Demons'
#318079 - 13/01/2009 19:02
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Ok, $MOVIE doesn't contain the necessary information, so the problem is before the awk that sets $MOVIETITLE.
$MOVIE immediately after the beginning of the for loop contains only the ID and the movie name up to the first word, Angels.
The next time it passes through the loop it continues from where it left off, producing invalid results where the $MOVIE var contains only an ampersand.
$TRAILERS does contain everything as I expected it to be: the ID, the full movie title (which already appears to have the ampersand converted from an html entity), the date and the rest of the info, for every movie in the XML file.
I'm at least as stuck as I was originally though, having no idea why $MOVIE doesn't contain what I'd expect it to (the full contents of a single "row").
#318080 - 13/01/2009 19:09
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Hmmm.. I'm the one that added the movie title to the collection of fields with this line:
-v "info/title" -o ";field"\
Then I obviously had to adjust the position of the fields extracted. Before I made that change, none of the fields captured from the XML contained spaces. Maybe I should encode the space so that it isn't actually a space? It seems like the space is what's causing the termination of the MOVIE variable in the FOR.
Edited by hybrid8 (13/01/2009 19:11)
#318081 - 13/01/2009 19:12
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 13/07/2000
Posts: 4181
Loc: Cambridge, England
|
> Maybe I should encode the space so that it isn't actually a space? It seems like the space is what's causing the termination of the MOVIE variable in the FOR.
Yes, by default "for" splits words at any whitespace (space, tab, newline). To stop it doing that -- to make it split words at newlines only -- set the shell variable IFS to "\n".
$ cat > hybrid8.txt
a b
c
$ for i in `cat hybrid8.txt` ; do echo $i ; done
a
b
c
$ IFS="\n"
$ for i in `cat hybrid8.txt` ; do echo $i ; done
a b
c
$
Peter
#318082 - 13/01/2009 19:13
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Oh. It probably has nothing to do with the ampersand and everything to do with the space.
"for ... in" splits on whitespace, not newline. Before I spend a lot of time going down this road, delete the "Angels & Demons" line and see if it breaks similarly on "Astro Boy".
_________________________
Bitt Faulk
#318083 - 13/01/2009 19:14
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: peter]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
_________________________
Bitt Faulk
#318085 - 13/01/2009 19:39
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Can anyone offer a suggestion as to why setting IFS='\n' causes it to use the "n" character as the field separator? I saw another suggestion elsewhere to use a WHILE loop instead of a FOR to avoid changing the IFS default.
EDIT: Setting it like this Seems to work.
Now I'm just doing some digging to best be able to clean the names before creating files or folders from them (no slashes, colons, gt, lt, etc.). Does anyone already have a suitable script or a pointer to something somewhat universal for this?
Edited by hybrid8 (13/01/2009 20:45)
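On the "n" question: in bash, \n inside single or double quotes is just a backslash followed by the letter n, so both characters become separators. A real newline can be written as $'\n'. The sketch below shows the WHILE alternative, which leaves IFS alone; the function name list_titles is an assumption for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the title (field 2) of every row on stdin.
# "IFS= read -r" reads one whole line at a time, without trimming
# whitespace and without interpreting backslashes.
list_titles() {
    local row
    while IFS= read -r row; do
        printf '%s\n' "$row" | awk -F ';field' '{ print $2 }'
    done
}

# The FOR variant needs a real newline in IFS. $'\n' produces one,
# whereas '\n' and "\n" are a backslash and an n:
#   OLDIFS=$IFS; IFS=$'\n'
#   for MOVIE in $TRAILERS; do ...; done
#   IFS=$OLDIFS
```

In the script this could be fed with something like printf '%s\n' "$TRAILERS" | list_titles, bearing in mind that a loop on the receiving end of a pipe runs in a subshell, so variables set inside it are not visible afterwards.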
#318090 - 13/01/2009 21:06
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
Awk can do that rather trivially:
echo "anythingyouwant234234" | awk '{gsub("[^a-z]","");print}'
Just add your list of permitted characters/ranges into the "a-z" regular expression. It's much easier to list permitted stuff than to try and enumerate all of the forbidden characters.
Cheers
#318091 - 13/01/2009 21:09
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
> Awk can do that rather trivially:
> echo "anythingyouwant234234" | awk '{gsub("[^a-z]","");print}'
So, for example, from a shell script one could use this sequence:
NASTYNAME="whatever.. "
GOODNAME=`echo "$NASTYNAME" | awk '{gsub("[^- a-zA-Z0-9_$+]*","");print}'`
EDIT: fixed some issues above now.
That's probably a good start at it.
Edited by mlord (13/01/2009 21:12)
#318092 - 13/01/2009 21:11
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Thanks Mark.
How about having SOME forbidden characters so that I can replace them with specific alternatives? I suppose I could do that before passing the result to the awk example you posted (making sure to include whatever my alternatives are in the list of allowed characters).
In your example, anything not defined in the substitution is replaced with "" correct?
So if I create another command and I omit the ^ which is a NOT if I recall, I can include a list of characters to be replaced.
#318093 - 13/01/2009 21:13
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
Sure thing.. just start with the corrected version I just fixed.
#318094 - 13/01/2009 21:15
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
You can add more gsub calls inside the single awk invocation. For example, this version replaces all spaces with underscores, and then does the regular vetting of the result:
NASTYNAME="whatever.. "
GOODNAME=`echo "$NASTYNAME" | awk '{gsub(" ","_"); gsub("[^- a-zA-Z0-9_$+]*","");print}'`
EDIT: also note that, if a dash - character is wanted inside the [] expression, it has to be FIRST, or immediately after the negation ^ character if present.
Edited by mlord (13/01/2009 21:19)
#318095 - 13/01/2009 21:18
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Ok, great. I should be able to get what I need with this. Just have to figure out how to extend the ranges in the last sub to include more valid characters like brackets, etc...
If I want to include a quote can I escape it with a backslash? Like so ' \" '
And will I be able to easily include a single quote (actually a normal ascii apostrophe) with "'" ?
How about allowing square brackets? Need to be escaped I suppose?
Lastly, is the space after the dash you mentioned there to delimit that dash from the rest of the characters? So if I want to include a space character as allowed, can I add it anywhere within the [] ?
Edited by hybrid8 (13/01/2009 21:23)
#318096 - 13/01/2009 21:22
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
> Ok, great. I should be able to get what I need with this. Just have to figure out how to extend the ranges in the last sub to include more valid characters like brackets, etc...
> If I want to include a quote can I escape it with a backslash? Like so ' \" '
Yeah, except it can get very messy and confusing because the shell itself may also try to interpret some things like that before passing the strings to awk. So some double escapes might be needed.
It's because of that fuss that I normally would just do the whole script as an awk script rather than a bash script. No double escaping needed, and a heck of a lot less confusing.
Awk is kinda nice for this stuff, with its C-like control structures and free typing of things. But difficult to remember if one only uses it once a year or less. I use it weekly here -- my fav programming language!
Cheers
Edited by mlord (13/01/2009 21:34)
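For an inline awk program, one way to sidestep the shell quoting fuss is to spell the quote characters as octal escapes, which awk string constants understand: \042 is the double quote and \047 is the apostrophe. A minimal sketch, with a made-up function name:

```shell
#!/bin/bash
# Hypothetical helper: turn every double quote on stdin into an
# apostrophe. The awk program sits inside single quotes for the shell,
# so neither quote character is typed literally; \042 is " and \047 is '.
quotes_to_apostrophes() {
    awk '{ gsub("\042", "\047"); print }'
}

# Example: printf '%s\n' 'say "hi"' | quotes_to_apostrophes
```

The same escapes can be used inside a bracket list, so an apostrophe can be added to an allowed-characters set without closing the shell's single-quoted string.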
#318097 - 13/01/2009 21:30
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
For example, one could separate out the filename sanity stuff into its own script in a separate file, like this:
EDIT: added some html escapes for fun.
EDIT: fixed the ordering of a few things.
#!/usr/bin/gawk -f
{
    ## spaces to underscores:
    gsub(" ","_")
    ## some html escapes:
    gsub("&gt;",">")
    gsub("&lt;","<")
    gsub("&quot;","\"")
    gsub("&amp;","\\&")
    ## square brackets into round brackets:
    gsub("\\[","(")
    gsub("\\]",")")
    ## double-quotes into apostrophes:
    gsub("\"","'")
    ## sanitize the rest:
    gsub("[^- 'a-zA-Z0-9_$+&<>]*","")
    ## dump it to stdout
    print
}
Which could be saved as sanitize.awk, and then be used like this:
NASTYNAME="whatever.. "
GOODNAME=`echo "$NASTYNAME" | sanitize.awk`
Edited by mlord (13/01/2009 21:46)
#318098 - 13/01/2009 21:48
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
What's up with the double escaping on the square brackets?
I think this is the solution I'll use - the cleaner as its own file.
Some additional q's...
Do I need to escape out smart quotes (singles and doubles) or should I instead specify them some other way?
How about forward slash? I'm assuming backward slash just needs a single escape like so \\. I'd like to convert the slashes to dashes, so I need to account for them; otherwise I wouldn't bother and would just leave it to the last sub to drop them.
Damn, there's just a boatload of other characters I want to allow as well. Can the following be specified plainly (without escaping)... period, comma, colon, semicolon, question mark and the curly brackets? And how about the shifted characters above the numerals, with the exception of the asterisk and caret?
#318099 - 13/01/2009 21:57
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
Many of those have special meaning inside a regular expression, so I suggest you try them one at a time when in doubt. The backslash will need to become a foursome (\\\\), and other known special characters include the dot, asterisk, dollar-sign, etc.
Here's part of the manpage:
Regular Expressions
    Regular expressions are the extended kind found in egrep.
    They are composed of characters as follows:
    c          matches the non-metacharacter c.
    \c         matches the literal character c.
    .          matches any character including newline.
    ^          matches the beginning of a string.
    $          matches the end of a string.
    [abc...]   character list, matches any of the characters abc....
    [^abc...]  negated character list, matches any character except abc....
    r1|r2      alternation: matches either r1 or r2.
    r1r2       concatenation: matches r1, and then r2.
    r+         matches one or more r's.
    r*         matches zero or more r's.
    r?         matches zero or one r's.
    (r)        grouping: matches r.
    r{n}
    r{n,}
    r{n,m}     One or two numbers inside braces denote an interval expression. If there is one number
               in the braces, the preceding regular expression r is repeated n times. If there are
               two numbers separated by a comma, r is repeated n to m times. If there is one number
               followed by a comma, then r is repeated at least n times.
               Interval expressions are only available if either --posix or --re-interval is specified
               on the command line.
    \y         matches the empty string at either the beginning or the end of a word.
    \B         matches the empty string within a word.
    \<         matches the empty string at the beginning of a word.
    \>         matches the empty string at the end of a word.
    \w         matches any word-constituent character (letter, digit, or underscore).
    \W         matches any character that is not word-constituent.
    \`         matches the empty string at the beginning of a buffer (string).
    \'         matches the empty string at the end of a buffer.
    The escape sequences that are valid in string constants (see below) are also valid in regular
    expressions.
Edited by mlord (13/01/2009 21:59)
#318100 - 13/01/2009 22:06
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
Here are some more examples:
## backquotes become apostrophes:
gsub("`","'")
## backslashes become dashes:
gsub("\\\\","-")
## slashes, question marks, carets, dollar signs become underscores:
gsub("[/?^$]","_")
That last one above shows one way to deal with characters you aren't sure about: enclose them inside square brackets and they are no longer special (except for backslashes, dashes, a closing square bracket, or a leading caret).
#318103 - 13/01/2009 22:46
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Ok, here's what I have so far (I have yet to test this):
#!/usr/bin/gawk -f
{
    ## some html escapes:
    gsub("&gt;",">")
    gsub("&lt;","<")
    gsub("&quot;","\"")
    gsub("&amp;","\\&")
    ## replace fancy "smart" quotes with straight equivalents
    gsub("’","'")
    gsub("‘","'")
    gsub("“","\"")
    gsub("”","\"")
    ## backquote to apostrophe
    gsub("`","'")
    ## double quote to apostrophe
    gsub("\"","'")
    ## select illegal filename characters replaced by alternates (other illegal characters just dropped later)
    gsub(">",")")
    gsub("<","(")
    gsub("[:]"," - ")
    gsub("[/]","-")
    ## backslash to dash
    gsub("\\\\","-")
    ## double space to single space:
    gsub("  "," ")
    ## sanitize the rest:
    gsub("[^- 'a-zA-Z0-9 _$+&={}\\[\\]()%@!;,.]*","")
    ## dump it to stdout
    print
}
What's an easy way to strip leading and trailing whitespace? That's about all that's left to do (just in case, but strictly for beautifying).
#318104 - 13/01/2009 23:02
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Ok, it seems to work except that the smart quotes stuff I have isn't actually matching what's in the source. In the cygwin shell the offending characters come up as "â?T" which is sort of meaningless. Seems like a case of UTF characters... If I pipe the output to a file and then open it in a UTF-capable text editor on my Mac then they come up as the normal smart characters. If I save the awk file as UTF8 then it breaks when piped from the batch file. Is there a proper way to be able to use UTF8 in bash and awk? This of course also reminds me that I have to include accented characters as valid. I should have known this was going to get hairier...
#318105 - 13/01/2009 23:03
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
gsub("^[[:blank:]]*", "")
gsub("[[:blank:]]*$", "")
I hate POSIX regexes.
_________________________
Bitt Faulk
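The two substitutions above can be wrapped in a small filter for use from the bash script. This is only a sketch: the function name trim_blanks is made up, and the class is spelled out as an explicit space and tab so it does not depend on the awk in use supporting POSIX classes such as [:blank:]. Because each pattern is anchored, sub() is enough; there is at most one match per line.

```shell
#!/bin/bash
# Hypothetical helper: strip leading and trailing spaces and tabs from
# each line on stdin. Interior whitespace is left alone.
trim_blanks() {
    awk '{ sub(/^[ \t]+/, ""); sub(/[ \t]+$/, ""); print }'
}

# Example: MOVIETITLE=$(printf '%s\n' "$MOVIETITLE" | trim_blanks)
```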
#318106 - 13/01/2009 23:09
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
> Is there a proper way to be able to use UTF8 in bash and awk?
HAHAHAHAHAHA
Uh, maybe. If your version of gawk is recent enough and has the right support built in, and you can set your LC_ALL and/or LANG environment variables to "en_US.UTF-8" (or something similar to that; your politics might require you to use "en_CA.UTF-8"), you might get it to work.
Or you could just use perl (or Tcl or Ruby or Python or Forth or Haskell or whatever your pet language might be) instead.
Edited by wfaulk (13/01/2009 23:10) Edit Reason: americentrism
_________________________
Bitt Faulk
#318107 - 13/01/2009 23:10
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
> Is there a proper way to be able to use UTF8 in bash and awk?
Dunno. But if they're single-byte characters, then find their hexcodes and use:
String Constants
    String constants in AWK are sequences of characters enclosed between double quotes ("). Within
    strings, certain escape sequences are recognized, as in C. These are:
    \\     A literal backslash.
    \a     The "alert" character; usually the ASCII BEL character.
    \b     backspace.
    \f     form-feed.
    \n     newline.
    \r     carriage return.
    \t     horizontal tab.
    \v     vertical tab.
    \xhex digits
           The character represented by the string of hexadecimal digits following the \x. As in ANSI
           C, all following hexadecimal digits are considered part of the escape sequence. (This feature
           should tell us something about language design by committee.) E.g., "\x1B" is the ASCII
           ESC (escape) character.
    \ddd   The character represented by the 1-, 2-, or 3-digit sequence of octal digits. E.g., "\033"
           is the ASCII ESC (escape) character.
    \c     The literal character c.
    The escape sequences may also be used inside constant regular expressions (e.g., /[ \t\f\n\r\v]/
    matches whitespace characters).
    In compatibility mode, the characters represented by octal and hexadecimal escape sequences are
    treated literally when used in regular expression constants. Thus, /a\52b/ is equivalent to
    /a\*b/.
#318108 - 13/01/2009 23:12
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Bitt, is your example supposed to say "blank" ?
#318109 - 13/01/2009 23:12
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
> gsub("[^- 'a-zA-Z0-9 _$+&={}\\[\\]()%@!;,.]*","")
Not sure about that one -- putting square brackets inside square brackets is ambiguous.
Cheers
#318110 - 13/01/2009 23:13
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
> Bitt, is your example supposed to say "blank" ?
Yes. But you could do it with real blanks, tabs, and newlines if you really wanted to.
[:space:] Space characters (such as space, tab, and formfeed, to name a few).
Edited by mlord (13/01/2009 23:14)
#318111 - 13/01/2009 23:13
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
> gsub("[^- 'a-zA-Z0-9 _$+&={}\\[\\]()%@!;,.]*","")
> Not sure about that one -- putting square brackets inside square brackets is ambiguous.
Ok, that's fair. So how do I allow them otherwise?
#318112 - 13/01/2009 23:15
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
> Bitt, is your example supposed to say "blank" ?
Yes. And if you're going to deal with Unicode, then you probably want to use more of those character classes. Assuming it will deal with Unicode, you can't assume that "[a-z]" includes all lowercase characters. What about "ö"?
_________________________
Bitt Faulk
#318113 - 13/01/2009 23:16
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
> > gsub("[^- 'a-zA-Z0-9 _$+&={}\\[\\]()%@!;,.]*","")
> > Not sure about that one -- putting square brackets inside square brackets is ambiguous.
> Ok, that's fair. So how do I allow them otherwise?
Oh, my apologies.. you already have them properly backslashed. Note that, inside a [] construct, you can simply use [ instead of \\[, but still need to do \\]
Edited by mlord (13/01/2009 23:18)
#318114 - 13/01/2009 23:17
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
> To include one of the characters ‘\’, ‘]’, ‘-’, or ‘^’ in a character list, put a ‘\’ in front of it.
Those are the only oddball characters in a character class. Note that you can use '[' unescaped.
_________________________
Bitt Faulk
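Besides backslashing, POSIX bracket expressions offer a positional way to include the awkward characters: a ] placed first in the list (right after the ^ when negating) is taken literally, and a - placed last is taken literally. A minimal sketch with a made-up function name and a deliberately short allowed set:

```shell
#!/bin/bash
# Hypothetical helper: delete everything except letters, digits,
# space, underscore, dash and square brackets.
# In [^][a-zA-Z0-9 _-] the "]" right after "^" is a literal bracket,
# "[" needs no escaping, and the trailing "-" is a literal dash.
keep_brackets() {
    awk '{ gsub(/[^][a-zA-Z0-9 _-]/, ""); print }'
}

# Example: printf '%s\n' 'Movie [HD] {x}: 2' | keep_brackets
```

This avoids any backslashes, so there is nothing for the shell or awk string parsing to double up.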
#318115 - 13/01/2009 23:38
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
> Assuming it will deal with Unicode, you can't assume that "[a-z]" includes all lowercase characters. What about "ö"?
I did mention that on a post I edited on the first page of the thread: that I needed to support accented character variants. The manual page you linked isn't specific about whether [:alnum:] includes é, å, ñ, etc. It does mention that you can use an equivalence class for accented characters, but then also says that the regexp matching in awk doesn't support equivalence classes.
Then there are other characters that are part of foreign alphabets that are valid within filenames and can conceivably be used in the movie names listed on Apple's site, such as ß, œ and others.
In the future I'd like to break out extended information and full text naming into a metadata file which will be used by the application (SageTV in my case), so the filenames can be completely void of all these special cases. I'm not at that stage of integration yet and will still need to install some mods on my PVR to make use of any metadata files I create.
#318116 - 13/01/2009 23:50
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
If you need to deal in-depth with Unicode, ditch awk and get something that is designed to handle Unicode. Seriously. awk barely handles 8-bit characters, much less variable length ones.
_________________________
Bitt Faulk
#318117 - 14/01/2009 00:09
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
I've tried gsub("\u2019","'") for the curved single quote (number is the unicode hex value) but no success. I read about that on some page but my version of awk reports it's treated as a normal u, and not an escape for a unicode hex string.
I have yet to try \x and specify individual bytes.
I've verified that [:alnum:] is working for the characters that were caught by a-zA-Z0-9. But it's not working for accented characters.
I don't know if it's because the text is butchered before getting to awk or what. But awk is likely not seeing the unicode portions as multiple proper ascii characters, otherwise I'd have extra characters passing through instead of the accented ones simply being dropped. Arrgh.
#318118 - 14/01/2009 00:34
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Ok, I now have cygwin supporting UTF-8 to some extent using the following patch: http://www.okisoft.co.jp/esc/utf8-cygwin/
So I can at least see the UTF-8 characters properly in the console now. It says it will properly use UTF-8 for file IO as well. But the awk class I mentioned above is clearly dropping accented characters.
After saving the awk file itself as UTF-8 (NO BOM! which is important) I can now properly match on UTF-8 characters typed into the file, such as ’
Only the accented characters left to figure out...
Edited by hybrid8 (14/01/2009 00:40)
#318119 - 14/01/2009 01:14
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
As I mentioned, I'd love to ditch this whole thing and just write something in straight-up ansi-C which would likely be best since it wouldn't rely on external programs or interpreters like some other alternatives (though I'd need to link in some libraries). I could also go with PHP but that means installing a lot of stuff on the machine I don't otherwise have a need for. But PHP does have a lot of built-in functions to make the unicode and html entities more of a breeze.
It all also means redoing this whole thing, including the xml parsing bits which thus far I've just stolen from the existing script.
I think my best bet, since the input is relatively trustworthy, is to just filter out the characters that are not allowed and let everything else go. This is the opposite of what we've been discussing. Seems easy enough because there are only 9 characters not allowed in a Windows file name. And most of those were already included in the awk sample I posted.
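The "filter out only what's not allowed" idea can be sketched in a couple of lines. This is just an illustration, using tr rather than the gawk script from the thread; the function name clean_filename is made up, and the nine characters are the ones Windows rejects in file names.

```shell
#!/bin/bash
# Hypothetical helper: strip the nine characters Windows does not allow
# in file names:  \ / : * ? " < > |
# Everything else, including multi-byte UTF-8 characters, passes through untouched.
clean_filename() {
    printf '%s' "$1" | tr -d '\\/:*?"<>|'
}

clean_filename 'Angels & Demons: What If?'; echo
```

Deleting is the bluntest option; the script posted later in the thread substitutes look-alike characters instead, which keeps titles more legible.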
I wanted to give one last go at setting awk up for an alternate language, but no matter how much searching I've done I can't find out how to set the environment variables. Perhaps that information is so basic that no one talks about it. Lots of discussion about the variables, but I have no idea where to put them. Doesn't seem to work just stuffing them into the script.
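For what it's worth, an environment variable can be set either for a single command or for everything a script runs afterwards. A minimal sketch follows, using the C locale only because it exists everywhere; whether a UTF-8 locale name is available, and whether a given cygwin awk honors it, is an assumption that would need checking on the actual machine.

```shell
#!/bin/bash
# Option 1: set the variable for one command only (prefix form).
LC_ALL=C awk 'BEGIN { print "awk sees LC_ALL=" ENVIRON["LC_ALL"] }'

# Option 2: export it near the top of the script; every child process
# started after this line inherits it.
export LC_ALL=C
seen=$(awk 'BEGIN { print ENVIRON["LC_ALL"] }')
echo "exported value seen by awk: $seen"
```

A plain assignment without export (LC_ALL=C on a line by itself) only creates a shell variable, which child programs such as awk never see unless the variable was already exported. That may be why "stuffing them into the script" appeared to do nothing.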
#318128 - 14/01/2009 02:14
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Most likely the version of awk that you have simply doesn't have support for Unicode at all.
Virtually every interpreted language you can think of using (Perl, PHP, Python) is going to have an XML parser. Granted, that xmlstarlet program you're using is pretty handy.
_________________________
Bitt Faulk
#318136 - 14/01/2009 02:57
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Well, it's all working as well as it needs to be without changing the environment variables.
Patching cygwin for UTF-8 and saving the scripts as UTF-8 NO BOM allowed me to use UTF-8 characters. xmlstarlet seems to pick up certain HTML entities and convert them automatically too. &amp; already comes back as &.
By using gawk to replace characters, including those that are invalid in filenames, I'm left with a string that will work for what I need (file names, folder names and maintaining full legibility).
If this were something I was going to distribute then I'd go about it differently. Or if it was something I needed to host remotely, I'd definitely do it in PHP. I did find a nice PHP function that parses an XML document into a nice (and potentially large) array.
Thanks a lot for your help guys! I don't think I would have been able to get through all this without it.
#318148 - 14/01/2009 10:12
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 18/01/2000
Posts: 5685
Loc: London, UK
|
"I'm running the script from cygwin on a windows box"
You're on Windows? You're parsing XML? Use PowerShell. It supports XPath and XSLT (because it's .NET).
_________________________
-- roger
#318154 - 14/01/2009 13:07
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: Roger]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
I think since it's all working, I'll stick with BASH and avoid having to learn a completely different scripting platform. I also don't know jack about XPath and XSLT. Which brings me to one additional problem. xmlstarlet is definitely handy, but I can't seem to figure out how to easily extract particular pieces of information. Here's a sample of the originating XML (somewhat formatted):
<?xml version="1.0" encoding="utf-8"?>
<records date="Tue, 13 Jan 2009 00:51:04 -0800">
  <movieinfo id="2898">
    <info>
      <title>12</title>
      <runtime>2:00</runtime>
      <rating>PG-13</rating>
      <studio>Sony Pictures Classics</studio>
      <postdate>2008-11-04</postdate>
      <releasedate>2009-03-04</releasedate>
      <copyright>© Copyright 2009 Sony Pictures Classics</copyright>
      <director>Nikita Mikhalkov</director>
      <description>12 characters. 12 truths. The story of 12 jurors discussing a verdict to pass on an 18 year old Chechen boy whether he is guilty of 1st degree murder of his step-father — an officer of the Russian army. The film thinks aloud about today;s life, about the need to hear the next of kin and help that person before its too late.
The action of the picture unveils in one room — a gym adjusted for jury deliberations.</description>
    </info>
    <cast>
      <name>Sergei Makovetsky</name>
      <name>Nikita Mikhalkov</name>
    </cast>
    <genre>
      <name>Drama</name>
      <name>Foreign</name>
    </genre>
    <poster>
      <location>http://images.apple.com/moviesxml/s/sony/posters/12_l200811041428.jpg</location>
      <xlarge>http://images.apple.com/moviesxml/s/sony/posters/12_xl200811041428.jpg</xlarge>
    </poster>
    <preview>
      <large filesize="55098535">http://movies.apple.com/movies/sony/12/12_a720p.mov</large>
    </preview>
  </movieinfo>
  <movieinfo id="2904">
    <info>
      <title>Angels &amp; Demons</title>
      <runtime>1:10</runtime>
      <rating>Not yet rated</rating>
      <studio>Sony Pictures</studio>
      <postdate>2008-11-06</postdate>
      <releasedate>2009-05-15</releasedate>
      <copyright>© Copyright 2009 Sony Pictures</copyright>
      <director>Ron Howard</director>
      <description>The team behind the global phenomenon The Da Vinci Code returns for the highly anticipated Angels &amp; Demons, based upon the bestselling novel by Dan Brown. Tom Hanks reprises his role as Harvard religious expert Robert Langdon, who once again finds that forces with ancient roots are willing to stop at nothing, even murder, to advance their goals. Ron Howard again directs the film, which is produced by Brian Grazer, Ron Howard, and John Calley. The screenplay is by Akiva Goldsman and David Koepp. When Langdon discovers evidence of the resurgence of an ancient secret brotherhood known as the Illuminati - the most powerful underground organization in history - he also faces a deadly threat to the existence of the secret organization’s most despised enemy: the Catholic Church. When Langdon learns that the clock is ticking on an unstoppable Illuminati time bomb, he jets to Rome, where he joins forces with Vittoria Vetra, a beautiful and enigmatic Italian scientist.
Embarking on a nonstop, action-packed hunt through sealed crypts, dangerous catacombs, deserted cathedrals, and even to the heart of the most secretive vault on earth, Langdon and Vetra will follow a 400-year-old trail of ancient symbols that mark the Vatican’s only hope for survival. </description>
    </info>
    <cast>
      <name>Tom Hanks</name>
      <name>Ewan McGregor</name>
      <name>Ayelet Zurer</name>
      <name>Stellan Skarsgård</name>
      <name>Pierfrancesco Favino</name>
    </cast>
    <genre>
      <name>Drama</name>
      <name>Thriller</name>
    </genre>
    <poster>
      <location>http://images.apple.com/moviesxml/s/sony_pictures/posters/angelsdemons_l200811061144.jpg</location>
      <xlarge>http://images.apple.com/moviesxml/s/sony_pictures/posters/angelsdemons_xl200811061144.jpg</xlarge>
    </poster>
    <preview>
      <large filesize="32237184">http://movies.apple.com/movies/sony_pictures/angelsanddemons/angelsanddemons-tlr1_a720p.mov</large>
    </preview>
  </movieinfo>
</records>
As you can see I've shown the sample using only two movies to keep it short. What I'd like to do is extract a single movie and throw it into its own XML file. The output would look pretty much as above except containing only one specific movie. However, I also want to preserve the "records" every time I do this. I've already played around with some extraction and can handily extract all movies or specific elements/attributes of all movies, but I need to target in on one at a time either by matching on their ID or by specifying an integer corresponding to the nth movie (I can figure out how to get a total count as well as keep a running total in the script to reference the "current" movie)
#318168 - 14/01/2009 19:28
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Ok, I've mostly solved this myself with a bunch of overhead and definitely not very much finesse. But so far I'm getting the datapoints I need and in a format I can use to extract the final bit I want. I had to continue to use other facilities from the bash script, including awk and some hard-coded echos since I was unable to figure out how to use only xmlstarlet. Only one issue left...
SELECTMOVIE=`echo $TRAILERS | awk 'BEGIN { FS = "--DIVIDER--" } ; { print $7 }'`
How do I substitute that "$7" in the print statement with a variable? EDIT: Solved.
movieNum=7
SELECTMOVIE=`echo $TRAILERS | awk -v record=$movieNum 'BEGIN { FS = "--DIVIDER--" } ; { print $record }'`
The -v argument allows you to pick up external variables and assign them to variables within awk.
I suppose since I'm here I might also pose another question which could be useful to me. xmlstarlet has a formatting option (xml format) which takes in a file, cleans it up and spits it out to stdout. You can obviously redirect that to a file if you'd like. I'm looking for a way to supply it with XML from a variable in bash without having to first save out the contents of the var to a file. What I need is to have a single file in the end that has been properly formatted. I suppose I can save out a temporary file, format it out to the final file and then delete the temporary one. I was just hoping there was some way I could save that step or at least something less manual.
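Going back to the -v answer, here is a small self-contained sketch of the same technique. The sample data is made up; only the "--DIVIDER--" marker and the variable names come from the script in this thread.

```shell
#!/bin/bash
# Made-up sample: three records joined by the --DIVIDER-- marker.
movieFields='<movieinfo id="1"/>--DIVIDER--<movieinfo id="2"/>--DIVIDER--<movieinfo id="3"/>'
movieNum=2

# -v copies the shell value into the awk variable "record".
# Inside awk, $record means "the field whose number is stored in record",
# so with record=2 this prints the second field.
selected=$(printf '%s\n' "$movieFields" |
    awk -v record="$movieNum" 'BEGIN { FS = "--DIVIDER--" } { print $record }')

echo "$selected"
```

Quoting the shell variables ("$movieFields", "$movieNum") keeps the shell from splitting or glob-expanding the data before awk ever sees it.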
Edited by hybrid8 (14/01/2009 19:58)
#318178 - 14/01/2009 22:03
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Unix CLI tradition is to accept input from stdin if a filename argument is given as "-". So, I don't know that this works, but it's worth trying "echo $XMLDATA | xml format -". You might also try it with no filename argument at all: "echo $XMLDATA | xml format".
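The general pattern of feeding a shell variable to a program's stdin can be sketched as follows. Whether xmlstarlet's format command accepts "-" is exactly what is left untested above, so this uses cat as a stand-in filter; the XML snippet is made up.

```shell
#!/bin/bash
XMLDATA='<records>
  <movieinfo id="2898"/>
</records>'

# Pipe the variable to a filter; "-" asks cat to read stdin instead of a file.
# Quoting "$XMLDATA" keeps its newlines and spacing intact.
piped=$(printf '%s\n' "$XMLDATA" | cat -)

# bash-only alternative: a here-string feeds the variable to stdin without a pipe.
herestring=$(cat - <<< "$XMLDATA")

[ "$piped" = "$herestring" ] && echo "both forms delivered the same text"
```

If the real tool does read stdin, the cat in either form would simply be replaced by the formatter and the result redirected to the final file, with no temporary file needed.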
_________________________
Bitt Faulk
#318185 - 14/01/2009 22:25
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Well, it turns out this isn't working once I put it all into the existing loop. It keeps hanging and I thought it was the IFS... So removed it and replaced the for with a more traditional for using a counter. This seems to let it loop longer. I'm testing without the wgets to the server and just pulling the XML data. I can loop a full count of 86 items without an issue if I'm just tossing the data into a variable as I pasted above. This was causing a hang before with the other loop. But now if I put back the code to echo the contents of that variable to files (a different file for every pass of the loop) it hangs again well before completing. It seems to hang at a random point - a different pass each time I run it. Am I looping too fast and trying to create too many files too quickly? I just tried with a "sleep 1" after the file output and the first time it got through 13 passes of the loop. The second time 27 passes. This is getting really annoying.
#318187 - 14/01/2009 22:51
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Arrgh, now I'm getting really pissed at this... Ok, I thought I had just seen something interesting, but it seems just like more randomness. It will hang even without the redirect to a file. I managed to get all passes done in the loop while outputting to the console (plain echo as shown above) but on future runs it hangs at random points.
#318202 - 15/01/2009 16:11
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31607
Loc: Seattle, WA
|
"it hangs at random points."
Is it pulling the data live from the web each time through the loop? Maybe just TCP timeouts.
#318203 - 15/01/2009 16:13
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: tfabris]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Not TCP timeouts because I'm currently testing by pulling the data from a local file. And the data is only pulled at the beginning of the script to create variables that hold it for later use.
This is super frustrating...
#318204 - 15/01/2009 16:31
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31607
Loc: Seattle, WA
|
Maybe something is wrong with the particular bash interpreter version installed on that machine? Something about your script exercising the interpreter's memory manager in an odd way.
#318205 - 15/01/2009 17:02
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: tfabris]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
I'm going to post the script in a little while, maybe some people will be kind enough to give it a shot to see if they also experience the same issue. I'll include the local file as well so it doesn't have to hit the network.
#318206 - 15/01/2009 17:47
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Here's a cut-down version of the script which omits the parts that actually download the trailers but includes the parts that are causing the issue, the capturing of XML data on a per-movie basis.
You'll have to define the paths at the top of the file appropriate for your system. SAVEPATH is where movie folders are created and movie data is saved if those lines are uncommented. SCRIPTPATH is the location of this script and the awk cleaner script (included in the attached zip file along with another copy of this main script and the XML data file).
There are some lines commented out right now. The way it is now it will grab the metadata from the XML file into three variables. The TRAILERS variable is composed of select values from the XML file separated by ";field" and then each movie is separated by ";x545hwx1" - this is done so it's easy to pull values for individual movies when looping. If you refer to the original script you can see that the last separator used to be a plain newline, but for testing purposes I've gone to something more specific. The next variable, recordDATE, contains only a single string pulled from the very top of the XML file containing the date the XML file was created. Then we have the movieFields variable which contains ALL fields for every movie. Individual fields are not separated by special markers, but each movie is separated by "--DIVIDER--" - this is the variable I use to pull the full metadata for each movie which I'd like to save out to a proper XML file, also for each movie (the line of code to do this is below and commented out).
The movie data just mentioned is currently output to the console. If you comment out the echo line and instead enable the one above it, it will instead put the data into files, one per movie folder. You'll also have to enable the line that creates the movie name folders. This thing will hang for me at random times running this way or running with file creation.
When I first started testing today it would complete the whole thing without a problem. I must have done it like 10 times in a row to the console as well as a few times outputting to files. Then it started hanging again at random points. No code was changed during these tests.
#!/bin/bash
movieRow=0
BEXTENSION=".trailer.mov"
GET1080p=0
GETPOSTER=1
SAVEPATH="v:/Movies/zzztrailertest/"
SCRIPTPATH="d:/AppleTrailers/"
FEEDURL="d:/AppleTrailers/current_720p.xml"
TRAILERS=`xml sel --net -D -T -t -m "/records/movieinfo" \
    -v "@id" -o ";field" \
    -v "info/title" -o ";field" \
    -v "info/postdate" -o ";field" \
    -v "preview/large" -o ";field" \
    -v "poster/xlarge" \
    -o ";x545hwx1" \
    $FEEDURL`
recordDATE=`xml sel --net -D -T -t -m "/records" \
    -v "@date" \
    $FEEDURL`
movieFields=`xml sel --net -I -E utf-8 -t -m "/records/movieinfo" \
    -c "." \
    -o "--DIVIDER--" \
    $FEEDURL`
for movieCOUNTER in `seq 1 86`; do
#sleep 1
MOVIE=`echo $TRAILERS | awk -v line=$movieCOUNTER 'BEGIN { FS = ";x545hwx1" } ; { print $line }'`
MOVIEID=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $1 }'`
MOVIETITLE=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $2 }'`
MOVIETITLEFILE=`echo "$MOVIETITLE" | $SCRIPTPATH/filecleaner.awk`
POSTDATE=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $3 }'`
NEWPREVIEWNAME="$MOVIETITLEFILE$BEXTENSION"
POSTER=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $5 }'`
NEWPOSTERNAME="folder.jpg"
MOVIESAVEPATH="$SAVEPATH$MOVIETITLEFILE"
#mkdir "$MOVIESAVEPATH/"
selectedMovie=`echo $movieFields | awk -v movieRecord=$movieCOUNTER 'BEGIN { FS = "--DIVIDER--" } ; { print $movieRecord }'`
#echo -e "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r<records date=\"$recordDATE\">$selectedMovie</records>" >"$MOVIESAVEPATH/temp.xml"
echo -e "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r<records date=\"$recordDATE\">$selectedMovie</records>"
done
Attachments
AppleTrailers.zip (152 downloads)
Edited by hybrid8 (15/01/2009 17:50) Edit Reason: adding attachment
#318216 - 16/01/2009 02:45
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Okay, I've been working on this for a while, and I've tidied it up, but I'm still not clear on what it is you're trying to do.
That said, it never seemed to fail for me, but here it is with the awk script embedded, using tabs and newlines as delimiters, and generally cleaned up. Added some comments so you can understand what's going on.
Attachments
AppleTrailerTEST.bash (193 downloads)
_________________________
Bitt Faulk
|
Top
|
|
|
|
#318217 - 16/01/2009 03:05
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Thanks Bitt, I'll take a look at it in the morning.
The whole thing will provide me with up-to-date trailers, poster art and information for my PVR.
The script segment I posted is just grabbing the particulars of each movie from Apple's feed. The rest of the script (which was working fine) downloads the trailers. The whole thing together when run on a schedule will download any new trailer posted to Apple's trailer site, along with the poster for that movie and also its related metadata. These three things will be saved in a folder with the movie's name. The metadata will at some point be parsed by another tool which creates usable information for my PVR.
The script was just found, minus the new stuff that I was adding for the xml metadata saving. I used the oddball field separators because I wasn't sure whether the XML feed had any newlines in it and also because at some point I had some troubles escaping newlines, but I can't remember what problem I had anymore.
#318221 - 16/01/2009 12:08
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
So far so good. I took a quick look at the script and then ran it to see if it was working. Needed only to change the xml program var at the top (since I have xmlstarlet in my PATH already).
Then to clean it up I had to change the character encoding for all the data extraction to UTF-8. ASCII uses only the lower 128 characters (codes 0-127) and won't support any accents nor the "smart" single and double quotes which may be present in the source data.
The loop structure that was in my sample was temporary only and in the full source I also had one which would only loop as many times as there were entries in the source XML.
Do you have any idea why the other version would hang? Could it be because of many (repeated) external calls to the awk file I had created?
Again, thanks for the brilliant help. You did an amazing job with the cleanup of the script. I could just understand the basics of what was going on in the original script, but I have no experience with bash syntax, so this is hugely appreciated.
Now I'm going to start adding back in the original trailer and poster downloading code and at the same time try to apply the same type of cleanup to those parts so they match what you've done.
I'll send an update when it's done and after I verify everything is working correctly I'll post it all again for review and of course for anyone that wants to use it.
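On the question of repeated external awk calls: nothing in the thread establishes that they cause the hang, but with tab-delimited records they can be avoided altogether. The sketch below splits one record with the bash read builtin; the sample record and the example.invalid URLs are made up, and the variable names mirror the script.

```shell
#!/bin/bash
# Made-up record in the same shape as one TRAILERS entry: five tab-separated fields.
MOVIE=$'2898\t12\t2008-11-04\thttp://example.invalid/12_h640w.mov\thttp://example.invalid/12_xl.jpg'

# One builtin call fills all five variables; no awk processes are started.
# Caveat: tab counts as whitespace in IFS, so an EMPTY field in the middle of
# a record would be skipped and the later fields would shift left.
IFS=$'\t' read -r MOVIEID MOVIETITLE POSTDATE PREVIEW POSTER <<< "$MOVIE"

echo "id=$MOVIEID title=$MOVIETITLE posted=$POSTDATE"
```

Because IFS is set only as a prefix to read, it applies to that one command and the rest of the script keeps its normal word splitting.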
#318224 - 16/01/2009 14:29
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
"Then to clean it up I had to change the character encoding for all the data extraction to UTF-8. ASCII uses only the lower 128 characters (codes 0-127) and won't support any accents nor the 'smart' single and double quotes which may be present in the source data."
That's for output. If you switch it back to ascii, you'll see that it understands the UTF-8 input and prints &#nnnn; instead of the raw character, which is probably closer to where you need to be. I could be wrong, though.
_________________________
Bitt Faulk
#318228 - 16/01/2009 15:35
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
If I didn't take in utf-8 on the input (parse) then I'd have to introduce new code to utf-8-encode the data on output. Without an encode pass on output obviously we're left with the escaped multi-byte data as plain-text.
So as I was adding back the rest of the content I discovered a few important things about the feeds and some deficiencies in the original script (again, I didn't write the original).
The original script was parsing through two feeds: a 720p feed and a feed for normal size videos (usually up to 640 wide). Each feed contains a PREVIEW section which contains keys to represent different sizes for the movie files. Both feeds use only the "LARGE" key however. It would have been nice if Apple had only a single feed and then used the size key to simply specify all the different file sizes available.
Anyway, the script used the 720 feed to hard-code a filename substitution to try and guess a possible 1080 file. However the substitution was searching for "a720p.mov" and not all the included movies had that final extension. Some were specified as .m4v. But there are no 1080 alternates with an m4v extension on the server... I thought about doing some better file extension handling but then after some poking around on Apple's site I discovered that all files included in the feed were ALSO available with the a720p.mov extension. At this point it started to look like I could ignore the extension in the feed.
The standard feed seems to be a superset of the 720 and, as of today, contained a couple of movies not included on the 720. All movies in the standard feed end in h640w.mov. I've changed the script to make it only look at the standard feed and from that to make substitutions for the correct extensions to obtain the HD files.
I've duplicated the original 1080 conditional two times so the script can handle every different size uniquely, falling back from the highest to the lowest (vars allow enabling/disabling specific HD sizes): 1080p > 720p > 480p > standard 640w
I can improve the logic but my biggest concern was to make sure the file capturing was working properly. From a functional point of view I'll have to keep watch to find out if files are ever introduced into the regular feed with an extension other than h640w.mov - if so I'll have to include some logic instead of simple substitution to create the other filenames (otherwise it will never try to grab an HD version and will instead grab the standard version and simply name the output file with the HD filename). Next post will contain the script.
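The highest-to-lowest fallback can also be written as a single loop instead of duplicated conditionals. This is only a sketch of the idea, not the script that follows: fetch is a hypothetical stand-in for the wget call (here it pretends only the 480p and standard files exist), and the URL is made up. The file-name suffixes are the ones the script substitutes.

```shell
#!/bin/bash
# Hypothetical stand-in for the real download step: succeed only for the
# 480p and standard-size files, fail for everything else.
fetch() {
    case "$1" in
        *h480p.mov|*h640w.mov) return 0 ;;
        *) return 1 ;;
    esac
}

PREVIEW="http://example.invalid/movie-tlr1_h640w.mov"
base=${PREVIEW%h640w.mov}    # URL with the size-specific ending removed

# Try each size from highest to lowest and stop at the first success.
got=""
for suffix in h1080p.mov a720p.mov h480p.mov h640w.mov; do
    if fetch "${base}${suffix}"; then
        got="${base}${suffix}"
        break
    fi
    echo "not available: ${base}${suffix}"
done
echo "downloaded: $got"
```

The per-size enable flags (GET1080p and friends) could be honored by building the suffix list conditionally before the loop.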
#318229 - 16/01/2009 15:58
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
NEWER Apple Downloader Script.
#!/bin/bash
# Download Movie trailers from Apple - downloads a single trailer per movie in Apple's trailer XML feed
# specify whether or not to get HD trailers - download priority is 1080p > 720p > 480p > standard 640 wide
GET1080p=0
GET720p=0
GET480p=1
GETPOSTER=1
FEEDS="http://www.apple.com/trailers/home/xml/current.xml"
#for local testing specify a file instead of hitting the net for the feed
#FEEDS="./Apple640Trailers.xml"
# define programs
XMLSTARLET='xml'
AWK='gawk'
# hard-coded file extension for saved videos
# ideally we'd preserve the extension of the original movie file and only add the ".trailer" before it
# (in case there's anything other than .mov)
BEXTENSION=".trailer.mov"
# save location for the individual trailer folders
SAVEPATH="v:/Movies/zzztrailertest/"
#save path for the tracker file below
DLDBPATH="./"
# text file to keep track of completed downloads to prevent getting the same trailer the next time script runs
tail -5000 $DLDBPATH.downloaded.db > $DLDBPATH.downloaded.db.tmp
mv $DLDBPATH.downloaded.db.tmp $DLDBPATH.downloaded.db
# this cleans passed content of characters that are invalid for Windows filenames and some which are valid but unwanted
FILECLEANER_AWK='
{
## some html escapes:
gsub("&gt;",">")
gsub("&lt;","<")
gsub("&quot;","\"")
gsub("&rdquo;","\"")
gsub("&bdquo;","\"")
gsub("&lsquo;","\"")
gsub("&rsquo;","\"")
gsub("&sbquo;",",")
gsub("&amp;","\\&")
## replace fancy "smart" quotes with straight equivalents
gsub("’","'"'"'")
gsub("‘","'"'"'")
gsub("“","\"")
gsub("”","\"")
gsub("„","\"")
gsub("„","\"")
## backquote to apostrophe
gsub("`","'"'"'")
## double quote to apostrophe
gsub("\"","'"'"'")
## select illegal filename characters replaced by alternates
gsub(">",")")
gsub("<","(")
gsub("[:]"," - ")
gsub("[/]","-")
## backslash to dash
gsub("\\\\","-")
gsub("[?]","")
gsub("[|]","-")
gsub("[*]","+")
## double space to single space (we may have created a double space in a previous substitution)
gsub("  "," ")
## sanitize the rest:
## gsub("[^- '"'"'[:alnum:] _$+&={}\\[\\]()%@!;,.]*","")
gsub("^[[:blank:]]*", "")
gsub("[[:blank:]]*$", "")
## dump it to stdout
print
}
'
# main loop - passes once per feed specified above
for FEEDURL in $FEEDS; do
# set of partial movie metadata - only the fields we need for downloading, saving & tracking the video/image files.
IFS=$'\n' TRAILERS=(`$XMLSTARLET sel --net -E utf-8 -D -T -t -m "/records/movieinfo" \
-v "@id" -o '	' \
-v "info/title" -o '	' \
-v "info/postdate" -o '	' \
-v "preview/large" -o '	' \
-v "poster/xlarge" --nl \
$FEEDURL 2>/dev/null`)
# complete set of movie metadata to be saved out one file per video later - one record per line
IFS=$'\n' movieFields=(`$XMLSTARLET sel --net -E utf-8 -t -m "/records/movieinfo" \
-c "." \
--nl \
$FEEDURL 2>/dev/null`)
recordDATE=`$XMLSTARLET sel --net -D -T -t -m "/records" \
-v "@date" \
$FEEDURL 2>/dev/null`
# individual feed loop - passes once per movie in feed
count=-1
for MOVIE in "${TRAILERS[@]}"; do
# bash (and ksh and zsh) can do math this way
count=$(($count+1))
# notice I set the delimiter with an argument instead of in a BEGIN
MOVIEID=`echo $MOVIE | $AWK -F'\t' '{ print $1 }' 2>/dev/null`
MOVIETITLE=`echo $MOVIE | $AWK -F'\t' '{ print $2 }' 2>/dev/null`
# giving the script as an argument instead of a file containing the script
MOVIETITLEFILE=`echo "$MOVIETITLE" | $AWK "${FILECLEANER_AWK}"`
POSTDATE=`echo $MOVIE | $AWK -F'\t' '{ print $3 }' 2>/dev/null`
# web path to the video file referenced in the feed xml
PREVIEW=`echo $MOVIE | $AWK -F'\t' '{ print $4 }' 2>/dev/null`
# filename substitutions to allow getting HD versions of the referenced file
# HARD CODED - need logic if referenced names have extensions other than "h640w.mov"
PREVIEW1080p=${PREVIEW%%h640w.mov}h1080p.mov
PREVIEW720p=${PREVIEW%%h640w.mov}a720p.mov
PREVIEW480p=${PREVIEW%%h640w.mov}h480p.mov
# web path to the poster file
POSTER=`echo $MOVIE | $AWK -F'\t' '{ print $5 }' 2>/dev/null`
# new local filename to save poster file
NEWPOSTERNAME="folder.jpg"
# added braces around the variable names for clarity
MOVIESAVEPATH="${SAVEPATH}${MOVIETITLEFILE}"
# create a folder for the downloaded files (using the movie's cleaned name)
if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
mkdir -p "$MOVIESAVEPATH"
fi
# save the trailer's XML data to its own file within the trailer's folder
echo -e "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<records date=\"$recordDATE\">${movieFields[$count]}</records>" >"$MOVIESAVEPATH/temp.xml"
# reformat the XML to make it human-readable
# (run directly rather than inside backticks, which would try to execute the command's output)
$XMLSTARLET format "$MOVIESAVEPATH/temp.xml" >"$MOVIESAVEPATH/description.xml"
rm "$MOVIESAVEPATH/temp.xml"
# get and save a 1080p (1920x...) resolution video file
if [ "$GET1080p" -eq "1" ]; then
if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
# new local filename to save video file
NEWPREVIEWNAME="${MOVIETITLEFILE} [1080p]${BEXTENSION}"
wget -c -O "$MOVIESAVEPATH/$NEWPREVIEWNAME" $PREVIEW1080p; PREVIEWOUT1080p=$?
if [ $PREVIEWOUT1080p -eq 0 ]; then
echo "###$MOVIEID.PREVIEW $NEWPREVIEWNAME" >> $DLDBPATH.downloaded.db
else
echo "##### ID:$MOVIEID URL:$PREVIEW1080p FAILED -- TRYING NEXT LOWER SIZE"
fi
fi
fi
# or get and save a 720p (1280x...) resolution video file
if [ "$GET720p" -eq "1" ]; then
if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
# new local filename to save video file
NEWPREVIEWNAME="${MOVIETITLEFILE} [720p]${BEXTENSION}"
wget -c -O "$MOVIESAVEPATH/$NEWPREVIEWNAME" $PREVIEW720p; PREVIEWOUT720p=$?
if [ $PREVIEWOUT720p -eq 0 ]; then
echo "###$MOVIEID.PREVIEW $NEWPREVIEWNAME" >> $DLDBPATH.downloaded.db
else
echo "##### ID:$MOVIEID URL:$PREVIEW720p FAILED -- TRYING NEXT LOWER SIZE"
fi
fi
fi
# or get and save a 480p (848x...) resolution video file
if [ "$GET480p" -eq "1" ]; then
if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
# new local filename to save video file
NEWPREVIEWNAME="${MOVIETITLEFILE} [480p]${BEXTENSION}"
wget -c -O "$MOVIESAVEPATH/$NEWPREVIEWNAME" $PREVIEW480p; PREVIEWOUT480p=$?
if [ $PREVIEWOUT480p -eq 0 ]; then
echo "###$MOVIEID.PREVIEW $NEWPREVIEWNAME" >> $DLDBPATH.downloaded.db
else
echo "##### ID:$MOVIEID URL:$PREVIEW480p FAILED -- TRYING STANDARD SIZE"
fi
fi
fi
# or get and save the standard (640x...) resolution video file as referenced in the XML feed
if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
# new local filename to save video file
NEWPREVIEWNAME="${MOVIETITLEFILE}${BEXTENSION}"
wget -c -O "$MOVIESAVEPATH/$NEWPREVIEWNAME" $PREVIEW; PREVIEWOUT=$?
if [ $PREVIEWOUT -eq 0 ]; then
echo "###$MOVIEID.PREVIEW $NEWPREVIEWNAME" >> $DLDBPATH.downloaded.db
else
echo "##### ID:$MOVIEID URL:$PREVIEW FAILED -- RETRY NEXT RUN"
fi
else
echo "##### Trailer ID:$MOVIEID NAME:$MOVIETITLE MARKED DONE -- SKIPPING"
fi
# get and save the movie poster image
if [ "$GETPOSTER" -eq "1" ]; then
if ! grep -q "###$MOVIEID.POSTER" $DLDBPATH.downloaded.db; then
wget -c -O "$MOVIESAVEPATH/$NEWPOSTERNAME" $POSTER; POSTEROUT=$?
if [ $POSTEROUT -eq 0 ]; then
echo "###$MOVIEID.POSTER $NEWPOSTERNAME" >> $DLDBPATH.downloaded.db
else
echo "##### ID:$MOVIEID URL:$POSTER FAILED -- RETRY NEXT RUN"
fi
else
echo "##### Poster ID:$MOVIEID NAME:$MOVIETITLE MARKED DONE -- SKIPPING"
fi
fi
done
done
Edited by hybrid8 (16/01/2009 17:46) Edit Reason: Included Bitt's string-replace changes plus fixed missing title on output of SKIPPED messages
#318230 - 16/01/2009 16:18
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Change this:
PREVIEW1080p=`echo $MOVIE | $AWK -F'\t' '{ print $4 }' |sed 's/h640w\.mov$/h1080p.mov/g'`
to:
PREVIEW1080p=${PREVIEW%%h640w.mov}h1080p.mov
That saves about four new processes. I think you can figure out the syntax for the other two. Oh, and you don't need to save $? to an intermediate variable.
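For anyone unfamiliar with the ${VAR%%pattern} form, a short sketch of what it does. The URLs are made up; the suffixes are the ones used in the script.

```shell
#!/bin/bash
PREVIEW="http://example.invalid/trailers/movie-tlr1_h640w.mov"   # made-up URL

# ${var%pattern} removes the shortest match of pattern from the END of the value;
# %% removes the longest match. For a literal suffix the two behave identically.
PREVIEW720p=${PREVIEW%h640w.mov}a720p.mov
echo "$PREVIEW720p"

# If the suffix is absent, nothing is removed and the new ending is simply
# appended, which is the hard-coded-extension risk discussed in the thread.
OTHER="http://example.invalid/trailers/movie.m4v"
MANGLED=${OTHER%h640w.mov}a720p.mov
echo "$MANGLED"
```

The expansion happens inside the shell itself, so no echo, awk or sed process is started.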
Edited by wfaulk (16/01/2009 16:20)
_________________________
Bitt Faulk
#318231 - 16/01/2009 17:26
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Hehe. That was a pretty awkward way to work the substitution, I'll admit. Even though I didn't know how to do it the way you just suggested, I should have used the existing PREVIEW variable instead of parsing through the array again. Serves me right for too much copy paste.
With regards to the $? are you talking about the result back from the wget? That part is unchanged from the original script I found. I'd prefer to put the wget within the IF itself, but again, bash is just completely unintuitive to me. It's absolutely nothing like C, Basic, Pascal or PHP - stuff I've used and learned over the years.
|
#318232 - 16/01/2009 17:52
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Yeah, this part:
wget -c -O "$MOVIESAVEPATH/$NEWPREVIEWNAME" $PREVIEW1080p; PREVIEWOUT1080p=$?
if [ $PREVIEWOUT1080p -eq 0 ]; then
is more commonly written as just:
wget -c -O "$MOVIESAVEPATH/$NEWPREVIEWNAME" $PREVIEW1080p
if [ $? -eq 0 ]; then
There's not really anything wrong with the way you have it, but it seems ... wasteful. There's no way to embed the wget inside the if. Well, I suppose you could do this:
if [ `wget ....; echo $?` -eq 0 ]; then
but don't.
_________________________
Bitt Faulk
|
#318233 - 16/01/2009 18:14
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 24/12/2001
Posts: 5528
|
Implementing this in Perl would be cleaner IMO.
|
#318234 - 16/01/2009 18:15
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: tman]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Absolutely. But Bruno keeps saying that he wants to stick with bash, sed, and awk.
_________________________
Bitt Faulk
|
#318235 - 16/01/2009 18:15
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 24/12/2001
Posts: 5528
|
Or if it was something I needed to host remotely, I'd definitely do it in PHP. You can run PHP standalone and not as part of a webserver.
|
#318236 - 16/01/2009 18:20
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 13/07/2000
Posts: 4181
Loc: Cambridge, England
|
There's no way to embed the wget inside the if. Well, I suppose you could do this: if [ `wget ....; echo $?` -eq 0 ]; then but don't.
If you only need the return code in order to drive the "if", what's wrong with: ...?
Peter
|
#318237 - 16/01/2009 19:05
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Nonono... I absolutely loathe bash. Hadn't I made that clear? What I said was that I had cygwin installed on my Windows machine and therefore bash was already there. And, importantly, that I had sourced this script which was already bash and wanted to avoid having to re-write the whole thing myself. Prior to you (Bitt) cleaning everything up, I was ready to install PHP so that I could rewrite the whole thing in PHP-CLI (still using XMLSTARLET though).
Trevor, I found out last night about installing PHP for use without a web server. If I had known that someone would have taken the time to pretty much re-write the whole thing for me, I would have said to feel free and do it in Perl or PHP. You guys are too nice, however, and I also wasn't trying to put anyone out. I thought maybe someone would just try it out to tell me if it was hanging for them or not and then I'd just resign myself to doing it all from (mostly) scratch.
And Peter, with regards to the If wget... That's exactly how I'd do it in PHP, so that's what I was asking about the implementation in bash.
Anyway, just another amazing example of the empegBBS circle of friends.
Edited by hybrid8 (16/01/2009 19:06)
|
#318238 - 16/01/2009 19:30
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: peter]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
If you only need the return code in order to drive the "if", what's wrong with: ...?
Well, you can only check for 0 vs. not-0 that way. It will work in this case, but there's no way to test for other exit codes.
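A minimal sketch of the two styles being compared here, for readers unfamiliar with the idiom. The fetch function is a hypothetical stand-in for wget so the example runs offline, and the meaning attached to exit code 4 is purely illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for wget: it simply exits with the status it is given.
fetch() { return "$1"; }

# Style 1: "if" runs the command and branches on its exit status.
# 0 means success, anything else means failure.
if fetch 0; then
    echo "download ok"
else
    echo "download failed"
fi

# Style 2: capture the status so that different failure codes can be told apart.
status=0
fetch 4 || status=$?
case $status in
    0) result="ok" ;;
    4) result="failure type 4 (illustrative)" ;;
    *) result="other failure ($status)" ;;
esac
echo "$result"
```

The first form is enough when only success versus failure matters, which is the case for the download checks in this script.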
_________________________
Bitt Faulk
|
#318241 - 16/01/2009 22:11
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
Mojo
Unregistered
|
Nonono... I absolutely loathe bash. Hadn't I made that clear?
Me too. No offense to anyone who contributed to that script, but it's ugly and cryptic. I wanted a similar script after reading this thread, so I decided to write my own during my lunch break. Here it is.
Why do I like mine better? Well, it's a whole 27 lines shorter. Also it doesn't have any external dependencies like xmlstarlet or awk or wget. And it's cross-platform; it'll run on Windows, Mac OS X, Linux, Solaris... I also find it to be much easier to read, which means much easier to edit and maintain.
What do you need? You need a Tcl interpreter. If you're on Mac OS X or Linux then you're all set because Tcl should already be installed. If you're on Windows, install Tcl: http://www.activestate.com/activetcl/downloads/
The destination folder for the trailers ($TargetDir) is set to the current working directory by default. So if you put the script in C:\whatever\trailers then you would do this:
> cd C:\whatever\trailers
> tclsh
% source GetTrailers.tcl
The files are organized by each movie title being the name of a folder which contains the trailer, large & extra large poster images, and a movieinfo.xml file that contains the relevant xml data pertaining to that movie (so that you have all the good info in there for some future use).
#! /usr/bin/tclsh
# Location of the raw XML movie index.
set FeedsURL "http://www.apple.com/trailers/home/xml/current.xml"
# Download to the current directory.
set TargetDir [pwd]
# We'll use this global variable for the raw XML.
set FeedsXML ""
# And this will be for our organized movie data.
array set Movies [list]
# Load this standard Tcl package.
package require http
# Parses all relevant data for the next listed movie in the XML data, starting at the specified character index.
proc parseNextMovie {index} {
global FeedsXML Movies
set startIndex [string first {<movieinfo id="} $FeedsXML $index]
set endIndex [string first {</movieinfo>} $FeedsXML $index]
incr endIndex 11
set xml [string range $FeedsXML $startIndex $endIndex]
if { $startIndex == -1 } {
# There are no more movies to be parsed
return -1
}
# Parse the movie title.
set index [string first {<title>} $xml]
incr index 7
set end [string first {</title>} $xml $index]
incr end -1
set title [cleanTitle [string range $xml $index $end]]
# Parse the large movie poster URL.
set index [string first {<poster><location>} $xml]
incr index 18
set end [string first {</location>} $xml $index]
incr end -1
set posterLargeURL [string range $xml $index $end]
# Parse the extra large movie poster URL.
set index [string first {<xlarge>} $xml]
incr index 8
set end [string first {</xlarge>} $xml $index]
incr end -1
set posterXLargeURL [string range $xml $index $end]
# Parse the trailer URL.
set index [string first {<preview>} $xml]
set index [string first {">} $xml $index]
incr index 2
set end [string first {</} $xml $index]
incr end -1
set trailerURL [string range $xml $index $end]
# Save all this info in our Movies array.
set Movies($title) [list $xml $posterLargeURL $posterXLargeURL $trailerURL]
# Return the ending character index for this movie within $FeedsXML.
return $endIndex
}
# Downloads the specified movie trailer and posters.
proc downloadMovie {title} {
global Movies ProgressBar TargetDir
if { ![info exists Movies($title)] } {
return
}
set xml [lindex $Movies($title) 0]
set posterLargeURL [lindex $Movies($title) 1]
set posterXLargeURL [lindex $Movies($title) 2]
set trailerURL [lindex $Movies($title) 3]
# Download the posters.
# Use [catch] just in case the URLs are bad, which they would be if Apple
# didn't provide posters for a certain movie for some reason.
set fileToken [open temp_poster_l w]
fconfigure $fileToken -translation binary
catch {
set httpToken [http::geturl $posterLargeURL -channel $fileToken]
http::cleanup $httpToken
}
close $fileToken
set fileToken [open temp_poster_xl w]
fconfigure $fileToken -translation binary
catch {
set httpToken [http::geturl $posterXLargeURL -channel $fileToken]
http::cleanup $httpToken
}
close $fileToken
# Download the trailer.
set ProgressBar -1
set fileToken [open temp_trailer w]
fconfigure $fileToken -translation binary
catch {
set httpToken [http::geturl $trailerURL -channel $fileToken -progress downloadProgress]
http::cleanup $httpToken
}
close $fileToken
# Create a new directory for our freshly downloaded movie.
set dir $TargetDir/$title
file mkdir $dir
# Move all of our movie files into this directory.
# [file extension] already includes the leading dot, so don't add another one.
file rename temp_poster_l $dir/poster_l[file extension $posterLargeURL]
file rename temp_poster_xl $dir/poster_xl[file extension $posterXLargeURL]
file rename temp_trailer $dir/[file tail $trailerURL]
# Save the xml data pertaining to this movie as movieinfo.xml.
set token [open $dir/movieinfo.xml w]
puts $token $xml
close $token
return
}
# Callback procedure for downloads that keeps us informed of the download progress.
proc downloadProgress {token total current} {
global ProgressBar
# Initiate ProgressBar if necessary.
if { $ProgressBar < 0 } {
set ProgressBar 0
puts "<------------------>"
flush stdout
}
# Calculate the number of progress bars that should be displayed.
set bytesPerBar [expr { 1.0 * $total / 20 }]
set bars [expr { int($current / $bytesPerBar) }]
while { $ProgressBar < $bars } {
puts -nonewline "|"
flush stdout
incr ProgressBar
}
return
}
# Replaces undesirable or incompatible characters with friendlier ones.
proc cleanTitle {title} {
set title [string map {
&gt; >
&lt; <
&quot; \"
” \"
„ \"
‘ \"
’ \"
‚ ,
&amp; &
> )
< (
: -
/ -
\\ -
? ""
| -
* +
} $title]
return $title
}
# Download the movie index.
set token [http::geturl $FeedsURL]
set FeedsXML [encoding convertfrom utf-8 [http::data $token]]
http::cleanup $token
# Loop through the XML, parsing movie data until there are no more movies to parse.
set index 0
while { $index > -1 } {
set index [parseNextMovie $index]
}
# Let's see which movies we already have downloaded and remove them from our Movies array.
# We'll see what directories are in our $TargetDir, and assume each is the name of a movie.
foreach file [glob -directory $TargetDir -nocomplain -tails *] {
if { [file isdirectory [file join $TargetDir $file]] } {
if { [info exists Movies($file)] } {
unset Movies($file)
}
}
}
# Now our Movies array only contains movies which haven't been downloaded yet. Let's download them one by one.
set titles [lsort -dictionary -increasing [array names Movies]]
set count 0
foreach title $titles {
incr count
puts "\nDownloading $count/[llength $titles] \"$title\""
downloadMovie $title
}
Attachments
GetTrailers.tcl (205 downloads)
|
#318242 - 17/01/2009 00:36
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: ]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Umm, let me amend what I said before... I loathe bash, but tcl looks like a huge pain in the ass and makes literally no sense whatsoever to someone used to programming in a sensible programming language. I think I can easily shave 20-30 lines from the original bash script, especially by optimizing the conditionals (they're currently done in a really lame way), but that tcl version looks infinitely more complicated to maintain. No offense of course as I imagine it's just the nature of the syntax.
Benefits of the bash/awk/xmlstarlet solution:
- The final xml metadata save file is formatted cleanly for human readability as well as programmatic processing.
- The script can handle getting HD versions of the trailers. I didn't see the ability to get anything but what was specified in the XML feed in the tcl version.
- You can delete trailers or trailer folders that have been downloaded and the script will not re-download them. That benefit is realized by storing a list of downloaded trailers in a data file. This, IMO, is a very important necessity unless you want to keep a copy of absolutely everything going forward.
- Not in the version I pasted, but easy to modify: all visual output can be hidden, making it even better suited for scheduling automated launching. I don't intend to run the script manually except while testing it initially.
I had a binary program for Windows that did a similar task but it didn't keep track of what had already been downloaded using an external file, so it suffered the same problem as the tcl above, plus I didn't like the way it named the movies. It also didn't allow downloading any resolution like the bash solution does, nor did it save the XML data for the movies (though a version was released as a plugin for Media Portal which I think may have done that specific to that host app).
All things considered this is a really trivial problem to solve and my only problem was using the tools for the already existing script (bash, xmlstarlet, awk and sed) which I have only ever touched so briefly in the past. I still think this implemented in PERL or PHP would be a lot easier to follow from a code perspective and could be done a lot cleaner. At this point however this is working fine so I'm not concerned with redoing it. It does require the UTF-8 patch for cygwin, but it's for my own personal use anyway. Other people can feel free to take it to whatever next level they want.
|
#318248 - 17/01/2009 04:15
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
Mojo
Unregistered
|
Well, suit yourself. I obviously disagree that the bash script is more readable than Tcl. Also, if I used xmlstarlet and wget, I would shave about 130 lines off of that code. I could easily make it a quarter of the size of that bash script. Downloading the high-res trailers is a trivial modification. Also, I think it's a much cleaner solution to check the folder for existing movies than to maintain a flat file. I'm not looking to delete a movie to save 20 MB of disk space, and the previews are rotated every so often anyway. If I was going to persist data, I would just go ahead and make a full-fledged sqlite database containing all of the movie info provided by Apple.
I suppose though that we are after two different solutions. You want to see new and upcoming movies on your TV. So I'm sure disk space is limited and you only want to see new stuff that you're interested in anyway and don't want to re-download trailers you've deleted because they don't interest you. I, on the other hand, would like to have the preview of every movie ever made if I could. I enjoy doing things like mirroring wikipedia. Still, I'd take Tcl over bash for a 200 line script any day. By all means though, use what works for you.
|
#318252 - 17/01/2009 12:10
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: ]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
The HD trailers are anywhere from 20-200MB each, but my disk space isn't that limited as I've got about 3TB on my media server right now. 1.5TB of that is mirrored RAID.
I will keep SOME trailers persistently, but I won't keep ALL trailers persistently. Basically this is how I'll manage the whole affair...
I will be downloading everything that gets listed in the Apple feed. This allows me to see what's coming up. I will only keep trailers for movies I have or that I would like to watch, which means some will be deleted. This allows me to review trailers for stuff that's already out as well.
As I purchase full length movies, I store them on my media server. I'm totally getting away from physical media. I did it for music years ago and now it's time to cut the cord for video.
Now, along with the full length feature, I have a good chance of also having the trailer for that feature saved along with it. This allows me (or other members of the house and friends) to take a quick peek at the trailer and read a synopsis to see if that's the movie they want to fire up and watch.
|
#318518 - 27/01/2009 21:59
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Ok, hopefully someone is still feeling generous enough to help with a few new syntax questions... I'm modifying the script to check for trailers that have been updated (trailer 2 etc..) and I have the checks for that working fine using posting dates. Now I need to provide an OR condition when it comes time to save the downloads. The existing IF is this:
if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
I would like to include an OR for variable $getUpdate equal to 1. I've tried a couple of things that didn't work, such as this:
if ( ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db ) || [ "$getUpdate" -eq "1" ]; then
and this:
if (( ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db ) || [ "$getUpdate" -eq "1" ]); then
The error produced is: "integer expression expected" - which is unexpected to me since an integer is what I thought I was expressing.
In an earlier part of the script I'd like to check if xmlstarlet fails to open an input file.
oldPostDate=(`$XMLSTARLET sel -E utf-8 -D -T -t -m "/records/movieinfo" \
-v "info/postdate" \
"$MOVIESAVEPATH/description.xml" `)
Even when this produces an error the script works, but I'd like to skip performing a few steps if this isn't able to pull the data I'm expecting. Lastly, with regards to the same call just posted above, does anyone know any way that I can pass in the variable $MOVIESAVEPATH if it contains an apostrophe (single quote)? Will I have to massage it first to escape out that character before using it with xmlstarlet? It's a valid filename character but if one exists, it causes xmlstarlet to interpret it as a quote which then causes the syntax and passed params to break.
|
#318519 - 27/01/2009 22:08
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
This is one of the reasons you shouldn't (IMO) get in the habit of using programs as arguments to if.
grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db
if [ "$?" -ne 0 -o "$getUpdate" -eq "1" ]; then
Actually, "[" is a program. It's the same as "test". If you are obsessed with having the program inside the if, this is the correct syntax:
if ! ( grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db && [ "$getUpdate" -ne "1" ] ); then
Notice that I had to invert some boolean operators to get it to work.
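One likely cause of the "integer expression expected" message is that $getUpdate was empty or unset when the test ran, since `[ "" -eq 1 ]` produces exactly that error. That is a guess; the thread does not confirm it. The sketch below, with a temporary file and movie ID made up for illustration, shows the OR condition with a default value guarding against that case.

```bash
#!/usr/bin/env bash
# Hypothetical stand-ins for the script's real database file and movie ID.
DB=$(mktemp)
echo "###1234.PREVIEW Some Movie.mov" > "$DB"
MOVIEID=1234
getUpdate=""    # deliberately empty, as if the variable had never been set

# ${getUpdate:-0} expands to 0 when getUpdate is unset or empty, so the
# numeric comparison always receives an integer.
# "! grep ... || [ ... ]" reads as: not already downloaded, OR update wanted.
if ! grep -q "###$MOVIEID.PREVIEW" "$DB" || [ "${getUpdate:-0}" -eq 1 ]; then
    action="download"
else
    action="skip"
fi
echo "$action"
rm -f "$DB"
```

Setting getUpdate=1 in the sketch flips the result to "download" even though the ID is already recorded.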
_________________________
Bitt Faulk
|
#318520 - 27/01/2009 22:14
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
I was editing/adding when you posted the reply, thanks Bitt.
With regards to the part I just added about checking to see if xmlstarlet fails to open its input file, I tried using the "$?" check as used elsewhere but this didn't work. It always seems to equal 0 when xmlstarlet fails to find the input file.
|
#318521 - 27/01/2009 22:18
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
$? relies on the program it's checking on to be well behaved and return a useful return value. If it always just exits with the return code of 0 (which is the default and signifies success), even if there is an error, then there's not much you can do. Other than checking the file yourself manually to begin with.
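A minimal sketch of that manual check, assuming the goal is just to skip the later steps when no usable post date is available. The path is hypothetical, and cat stands in for the real xmlstarlet call so the example runs anywhere.

```bash
#!/usr/bin/env bash
# Hypothetical path; in the script it would be "$MOVIESAVEPATH/description.xml".
DESCFILE="/nonexistent/description.xml"

oldPostDate=""
# -r is true only if the file exists and is readable.
if [ -r "$DESCFILE" ]; then
    # The real script would run xmlstarlet here; cat is only a stand-in.
    oldPostDate=$(cat "$DESCFILE")
fi

# Empty output is treated the same as a failed read, which also covers a
# tool that exits with status 0 after printing an error.
if [ -z "$oldPostDate" ]; then
    echo "no previous post date, skipping update check"
fi
```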
_________________________
Bitt Faulk
|
#318522 - 27/01/2009 22:24
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
does anyone know any way that I can pass in the variable $MOVIESAVEPATH if it contains an apostrophe (single quote)? Will I have to massage it first to escape out that character before using it with xmlstarlet?
Unless there's some oddness in xmlstarlet, you should be good the way you are. Generally speaking, commands don't get interpolated twice. You're actually in a situation where they might because you're using backticks, but I'm pretty sure you're okay in this instance:
% var="this's an apostrophe"
% echo `echo "$var"`
this's an apostrophe
_________________________
Bitt Faulk
|
#318524 - 27/01/2009 22:38
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
I'll go with the grep outside the IF. I didn't write that conditional and wouldn't have done it that way if I had. Especially since that grep command is issued numerous times throughout the script. I'd rather do it once and then store the result in a variable.
|
#318525 - 27/01/2009 22:48
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
I know xmlstarlet fails when I pass that var containing an apostrophe.
It produces an XPATH error: Invalid Expression
The output shows an arrow pointing to the first character after the apostrophe. In this case the variable contained "He's Just Not That Into You" and the arrow (a caret) was below the first "s".
The whole error output echoed back the variable name in single quotes, so perhaps when xmlstarlet sees its passed parameters from bash it's seeing them in single quotes? This would then obviously cause the first open quote to be closed by the apostrophe.
|
#318528 - 28/01/2009 00:20
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
No, it's telling you that an apostrophe is invalid input.
_________________________
Bitt Faulk
|
#319785 - 25/02/2009 18:05
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
new poster
Registered: 25/02/2009
Posts: 2
|
That's actually a script I threw together for my media server, and I happened to stumble upon this thread while searching for updated Apple XML feeds. Apple has broken things several times now, which has required some workarounds over time. The 720p feed appears to have recently become stale, so I'll be posting an updated version shortly to deal with this. Here is where I post updated versions of the script: http://majjix.com/luke/blog/081013/automatically-downloading-quicktime-trailers-and-posters
Edited by lstepnio (25/02/2009 18:37)
|
#319790 - 26/02/2009 00:04
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: lstepnio]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
I stopped using the 720 feed, if you look closely at the last copies I posted. Your script didn't deal with the different filename standards and file extensions that were present in Apple's feed, so I look only at the base feed and create the correct filenames for the different sizes as required. The filename formats I've included in my changes are the ones necessary to get the correct/full downloads of the movie files. The filename formats included in the initial script you wrote wouldn't work for all the posted videos.
I've made a number of other changes that aren't included in the copy I last put up here and it's been running flawlessly since that time. The only drawback stems from the simple fact that Apple doesn't post all its videos to the XML feed. The most robust solution would instead use their RSS feed, but it's a lot more complicated to parse out as it also receives content such as film excerpts. A very robust (I only did a quick look) script exists for MythTV that uses the RSS feed. Since I had what I needed working I didn't look into porting it though.
Someday when I have some time I'll re-write this in PHP, since bash scripting is just about the most useless and ugly pile of crap I've ever had the displeasure of working with. No offense to anyone who happens to like bash scripting of course. If I have the chance in the next few days I'll send up my current version and you can take a look at the changes I've made.
|
#319794 - 26/02/2009 03:33
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
new poster
Registered: 25/02/2009
Posts: 2
|
The script from the MP forums was the initial script, which didn't deal with the issues that came up over time which you have described. Those issues have been addressed as they appeared in the feeds. You'll find over time that Apple seems to make what appear to be totally random and inconsistent changes. The latest issue was that the 720p feed appears to have been stale for the past week or so. I've worked around this recently by attempting to guess the HD file names based on the SD feed, which is working on all but one item in the current feeds. Please post your source as it would be nice to incorporate any improvements into my usage. Here's the source for the current version: http://majjix.com/code/090219/appletrailers
I was recently pointed to this website which appears to be promising for a feed source: http://www.hd-trailers.net/
The feed they have available seems a bit chaotic and will be more trouble to parse: http://www.hd-trailers.net/blog/feed/
If the feed was a bit better this would be a better source as they list Apple and Yahoo trailers. :cheers:
|