#318111 - 13/01/2009 23:13
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
gsub("[^- 'a-zA-Z0-9 _$+&={}\\[\\]()%@!;,.]*","") Not sure about that one -- putting square brackets inside square brackets is ambiguous. Ok, that's fair. So how do I allow them otherwise?
#318112 - 13/01/2009 23:15
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Bitt, is your example supposed to say "blank" ? Yes. And if you're going to deal with Unicode, then you probably want to use more of those character classes. Assuming it will deal with Unicode, you can't assume that "[a-z]" includes all lowercase characters. What about "ö"?
_________________________
Bitt Faulk
#318113 - 13/01/2009 23:16
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14494
Loc: Canada
|
gsub("[^- 'a-zA-Z0-9 _$+&={}\\[\\]()%@!;,.]*","") Not sure about that one -- putting square brackets inside square brackets is ambiguous. Ok, that's fair. So how do I allow them otherwise? Oh, my apologies.. you already have them properly backslashed. Note that, inside a [] construct, you can simply use [ instead of \\[, but still need to do \\]
Edited by mlord (13/01/2009 23:18)
#318114 - 13/01/2009 23:17
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
To include one of the characters ‘\’, ‘]’, ‘-’, or ‘^’ in a character list, put a ‘\’ in front of it. Those are the only oddball characters in a character class. Note that you can use '[' unescaped.
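A minimal sketch of the bracket rule quoted above, not taken from the thread's script. It uses the portable POSIX spelling, where a literal ] goes first in the list (right after the ^ when negating) and [ needs no escape; gawk also accepts the backslashed form used earlier in the thread. The sample string and variable names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical input: keep lowercase letters and square brackets, drop everything else.
input='a[b]c*d'

# In a bracket list a literal ] goes first (after ^ when negating);
# [ needs no escape. [^][a-z] means "anything except ], [ or a-z".
cleaned=$(printf '%s\n' "$input" | awk '{ gsub(/[^][a-z]/, ""); print }')

printf '%s\n' "$cleaned"
```

Here the * is the only character outside the allowed set, so it is the only one removed.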
_________________________
Bitt Faulk
#318115 - 13/01/2009 23:38
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Assuming it will deal with Unicode, you can't assume that "[a-z]" includes all lowercase characters. What about "ö"? I did mention that on a post I edited on the first page of the thread. That I needed to support accented character variants. The manual page you linked isn't specific about whether [:alnum:] includes é, å, ñ, etc.. It does mention that you can use an equivalence class for accented characters, but then also says that the regexp matching in awk doesn't support equivalence classes. Then there are other characters that are part of foreign alphabets that are valid within filenames which can conceivably be used in the movie names listed on Apple's site. Such as ß, œ and others. In the future I'd like to break out extended information and full text naming into a metadata file which will be used by the application (SageTV in my case) and can have the filenames completely void of all these special cases. I'm not at that stage of integration yet and will still need to install some mods on my PVR to make use of any metadata files I create.
#318116 - 13/01/2009 23:50
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
If you need to deal in-depth with Unicode, ditch awk and get something that is designed to handle Unicode. Seriously. awk barely handles 8-bit characters, much less variable length ones.
_________________________
Bitt Faulk
#318117 - 14/01/2009 00:09
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
I've tried gsub("\u2019","'") for the curved single quote (number is the unicode hex value) but no success. I read about that on some page but my version of awk reports it's treated as a normal u, and not an escape for a unicode hex string.
I have yet to try \x and specify individual bytes.
I've verified that [:alnum:] is working for the characters that were caught by a-zA-Z0-9. But it's not working for accented characters.
I don't know if it's because the text is butchered before getting to awk or what. But awk is likely not seeing the unicode portions as multiple proper ascii characters, otherwise I'd have extra characters passing through instead of the accented ones simply being dropped. Arrgh.
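One way the "individual bytes" idea might look, as a sketch only. It assumes the input is UTF-8, where the curved single quote U+2019 is the byte sequence E2 80 99 (octal 342 200 231), and it uses octal escapes because those are part of standard awk string syntax. The sample text and variable names are invented for the example.

```shell
#!/bin/sh
# Hypothetical input containing U+2019, written as its three UTF-8 bytes.
input=$(printf 'It\342\200\231s')

# \342\200\231 is U+2019 as octal byte escapes; \047 is a plain apostrophe,
# which avoids closing the single-quoted awk program.
fixed=$(printf '%s\n' "$input" | awk '{ gsub("\342\200\231", "\047"); print }')

printf '%s\n' "$fixed"
```

This sidesteps the question of whether a given awk understands \u escapes, since it never asks awk to interpret a code point.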
#318118 - 14/01/2009 00:34
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Ok, I now have cygwin supporting UTF-8 to some extent using the following patch: http://www.okisoft.co.jp/esc/utf8-cygwin/
So I can at least see the UTF-8 characters properly in the console now. It says it will properly use UTF-8 for file IO as well. But the awk class I mentioned above is clearly dropping accented characters.
After saving the awk file itself as UTF-8 (NO BOM! which is important) I can now properly match on UTF-8 characters typed into the file, such as ’
Only the accented characters left to figure out...
Edited by hybrid8 (14/01/2009 00:40)
#318119 - 14/01/2009 01:14
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
As I mentioned, I'd love to ditch this whole thing and just write something in straight-up ansi-C which would likely be best since it wouldn't rely on external programs or interpreters like some other alternatives (though I'd need to link in some libraries). I could also go with PHP but that means installing a lot of stuff on the machine I don't otherwise have a need for. But PHP does have a lot of built-in functions to make the unicode and html entities more of a breeze.
It all also means redoing this whole thing, including the xml parsing bits which thus far I've just stolen from the existing script.
I think my best bet, since the input is relatively trust-worthy, is to just filter out the characters that are not allowed and just let everything else go. This is the opposite of what we've been discussing. Seems easy enough because there are only 9 characters not allowed in a Windows file name. And most of those were already included in the awk sample I posted.
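A rough sketch of that "remove only what Windows rejects" approach, under the assumption that deleting the characters outright is acceptable (the thread's awk script replaces some of them with alternates instead). It uses tr rather than awk purely to keep the example short; the title and variable names are made up.

```shell
#!/bin/sh
# Hypothetical title; the nine characters Windows rejects are \ / : * ? " < > |
title='What: Is/This?*'

# tr -d deletes every listed character and passes everything else through,
# so accented letters are left alone.
safe=$(printf '%s\n' "$title" | tr -d '\\/:*?"<>|')

printf '%s\n' "$safe"
```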
I wanted to give one last go at setting awk up for an alternate language, but no matter how much searching I've done I can't find out how to set the environment variables. Perhaps that information is so basic that no one talks about it. Lots of discussion about the variables, but I have no idea where to put them. Doesn't seem to work just stuffing them into the script.
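For what it's worth, environment variables are set in the shell that launches awk, not inside the awk program itself. A small sketch of the two usual forms follows. It uses LC_ALL=C only because that locale always exists; whichever UTF-8 locale name applies on a given system (check `locale -a`) is an assumption left to the reader.

```shell
#!/bin/sh
# Prefix form: the variable applies to this one awk invocation only.
seen=$(LC_ALL=C awk 'BEGIN { print ENVIRON["LC_ALL"] }')
printf '%s\n' "$seen"

# Export form: applies to every command run afterwards in this script.
# C is used because it always exists; substitute a UTF-8 locale listed by `locale -a`.
LC_ALL=C
export LC_ALL
awk 'BEGIN { print ENVIRON["LC_ALL"] }'
```

Printing ENVIRON["LC_ALL"] from inside awk is just a way to confirm the value actually reached the program.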
#318128 - 14/01/2009 02:14
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Most likely the version of awk that you have simply doesn't have support for Unicode at all.
Virtually every interpreted language you can think of using (Perl, PHP, Python) is going to have an XML parser. Given, that xmlstarlet program you're using is pretty handy.
_________________________
Bitt Faulk
#318136 - 14/01/2009 02:57
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Well, it's all working as well as it needs to be without changing the environment variables.
Patching cygwin for UTF-8 and saving the scripts as UTF-8 NO BOM allowed me to use UTF-8 characters. The xmlstarlet seems to pick up certain HTML entities and converts them automatically too. &amp; already comes back as &.
By using gawk to replace characters, including those that are invalid in filenames, I'm left with a string that will work for what I need (file names, folder names and maintaining full legibility).
If this were something I was going to distribute then I'd go about it differently. Or if it was something I needed to host remotely, I'd definitely do it in PHP. I did find a nice PHP function that parses an XML document into a nice (and potentially large) array.
Thanks a lot for your help guys! I don't think I would have been able to get through all this without it.
#318148 - 14/01/2009 10:12
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 18/01/2000
Posts: 5683
Loc: London, UK
|
I'm running the script from cygwin on a windows box You're on Windows? You're parsing XML? Use PowerShell. It supports XPath and XSLT (because it's .NET).
_________________________
-- roger
#318154 - 14/01/2009 13:07
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: Roger]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
I think since it's all working, I'll stick with BASH and avoid having to learn a completely different scripting platform. I also don't know jack about XPath and XSLT.
Which brings me to one additional problem. xmlstarlet is definitely handy, but I can't seem to figure out how to easily extract particular pieces of information. Here's a sample of the originating XML (somewhat formatted):
<?xml version="1.0" encoding="utf-8"?>
<records date="Tue, 13 Jan 2009 00:51:04 -0800">
  <movieinfo id="2898">
    <info>
      <title>12</title>
      <runtime>2:00</runtime>
      <rating>PG-13</rating>
      <studio>Sony Pictures Classics</studio>
      <postdate>2008-11-04</postdate>
      <releasedate>2009-03-04</releasedate>
      <copyright>© Copyright 2009 Sony Pictures Classics</copyright>
      <director>Nikita Mikhalkov</director>
      <description>12 characters. 12 truths. The story of 12 jurors discussing a verdict to pass on an 18 year old Chechen boy whether he is guilty of 1st degree murder of his step-father — an officer of the Russian army. The film thinks aloud about today;s life, about the need to hear the next of kin and help that person before its too late.
The action of the picture unveils in one room — a gym adjusted for jury deliberations.</description>
    </info>
    <cast>
      <name>Sergei Makovetsky</name>
      <name>Nikita Mikhalkov</name>
    </cast>
    <genre>
      <name>Drama</name>
      <name>Foreign</name>
    </genre>
    <poster>
      <location>http://images.apple.com/moviesxml/s/sony/posters/12_l200811041428.jpg</location>
      <xlarge>http://images.apple.com/moviesxml/s/sony/posters/12_xl200811041428.jpg</xlarge>
    </poster>
    <preview>
      <large filesize="55098535">http://movies.apple.com/movies/sony/12/12_a720p.mov</large>
    </preview>
  </movieinfo>
  <movieinfo id="2904">
    <info>
      <title>Angels & Demons</title>
      <runtime>1:10</runtime>
      <rating>Not yet rated</rating>
      <studio>Sony Pictures</studio>
      <postdate>2008-11-06</postdate>
      <releasedate>2009-05-15</releasedate>
      <copyright>© Copyright 2009 Sony Pictures</copyright>
      <director>Ron Howard</director>
      <description>The team behind the global phenomenon The Da Vinci Code returns for the highly anticipated Angels & Demons, based upon the bestselling novel by Dan Brown. Tom Hanks reprises his role as Harvard religious expert Robert Langdon, who once again finds that forces with ancient roots are willing to stop at nothing, even murder, to advance their goals. Ron Howard again directs the film, which is produced by Brian Grazer, Ron Howard, and John Calley. The screenplay is by Akiva Goldsman and David Koepp. When Langdon discovers evidence of the resurgence of an ancient secret brotherhood known as the Illuminati - the most powerful underground organization in history - he also faces a deadly threat to the existence of the secret organization’s most despised enemy: the Catholic Church. When Langdon learns that the clock is ticking on an unstoppable Illuminati time bomb, he jets to Rome, where he joins forces with Vittoria Vetra, a beautiful and enigmatic Italian scientist.
Embarking on a nonstop, action-packed hunt through sealed crypts, dangerous catacombs, deserted cathedrals, and even to the heart of the most secretive vault on earth, Langdon and Vetra will follow a 400-year-old trail of ancient symbols that mark the Vatican’s only hope for survival. </description>
    </info>
    <cast>
      <name>Tom Hanks</name>
      <name>Ewan McGregor</name>
      <name>Ayelet Zurer</name>
      <name>Stellan Skarsgård</name>
      <name>Pierfrancesco Favino</name>
    </cast>
    <genre>
      <name>Drama</name>
      <name>Thriller</name>
    </genre>
    <poster>
      <location>http://images.apple.com/moviesxml/s/sony_pictures/posters/angelsdemons_l200811061144.jpg</location>
      <xlarge>http://images.apple.com/moviesxml/s/sony_pictures/posters/angelsdemons_xl200811061144.jpg</xlarge>
    </poster>
    <preview>
      <large filesize="32237184">http://movies.apple.com/movies/sony_pictures/angelsanddemons/angelsanddemons-tlr1_a720p.mov</large>
    </preview>
  </movieinfo>
</records>
As you can see I've shown the sample using only two movies to keep it short. What I'd like to do is extract a single movie and throw it into its own XML file. The output would look pretty much as above except containing only one specific movie. However, I also want to preserve the "records" every time I do this. I've already played around with some extraction and can handily extract all movies or specific elements/attributes of all movies, but I need to target in on one at a time either by matching on their ID or by specifying an integer corresponding to the nth movie (I can figure out how to get a total count as well as keep a running total in the script to reference the "current" movie)
#318168 - 14/01/2009 19:28
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Ok, I've mostly solved this myself with a bunch of overhead and definitely not very much finesse. But so far I'm getting the datapoints I need and in a format I can use to extract the final bit I want. I had to continue to use other facilities from the bash script, including awk and some hard-coded echos since I was unable to figure out how to use only xmlstarlet. Only one issue left...
SELECTMOVIE=`echo $TRAILERS | awk 'BEGIN { FS = "--DIVIDER--" } ; { print $7 }'`
How do I substitute that "$7" in the print statement with a variable? EDIT: Solved.
movieNum=7
SELECTMOVIE=`echo $TRAILERS | awk -v record=$movieNum 'BEGIN { FS = "--DIVIDER--" } ; { print $record }'`
The -v argument allows you to pick up external variables and assign them to variables within awk. I suppose since I'm here I might also pose another question which could be useful to me. xmlstarlet has a formatting option which takes in a file, cleans it up and spits it out to stdout like so: You can obviously redirect that to a file if you'd like. I'm looking for a way to supply it with XML from a variable in bash without having to first save out the contents of the var to a file. What I need is to have a single file in the end that has been properly formatted. I suppose I can save out a temporary file, format it out to the final file and then delete the temporary one. I was just hoping there was some way I could save that step or at least something less manual.
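A self-contained version of the -v trick above, as a sketch. The three-record string is a made-up stand-in for the real $TRAILERS data, and it uses $( ) and a quoted variable instead of backticks, which behaves the same here but is easier to nest.

```shell
#!/bin/sh
# Hypothetical stand-in for the real data: three records joined by the divider.
TRAILERS='first--DIVIDER--second--DIVIDER--third'
movieNum=2

# -v copies the shell value into the awk variable "record" before the program runs;
# $record then means "the field whose number is stored in record".
SELECTMOVIE=$(printf '%s\n' "$TRAILERS" |
  awk -v record="$movieNum" 'BEGIN { FS = "--DIVIDER--" } { print $record }')

printf '%s\n' "$SELECTMOVIE"
```

With movieNum=2 this prints the middle record, "second".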
Edited by hybrid8 (14/01/2009 19:58)
#318178 - 14/01/2009 22:03
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Unix CLI tradition is to accept input from stdin if a filename argument is given as "-". So, I don't know that this works, but it's worth trying "echo $XMLDATA | xml format -". You might also try it with no filename argument at all: "echo $XMLDATA | xml format".
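A sketch of the piping idea, with one caveat about quoting. FORMATTER is a hypothetical placeholder that defaults to plain cat so the example runs anywhere; whether the real formatter accepts stdin is exactly the untested assumption in the suggestion above. The XML snippet is invented.

```shell
#!/bin/sh
# Stand-in for the real formatter: defaults to cat so the sketch runs anywhere.
# Swap in whatever command reads stdin on your system (an assumption, not verified here).
FORMATTER=${FORMATTER:-cat}

XMLDATA='<records>
  <movieinfo id="1"/>
</records>'

# Quoted: newlines and indentation inside the variable reach the formatter intact.
formatted=$(printf '%s\n' "$XMLDATA" | $FORMATTER)

# Unquoted: the shell splits on whitespace and echo rejoins with single spaces.
flattened=$(echo $XMLDATA)

printf '%s\n' "$formatted"
printf '%s\n' "$flattened"
```

Redirecting the pipeline's output (adding > final.xml after the formatter) would write the final file directly, with no temporary file in between.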
_________________________
Bitt Faulk
#318185 - 14/01/2009 22:25
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Well, it turns out this isn't working once I put it all into the existing loop. It keeps hanging and I thought it was the IFS... So removed it and replaced the for with a more traditional for using a counter. This seems to let it loop longer. I'm testing without the wgets to the server and just pulling the XML data. I can loop a full count of 86 items without an issue if I'm just tossing the data into a variable as I pasted above. This was causing a hang before with the other loop. But now if I put back the code to echo the contents of that variable to files (a different file for every pass of the loop) it hangs again well before completing. It seems to hang at a random point - a different pass each time I run it. Am I looping too fast and trying to create too many files too quickly? I just tried with a "sleep 1" after the file output and the first time it got through 13 passes of the loop. The second time 27 passes. This is getting really annoying.
#318187 - 14/01/2009 22:51
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Arrgh, now I'm getting really pissed at this... Ok, I thought I had just seen something interesting, but it seems just like more randomness. It will hang even without the redirect to a file. I managed to get all passes done in the loop while outputting to the console (plain echo as shown above) but on future runs it hangs at random points.
#318202 - 15/01/2009 16:11
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31597
Loc: Seattle, WA
|
it hangs at random points. Is it pulling the data live from the web each time through the loop? Maybe just TCP timeouts.
#318203 - 15/01/2009 16:13
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: tfabris]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Not TCP timeouts because I'm currently testing by pulling the data from a local file. And the data is only pulled at the beginning of the script to create variables that hold it for later use.
This is super frustrating...
#318204 - 15/01/2009 16:31
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31597
Loc: Seattle, WA
|
Maybe something is wrong with the particular bash interpreter version installed on that machine? Something about your script exercising the interpreter's memory manager in an odd way.
#318205 - 15/01/2009 17:02
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: tfabris]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
I'm going to post the script in a little while, maybe some people will be kind enough to give it a shot to see if they also experience the same issue. I'll include the local file as well so it doesn't have to hit the network.
#318206 - 15/01/2009 17:47
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Here's a cut-down version of the script which omits the parts that actually download the trailers but includes the parts that are causing the issue, the capturing of XML data on a per-movie basis. You'll have to define the paths at the top of the file appropriate for your system. SAVEPATH is where movie folders are created and movie data is saved if those lines are uncommented. SCRIPTPATH is the location of this script and the awk cleaner script (included in the attached zip file along with another copy of this main script and the XML data file). There are some lines commented out right now. The way it is now it will grab the metadata from the XML file into three variables. The TRAILERS variable is composed of select values from the XML file separated by ";field" and then each movie is separated by ";x545hwx1" - this is done so it's easy to pull values for individual movies when looping. If you refer to the original script you can see that the last separator used to be a plain newline, but for testing purposes I've gone to something more specific. The next variable, recordDATE contains only a single string pulled from the very top of the XML file containing the date the XML file was created. Then we have the movieFields variable which contains ALL fields for every movie. Individual fields are not separated by special markers, but each movie is separated by "--DIVIDER--" - this is the variable I use to pull the full metadata for each movie which I'd like to save out to a proper XML file, also for each movie (the line of code to do this is below and commented out). The movie data just mentioned is currently output to the console. If you comment out the echo line and instead enable the one above it, it will instead put the data into files, one per movie folder. You'll also have to enable the line that creates the movie name folders. This thing will hang for me at random times running this way or running with file creation. 
When I first started testing today it would complete the whole thing without a problem. I must have done it like 10 times in a row to the console as well as a few times outputting to files. Then it started hanging again at random points. No code was changed during these tests.
#!/bin/bash
movieRow=0
BEXTENSION=".trailer.mov"
GET1080p=0
GETPOSTER=1
SAVEPATH="v:/Movies/zzztrailertest/"
SCRIPTPATH="d:/AppleTrailers/"
FEEDURL="d:/AppleTrailers/current_720p.xml"
TRAILERS=`xml sel --net -D -T -t -m "/records/movieinfo"\
-v "@id" -o ";field"\
-v "info/title" -o ";field"\
-v "info/postdate" -o ";field"\
-v "preview/large" -o ";field"\
-v "poster/xlarge"\
-o ";x545hwx1"\
$FEEDURL`
recordDATE=`xml sel --net -D -T -t -m "/records"\
-v "@date"\
$FEEDURL`
movieFields=`xml sel --net -I -E utf-8 -t -m "/records/movieinfo"\
-c "."\
-o "--DIVIDER--"\
$FEEDURL`
for movieCOUNTER in `seq 1 86`; do
#sleep 1
MOVIE=`echo $TRAILERS | awk -v line=$movieCOUNTER 'BEGIN { FS = ";x545hwx1" } ; { print $line }'`
MOVIEID=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $1 }'`
MOVIETITLE=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $2 }'`
MOVIETITLEFILE=`echo "$MOVIETITLE" | $SCRIPTPATH/filecleaner.awk`
POSTDATE=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $3 }'`
NEWPREVIEWNAME="$MOVIETITLEFILE$BEXTENSION"
POSTER=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $5 }'`
NEWPOSTERNAME="folder.jpg"
MOVIESAVEPATH="$SAVEPATH$MOVIETITLEFILE"
#mkdir "$MOVIESAVEPATH/"
selectedMovie=`echo $movieFields | awk -v movieRecord=$movieCOUNTER 'BEGIN { FS = "--DIVIDER--" } ; { print $movieRecord }'`
#echo -e "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r<records date=\"$recordDATE\">$selectedMovie</records>" >"$MOVIESAVEPATH/temp.xml"
echo -e "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r<records date=\"$recordDATE\">$selectedMovie</records>"
done
Attachments
AppleTrailers.zip (151 downloads)
Edited by hybrid8 (15/01/2009 17:50) Edit Reason: adding attachment
#318216 - 16/01/2009 02:45
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Okay, I've been working on this for a while, and I've tidied it up, but I'm still not clear on what it is you're trying to do.
That said, it never seemed to fail for me, but here it is with the awk script embedded, using tabs and newlines as delimiters, and generally cleaned up. Added some comments so you can understand what's going on.
Attachments
AppleTrailerTEST.bash (190 downloads)
_________________________
Bitt Faulk
#318217 - 16/01/2009 03:05
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Thanks Bitt, I'll take a look at it in the morning.
The whole thing will provide me with up-to-date trailers, poster art and information for my PVR.
The script segment I posted is just grabbing the particulars of each movie from Apple's feed. The rest of the script (which was working fine) downloads the trailers. The whole thing together when run on a schedule will download any new trailer posted to Apple's trailer site, along with the poster for that movie and also its related metadata. These three things will be saved in a folder with the movie's name. The metadata will at some point be parsed by another tool which creates usable information for my PVR.
The script was just found, minus the new stuff that I was adding for the xml metadata saving. I used the oddball field separators because I wasn't sure whether the XML feed had any newlines in it and also because at some point I had some troubles escaping newlines, but I can't remember what problem I had anymore.
#318221 - 16/01/2009 12:08
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
So far so good. I took a quick look at the script and then ran it to see if it was working. Needed only to change the xml program var at the top (since I have xmlstarlet in my PATH already).
Then to clean it up I had to change the character encoding for all the data extraction to UTF-8. ASCII uses only the lower 127 characters and won't support any accents nor the "smart" single and double quotes which may be present in the source data.
The loop structure that was in my sample was temporary only and in the full source I also had one which would only loop as many times as there were entries in the source XML.
Do you have any idea why the other version would hang? Could it be because of many (repeated) external calls to the awk file I had created?
Again, thanks for the brilliant help. You did an amazing job with the cleanup of the script. I could just understand the basics of what was going on in the original script, but I have no experience with bash syntax, so this is hugely appreciated.
Now I'm going to start adding back in the original trailer and poster downloading code and at the same time try to apply the same type of cleanup to those parts so they match what you've done.
I'll send an update when it's done and after I verify everything is working correctly I'll post it all again for review and of course for anyone that wants to use it.
#318224 - 16/01/2009 14:29
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Then to clean it up I had to change the character encoding for all the data extraction to UTF-8. ASCII uses only the lower 127 characters and won't support any accents nor the "smart" single and double quotes which may be present in the source data. That's for output. If you switch it back to ascii, you'll see that it understands the UTF8 input and prints &#nnnn; instead of the raw character, which is probably closer to where you need to be. I could be wrong, though.
_________________________
Bitt Faulk
#318228 - 16/01/2009 15:35
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
If I didn't take in utf-8 on the input (parse) then I'd have to introduce new code to utf-8-encode the data on output. Without an encode pass on output obviously we're left with the escaped multi-byte data as plain-text.
So as I was adding back the rest of the content I discovered a few important things about the feeds and some deficiencies in the original script (again, I didn't write the original).
The original script was parsing through two feeds: a 720p feed and a feed for normal size videos (usually up to 640 wide). Each feed contains a PREVIEW section which contains keys to represent different sizes for the movie files. Both feeds use only the "LARGE" key however. It would have been nice if Apple had only a single feed and then used the size key to simply specify all the different file sizes available.
Anyway, the script used the 720 feed to hard-code a filename substitution to try and guess a possible 1080 file. However the substitution was searching for "a720p.mov" and not all the included movies had that final extension. Some were specified as .m4v, but there are no 1080 alternates with an m4v extension on the server...
I thought about doing some better file extension handling but then after some poking around on Apple's site I discovered that all files included in the feed were ALSO available with the a720p.mov extension. At this point it started to look like I could ignore the extension in the feed. The standard feed seems to be a superset of the 720 and, as of today, contained a couple of movies not included on the 720. All movies in the standard feed end in h640w.mov
I've changed the script to make it only look at the standard feed and from that to make substitutions for the correct extensions to obtain the HD files.
I've duplicated the original 1080 conditional two times so the script can handle every different size uniquely, falling back from the highest to the lowest (vars allow enabling/disabling specific HD sizes): 1080p > 720p > 480p > standard 640w
I can improve the logic but my biggest concern was to make sure the file capturing was working properly. From a functional point of view I'll have to keep watch to find out if files are ever introduced into the regular feed with an extension other than h640w.mov. If so, I'll have to include some logic instead of simple substitution to create the other filenames (otherwise it will never try to grab an HD version and will instead grab the standard version and simply name the output file with the HD filename).
Next post will contain the script.
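A possible shape for the "logic instead of simple substitution" mentioned above, as a sketch only. hd_url is a hypothetical helper that is not part of the posted script, and the URLs are made up. It swaps the h640w.mov suffix for an HD one and returns a failure status when the suffix is absent, so the caller can fall back instead of silently saving the standard file under an HD name.

```shell
#!/bin/sh
# Hypothetical helper, not part of the posted script: swap the standard-feed suffix
# for an HD one, and fail instead of guessing when the suffix is missing.
hd_url() {
  case $1 in
    *h640w.mov) printf '%s\n' "${1%h640w.mov}$2" ;;
    *) return 1 ;;
  esac
}

# Made-up URLs for illustration only.
hd_url 'http://example.com/movies/foo_h640w.mov' 'h1080p.mov'
hd_url 'http://example.com/movies/bar_h640w.m4v' 'h1080p.mov' || echo 'no HD name derived'
```

For a fixed suffix like this, % and %% strip the same text, so either form of the expansion works.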
#318229 - 16/01/2009 15:58
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
NEWER Apple Downloader Script.
#!/bin/bash
# Download Movie trailers from Apple - downloads a single trailer per movie in Apple's trailer XML feed
# specify whether or not to get HD trailers - download priority is 1080p > 720p > 480p > standard 640 wide
GET1080p=0
GET720p=0
GET480p=1
GETPOSTER=1
FEEDS="http://www.apple.com/trailers/home/xml/current.xml"
#for local testing specify a file instead of hitting the net for the feed
#FEEDS="./Apple640Trailers.xml"
# define programs
XMLSTARLET='xml'
AWK='gawk'
# hard-coded file extension for saved videos
# ideally we'd preserve the extension of the original movie file and only add the ".trailer" before it
# (in case there's anything other than .mov)
BEXTENSION=".trailer.mov"
# save location for the individual trailer folders
SAVEPATH="v:/Movies/zzztrailertest/"
#save path for the tracker file below
DLDBPATH="./"
# text file to keep track of completed downloads to prevent getting the same trailer the next time script runs
tail -5000 $DLDBPATH.downloaded.db > $DLDBPATH.downloaded.db.tmp
mv $DLDBPATH.downloaded.db.tmp $DLDBPATH.downloaded.db
# this cleans passed content of characters that are invalid for Windows filenames and some which are valid but unwanted
FILECLEANER_AWK='
{
## some html escapes:
gsub("&gt;",">")
gsub("&lt;","<")
gsub("&quot;","\"")
gsub("”","\"")
gsub("„","\"")
gsub("‘","\"")
gsub("’","\"")
gsub("‚",",")
gsub("&amp;","\\&")
## replace fancy "smart" quotes with straight equivalents
gsub("’","'"'"'")
gsub("‘","'"'"'")
gsub("“","\"")
gsub("”","\"")
gsub("„","\"")
gsub("„","\"")
## backquote to apostrophe
gsub("`","'"'"'")
## double quote to apostrophe
gsub("\"","'"'"'")
## select illegal filename characters replaced by alternates
gsub(">",")")
gsub("<","(")
gsub("[:]"," - ")
gsub("[/]","-")
## backslash to dash
gsub("\\\\","-")
gsub("[?]","")
gsub("[|]","-")
gsub("[*]","+")
## double space to single space (we may have created a double space in a previous substitution)
gsub("  "," ")
## sanitize the rest:
## gsub("[^- '"'"'[:alnum:] _$+&={}\\[\\]()%@!;,.]*","")
gsub("^[[:blank:]]*", "")
gsub("[[:blank:]]*$", "")
## dump it to stdout
print
}
'
# main loop - passes once per feed specified above
for FEEDURL in $FEEDS; do
# set of partial movie metadata - only the fields we need for downloading, saving & tracking the video/image files.
IFS=$'\n' TRAILERS=(`$XMLSTARLET sel --net -E utf-8 -D -T -t -m "/records/movieinfo" \
-v "@id" -o '	' \
-v "info/title" -o '	' \
-v "info/postdate" -o '	' \
-v "preview/large" -o '	' \
-v "poster/xlarge" --nl \
$FEEDURL 2>/dev/null`)
# complete set of movie metadata to be saved out one file per video later - one record per line
IFS=$'\n' movieFields=(`$XMLSTARLET sel --net -E utf-8 -t -m "/records/movieinfo" \
-c "." \
--nl \
$FEEDURL 2>/dev/null`)
recordDATE=`$XMLSTARLET sel --net -D -T -t -m "/records" \
-v "@date" \
$FEEDURL 2>/dev/null`
# individual feed loop - passes once per movie in feed
count=-1
for MOVIE in "${TRAILERS[@]}"; do
# bash (and ksh and zsh) can do math this way
count=$(($count+1))
# notice I set the delimiter with an argument instead of in a BEGIN
MOVIEID=`echo $MOVIE | $AWK -F'\t' '{ print $1 }' 2>/dev/null`
MOVIETITLE=`echo $MOVIE | $AWK -F'\t' '{ print $2 }' 2>/dev/null`
# giving the script as an argument instead of a file containing the script
MOVIETITLEFILE=`echo "$MOVIETITLE" | $AWK "${FILECLEANER_AWK}"`
POSTDATE=`echo $MOVIE | $AWK -F'\t' '{ print $3 }' 2>/dev/null`
# web path to the video file referenced in the feed xml
PREVIEW=`echo $MOVIE | $AWK -F'\t' '{ print $4 }' 2>/dev/null`
# filename substitutions to allow getting HD versions of the referenced file
# HARD CODED - need logic if referenced names have extensions other than "h640w.mov"
PREVIEW1080p=${PREVIEW%%h640w.mov}h1080p.mov
PREVIEW720p=${PREVIEW%%h640w.mov}a720p.mov
PREVIEW480p=${PREVIEW%%h640w.mov}h480p.mov
# web path to the poster file
POSTER=`echo $MOVIE | $AWK -F'\t' '{ print $5 }' 2>/dev/null`
# new local filename to save poster file
NEWPOSTERNAME="folder.jpg"
# added braces around the variable names for clarity
MOVIESAVEPATH="${SAVEPATH}${MOVIETITLEFILE}"
# create a folder for the downloaded files (using the movie's cleaned name)
if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
mkdir -p $MOVIESAVEPATH
fi
# save the trailer's XML data to its own file within the trailer's folder
echo -e "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<records date=\"$recordDATE\">${movieFields[$count]}</records>" >$MOVIESAVEPATH/temp.xml
# reformat the XML to make it human-readable
$XMLSTARLET format "$MOVIESAVEPATH/temp.xml" >"$MOVIESAVEPATH/description.xml"
rm "$MOVIESAVEPATH/temp.xml"
# get and save a 1080p (1920x...) resolution video file
if [ "$GET1080p" -eq "1" ]; then
if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
# new local filename to save video file
NEWPREVIEWNAME="${MOVIETITLEFILE} [1080p]${BEXTENSION}"
wget -c -O "$MOVIESAVEPATH/$NEWPREVIEWNAME" $PREVIEW1080p; PREVIEWOUT1080p=$?
if [ $PREVIEWOUT1080p -eq 0 ]; then
echo "###$MOVIEID.PREVIEW $NEWPREVIEWNAME" >> $DLDBPATH.downloaded.db
else
echo "##### ID:$MOVIEID URL:$PREVIEW1080p FAILED -- TRYING NEXT LOWER SIZE"
fi
fi
fi
# or get and save a 720p (1280x...) resolution video file
if [ "$GET720p" -eq "1" ]; then
if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
# new local filename to save video file
NEWPREVIEWNAME="${MOVIETITLEFILE} [720p]${BEXTENSION}"
wget -c -O "$MOVIESAVEPATH/$NEWPREVIEWNAME" $PREVIEW720p; PREVIEWOUT720p=$?
if [ $PREVIEWOUT720p -eq 0 ]; then
echo "###$MOVIEID.PREVIEW $NEWPREVIEWNAME" >> $DLDBPATH.downloaded.db
else
echo "##### ID:$MOVIEID URL:$PREVIEW720p FAILED -- TRYING NEXT LOWER SIZE"
fi
fi
fi
# or get and save a 480p (848x...) resolution video file
if [ "$GET480p" -eq "1" ]; then
if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
# new local filename to save video file
NEWPREVIEWNAME="${MOVIETITLEFILE} [480p]${BEXTENSION}"
wget -c -O "$MOVIESAVEPATH/$NEWPREVIEWNAME" $PREVIEW480p; PREVIEWOUT480p=$?
if [ $PREVIEWOUT480p -eq 0 ]; then
echo "###$MOVIEID.PREVIEW $NEWPREVIEWNAME" >> $DLDBPATH.downloaded.db
else
echo "##### ID:$MOVIEID URL:$PREVIEW480p FAILED -- TRYING STANDARD SIZE"
fi
fi
fi
# or get and save the standard (640x...) resolution video file as referenced in the XML feed
if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
# new local filename to save video file
NEWPREVIEWNAME="${MOVIETITLEFILE}${BEXTENSION}"
wget -c -O "$MOVIESAVEPATH/$NEWPREVIEWNAME" $PREVIEW; PREVIEWOUT=$?
if [ $PREVIEWOUT -eq 0 ]; then
echo "###$MOVIEID.PREVIEW $NEWPREVIEWNAME" >> $DLDBPATH.downloaded.db
else
echo "##### ID:$MOVIEID URL:$PREVIEW FAILED -- RETRY NEXT RUN"
fi
else
echo "##### Trailer ID:$MOVIEID NAME:$MOVIETITLE MARKED DONE -- SKIPPING"
fi
# get and save the movie poster image
if [ "$GETPOSTER" -eq "1" ]; then
if ! grep -q "###$MOVIEID.POSTER" $DLDBPATH.downloaded.db; then
wget -c -O "$MOVIESAVEPATH/$NEWPOSTERNAME" $POSTER; POSTEROUT=$?
if [ $POSTEROUT -eq 0 ]; then
echo "###$MOVIEID.POSTER $NEWPOSTERNAME" >> $DLDBPATH.downloaded.db
else
echo "##### ID:$MOVIEID URL:$POSTER FAILED -- RETRY NEXT RUN"
fi
else
echo "##### Poster ID:$MOVIEID NAME:$MOVIETITLE MARKED DONE -- SKIPPING"
fi
fi
done
done
Edited by hybrid8 (16/01/2009 17:46) Edit Reason: Included Bitt's string-replace changes plus fixed missing title on output of SKIPPED messages
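For anyone who wants to try the filename cleaning on its own, here is a reduced sketch. It is not the full cleaner from the script: clean_title is a hypothetical helper name, only a handful of the substitutions are included, and it assumes any POSIX awk rather than gawk specifically.

```bash
# clean_title is a hypothetical helper, not part of the script above
clean_title() {
    printf '%s\n' "$1" | awk '{
        gsub(":", " - ")       # colon to spaced dash
        gsub("[/|]", "-")      # slash and pipe to dash
        gsub("[?]", "")        # drop question marks
        gsub("  +", " ")       # collapse runs of spaces
        gsub("^[ \t]+", "")    # trim leading blanks
        gsub("[ \t]+$", "")    # trim trailing blanks
        print
    }'
}

clean_title "Star Trek: What/If?"
```

With that input the colon substitution leaves a double space behind, which is why the collapse step has to run after it; the result should be "Star Trek - What-If".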
#318230 - 16/01/2009 16:18
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Change this:
PREVIEW1080p=`echo $MOVIE | $AWK -F'\t' '{ print $4 }' |sed 's/h640w\.mov$/h1080p.mov/g'`
to:
PREVIEW1080p=${PREVIEW%%h640w.mov}h1080p.mov
That saves about four new processes. I think you can figure out the syntax for the other two. Oh, and you don't need to save $? to an intermediate variable.
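A minimal sketch of that suffix substitution, using a made-up URL (the example.com address and the BASE and OTHER variables are assumptions for illustration only). Because the pattern has no wildcards, % and %% behave the same here.

```bash
# Hypothetical URL in the shape the feed uses (ends in h640w.mov)
PREVIEW="http://example.com/trailers/some_movie-h640w.mov"

# ${VAR%%pattern} strips the longest match of pattern from the end of $VAR
BASE=${PREVIEW%%h640w.mov}
PREVIEW1080p="${BASE}h1080p.mov"
PREVIEW720p="${BASE}a720p.mov"
PREVIEW480p="${BASE}h480p.mov"

echo "$PREVIEW1080p"
echo "$PREVIEW720p"
echo "$PREVIEW480p"

# If the name does not end in h640w.mov, nothing is stripped and the
# new suffix is simply appended, so the result is not a usable URL
OTHER="http://example.com/trailers/other.m4v"
echo "${OTHER%%h640w.mov}h1080p.mov"
```

The last line shows the limitation of hard-coding the suffix: a file named differently would need its own handling.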
Edited by wfaulk (16/01/2009 16:20)
_________________________
Bitt Faulk
#318231 - 16/01/2009 17:26
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Hehe. That was a pretty awkward way to work the substitution, I'll admit. Even though I didn't know how to do it the way you just suggested, I should have used the existing PREVIEW variable instead of parsing through the array again. Serves me right for too much copy paste.
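On the subject of not parsing the same record over and over, one possible approach (a sketch only, not something from the original script) is to let the shell's read builtin split the tab-separated record once. The sample record below is invented. One caveat: tab counts as whitespace for IFS, so an empty field in the middle of a record would collapse and shift the later fields.

```bash
# Build a hypothetical record with the same five tab-separated fields
TAB=$(printf '\t')
MOVIE="1234${TAB}Some Movie${TAB}2009-01-16${TAB}http://example.com/t-h640w.mov${TAB}http://example.com/poster.jpg"

# IFS is changed only for this one read, so the rest of the script is unaffected
IFS="$TAB" read -r MOVIEID MOVIETITLE POSTDATE PREVIEW POSTER <<EOF
$MOVIE
EOF

echo "id=$MOVIEID title=$MOVIETITLE date=$POSTDATE"
echo "preview=$PREVIEW"
echo "poster=$POSTER"
```

That replaces five echo-plus-awk pipelines with a single builtin call and no new processes.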
With regard to the $?, are you talking about the exit status that comes back from wget? That part is unchanged from the original script I found. I'd prefer to put the wget within the IF itself, but again, bash is just completely unintuitive to me. It's absolutely nothing like C, Basic, Pascal or PHP - stuff I've used and learned over the years.
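For reference, a sketch of what putting the command inside the if looks like. In the shell, if tests the exit status of whatever command follows it, so no $? or intermediate variable is needed. To keep this runnable without touching the network, fetch is a hypothetical stand-in for wget -c -O "$1" "$2", and the URLs and the temporary tracker file are made up.

```bash
# fetch is a hypothetical stand-in for: wget -c -O "$1" "$2"
fetch() {
    # succeed only for URLs ending in .mov, to simulate a failed download
    case "$2" in
        *.mov) return 0 ;;
        *)     return 1 ;;
    esac
}

DLDB=$(mktemp)   # temporary stand-in for .downloaded.db

for URL in "http://example.com/a-h480p.mov" "http://example.com/missing"; do
    # the command's exit status is tested directly by if
    if fetch "/tmp/out.mov" "$URL"; then
        echo "###DONE $URL" >> "$DLDB"
    else
        echo "##### URL:$URL FAILED -- RETRY NEXT RUN"
    fi
done

RESULT=$(cat "$DLDB")
rm -f "$DLDB"
echo "$RESULT"
```

In the real script the same shape would be if wget -c -O "$MOVIESAVEPATH/$NEWPREVIEWNAME" $PREVIEW; then ... fi, with the existing echo lines moved into the two branches.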