Using a regexp to tokenize?

Posted by: Dylan

Using a regexp to tokenize? - 16/04/2004 10:15

Is there a way to write a regular expression such that it will produce a match for each of a series of repeating tokens?

I'm trying to parse a query string into a set of name,value matches. These are my test input cases:

a = b
&a=b&
a=b& c=d
a =b&c=d&
a=b&c= d&e=f

I'd like to produce matches of (a,b) or (a,b,c,d) or (a,b,c,d,e,f). It's obvious to a human what the desired matches are.

The following regexp will match against the first token of each example and give the correct matching substrings.

[& ]*([^= ]*)[ ]*=[ ]*([^&= ]*)[& ]*

But is there a way to make the pattern iterate over the entire input and return multiple matches? I could write controlling logic to walk through the input string, but it would be easier if I could make the regexp engine do it.

Thanks.
Posted by: siberia37

Re: Using a regexp to tokenize? - 16/04/2004 10:36

I don't think the regexp can do all the work in the way you're thinking, but why not use a replacing regular expression to replace all the matches with commas "," and then enclose the resulting string in parentheses?
Posted by: canuckInOR

Re: Using a regexp to tokenize? - 17/04/2004 02:16

What language are you using? This is fairly trivial using perl:
        perl -e'while(<>){@m=/(\w+)\s*=\s*(\w+)/g;print "@m\n"}'
It gets a bit trickier (but not a whole lot) if you want to allow more than word characters (alphanumerics and underscore) as your a and b:
        @m=/\s*([^&=]+?)\s*=\s*([^&]+)/g
And, if you're sure you'll always get pairs, then you can assign that to a hash, as well, so you automatically have key => value pairs.

edit: Note... the while() is just looping over stdin, so you can input your test cases, not looping your regex over the input. That part is taken care of by the /g modifier.
Posted by: Dylan

Re: Using a regexp to tokenize? - 17/04/2004 10:02

My God. Is that a Martian dialect?

I'm writing this in C using libpcre as the regexp engine. In isolation, it wouldn't be difficult to write the surrounding logic to walk through the string, evaluating the regexp one token at a time. But for various long and boring reasons it would fit into our app better if it could all be controlled by run time regexp configuration.
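
For the record, the controlling logic I mentioned is only a few lines with the plain pcre_compile()/pcre_exec() API: you restart the match at the end offset of the previous one, which is roughly what Perl's /g modifier does for you. Here's a rough sketch (no real error handling; the pattern and test string are taken from my first post):

#include <stdio.h>
#include <string.h>
#include <pcre.h>

int main(void)
{
    const char *pattern = "[& ]*([^= ]*)[ ]*=[ ]*([^&= ]*)[& ]*";
    const char *subject = "a=b& c=d";           /* one of the test cases */
    const char *error;
    int erroffset;
    int ovector[9];        /* 3 ints per group: whole match + 2 captures */
    int offset = 0;
    int len = (int)strlen(subject);

    pcre *re = pcre_compile(pattern, 0, &error, &erroffset, NULL);
    if (re == NULL) {
        fprintf(stderr, "compile failed at %d: %s\n", erroffset, error);
        return 1;
    }

    while (offset < len) {
        int rc = pcre_exec(re, NULL, subject, len, offset, 0, ovector, 9);
        if (rc < 0)                /* PCRE_ERROR_NOMATCH, or a real error */
            break;

        /* ovector[2..3] and ovector[4..5] hold the two capture groups */
        printf("(%.*s,%.*s)\n",
               ovector[3] - ovector[2], subject + ovector[2],
               ovector[5] - ovector[4], subject + ovector[4]);

        /* continue from the end of this match; bump by one if the match
           was empty so the loop can't get stuck */
        offset = (ovector[1] > offset) ? ovector[1] : offset + 1;
    }

    pcre_free(re);
    return 0;
}

Each iteration prints one (name,value) pair. Link with -lpcre.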

Thanks for the reply, though. I really should become facile with Perl. There are so many times I find myself needing a quick scripting solution, but I end up using C because it's what I'm comfortable with.
Posted by: andy

Re: Using a regexp to tokenize? - 18/04/2004 03:13

> My God. Is that a Martian dialect?

You should see what people can do with Perl when they are trying to write obfuscated code:

@P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
@p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord
($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&&
close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print

http://perl.plover.com/obfuscated/
Posted by: wfaulk

Re: Using a regexp to tokenize? - 18/04/2004 09:49

That one's way too obvious. It's obviously got source material in there, just backwards.
Posted by: andy

Re: Using a regexp to tokenize? - 18/04/2004 09:56

It might be obvious what it does, but it is far from obvious how it does it.

It spawns separate processes to print each single character of the message; to synchronise the processes it opens pipes between them and tracks the state of each one.

Something like that anyway.
Posted by: wfaulk

Re: Using a regexp to tokenize? - 18/04/2004 09:59

I didn't even begin to try to figure that out. I just prefer the ones where there's no obvious source material at all, so that it seems to generate output from nowhere.