Ranter
Join devRant
Do all the things like
++ or -- rants, post your own rants, comment on others' rants and build your customized dev avatar
Sign Up
Pipeless API
From the creators of devRant, Pipeless lets you power real-time personalized recommendations and activity feeds using a simple API
Learn More
Comments
-
Is this the parsing a language through regex? I read something on SO where someone wanted to parse XML with regex. Someone chimed in on how regex was not able to do that for some technical reasons. Something about the type of expressions regex can and cannot parse. I didn't really underestand the reasoning.
-
@Ranchu I'm a sick, sick person, clearly. I keep saying I'll go back to being a normal person when I have a job again, but I'm honestly not sure if that's true.
-
@Demolishun Okay, so the idea is a user passes in a regular expression, then I convert that regular expression to a string, parse it with regex to categorize the different parts of it and add them to an object with the index of each match as the key on the object. After building the object, I'd then go through the object's values in order of key to generate a string describing what the regular expression is matching.
.... I really need to find a better way to explain it. :D -
@Demolishun it's fairly straightforward - regular expressions are "linear", they have no "memory". They cannot remember past information and act on it.
XML has recursive structures, if you consider <tag> bleh </tag>, by the time you get to the closing tag, you have to "remember" that you opened a tag and match that as now closed. You can parse the above with a regex, but if you have <tag> <tag> bleh </tag> </tag>, the parser now has to "remember" that you opened two tags and have to close two tags for a perfect match. Regexes can't do this. You can hardcode it to work with 2 tags, but then it wouldn't work for 1 or 3 or anything else that's not 2.
Even worse, take <tag> <innertag> bleh </innertag> </tag>. The parser now has to "remember" that it opened a tag first, then a innertag, so it should close a innertag first then a tag for a perfect match. Regexes just can't remember this much. They're "stupid" that way.
You can think of a regex matcher as a machine with a single cell as the memory (the "present"). The technical term for this is linear automaton. Something that matches recursive structures can be modelled by a similar machine but with a stack ("the past, but in order") for the memory which can grow as needed. The technical term for that is a pushdown automaton. On a massively simplified level, anything that is recursive needs a matcher with more "powerful" memory that can remember more extensive patterns (for this case a stack was enough, but if you want to generalize that, if you give the machine an infinitely large RAM stick ("the past") you essentially get the next level up, a Turing machine, which can match even more complicated languages). -
@RememberMe I should point out that I didn't explicitly show recursion in the examples. Consider what would happen if the "bleh" inside the tags was another set of XML tags. Even worse for the regex since it has to remember more now. Regexes can't match corresponding open-close, let alone ones done recursively. (Two tags is an example of recursion too, since it's an XML structure inside an XML structure).
-
@Ranchu I wasn't born with an instinctual knowledge of regular expressions, so yes. :P
-
i can only recommend retina (https://github.com/m-ender/retina) for larger regex based projects. Has some amazing features and deals with some of the problems described by @rememberme
Related Rants
Restarting regular expression parser from scratch has been good. I am somehow both much farther to completion and farther away from completion than I was in the earlier implementation.
Further in the sense that this implementation is going to be way more flexible to changes in the language
Farther in that I haven’t even got all of the regex parts added to the first stage yet.
But I’m feeing good about it.
Even if I did refactor it so my constants are in all caps and now feel like my core is yelling at me.
rant
regex
naming conventions are weird
hopeful