Java Regular Expressions
Simon P. Chappell writes "Regular expressions (regex to their friends) are an incredibly powerful addition to most programmers' personal toolkit of techniques. Programming using a language that doesn't support them can be frustrating if you need to do any amount of non-trivial string handling. Java was just such a language until the release of the 1.4.x series. Sure, there were libraries like ORO that would provide regex support, but it wasn't built in and not many companies allow the use of 3rd party libraries. With version 1.4.x, the corporate Java developer in the trench, received the power of regular expression pattern matching." Read the rest of Simon's review.
Java Regular Expressions | |
author | Mehran Habibi |
pages | 255 (7 page index) |
publisher | Apress |
rating | 8/10 |
reviewer | Simon P. Chappell |
ISBN | 1590591070 |
summary | A great starter for using regular expressions in Java |
The book seems targeted at readers who have solid Java programming skills but have not yet used the java.util.regex package. I see two types of Java programmers who might not have used the regex package: those who do not know about regular expressions, and those who know them but have not yet used them within Java. This book should satisfy both sets of users. The first group will benefit from the general introduction to regular expressions and the gentle introduction to using them within Java. The latter group will benefit from the more advanced material in the book.
The book is nicely structured and progresses easily through its subject matter. The first chapter is an introduction to regular expressions. While this is most obviously for the readers new to the subject, it will be useful for those more experienced, because not all regex engines are created equal and this chapter lays out the particular dialect of regular expressions used by the Java 1.4.x regex engine. The second chapter introduces the object model used by java.util.regex. This gives detailed explanations of the Pattern and Matcher objects as well as the new regular expression methods added to the standard String class.
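To make the object model a little more concrete, here is a minimal sketch of Pattern, Matcher and the regex-aware String methods. It is not taken from the book: the class name RegexBasics, the helper method and the sample strings are made up for illustration, and it uses generics, which arrived after Java 1.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the book.
final class RegexBasics {
    // Compile once; a Pattern is immutable and can be reused.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    // Collect every run of ASCII letters found in the input.
    static List<String> words(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(words("Java 1.4 added java.util.regex"));
        // Convenience methods on String that take a regex:
        System.out.println("2004-03-15".matches("\\d{4}-\\d{2}-\\d{2}")); // true
        System.out.println("a1b22c".replaceAll("\\d+", "#"));             // a#b#c
        System.out.println("x, y,z".split("\\s*,\\s*").length);           // 3
    }
}
```

The String methods compile their pattern on every call, so a shared Pattern is usually the better fit for code that runs often.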
The third chapter takes the reader into advanced regular expressions. While there is much that can be done using just the Pattern and Matcher objects, the path to the full power of regex travels through an understanding of groups (and subgroups) and qualifiers. Regex groups are hard to explain until you've seen them in action, whereupon you may find yourself wondering how you ever managed without them. Mr. Habibi does an excellent job, both explaining them and introducing us to the unusual noncapturing subgroups. (I'd never heard of these before.) Qualifiers are the other side of the same coin. While it's one thing to define a group, whether it's expected and whether it's to be captured, it's equally important to be able to describe the expected occurrence of those groups using qualifiers.
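A small sketch may help show what groups, noncapturing subgroups and qualifiers (more commonly called quantifiers) look like in practice. The class GroupDemo, the time-of-day pattern and the tag-stripping helper are assumptions chosen for illustration, not examples from the book.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the book.
final class GroupDemo {
    // Group 1 captures the hour and group 2 the minute.
    // (?:...) groups the optional seconds without capturing them.
    // {1,2}, {2} and ? say how often each piece may occur.
    private static final Pattern TIME =
            Pattern.compile("(\\d{1,2}):(\\d{2})(?::\\d{2})?");

    // Returns {hour, minute}, or null if the whole input is not a time.
    static String[] hourAndMinute(String input) {
        Matcher m = TIME.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    // Number of capturing groups in TIME; the noncapturing one is not counted.
    static int capturingGroups() {
        return TIME.matcher("").groupCount();
    }

    // Greedy ".+" grabs as much as it can; reluctant ".+?" as little as it can.
    static String stripTags(String html, boolean greedy) {
        String regex = greedy ? "<.+>" : "<.+?>";
        return html.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String[] parts = hourAndMinute("9:05:30");
        System.out.println(parts[0] + " / " + parts[1]);
        System.out.println(capturingGroups());
        System.out.println("[" + stripTags("<b>x</b>", true) + "]");
        System.out.println("[" + stripTags("<b>x</b>", false) + "]");
    }
}
```

The last two lines show why the choice of qualifier matters: the greedy form swallows the text between the tags, while the reluctant form removes only the tags themselves.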
Chapter four tackles the interesting challenges of using regex in an object-oriented language. Mr. Habibi describes the general principles of regex use as similar to those used with SQL through the JDBC interface. These principles are optimising connections, batching reads and writes, storing patterns externally, Just In Time compilation of patterns, and remembering that not every piece of String handling code needs to be written as a regex. All very useful advice.
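One way to read the "Just In Time compilation" advice is to compile each pattern the first time it is needed and then reuse it. The sketch below is my own interpretation rather than the book's code; PatternCache and its methods are hypothetical names, and it relies on Java 8 features.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical helper, not from the book: compile on first use, then reuse.
final class PatternCache {
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Returns the cached Pattern for this regex, compiling it only once.
    static Pattern get(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    // Example caller: the regex text could just as well come from a config file.
    static boolean looksLikeYear(String s) {
        return get("\\d{4}").matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeYear("2004"));
        System.out.println(get("\\d{4}") == get("\\d{4}")); // same instance reused
    }
}
```

Pattern objects are safe to share between threads; Matcher objects are not, so each caller creates its own.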
Chapter five is the big examples chapter. All of the examples are intended to be practical; the kind of thing you might have to address at the day job. With examples covering Zip codes, telephone numbers, dates, searching text files and even validating an EDI document, he seems to have delivered on that assertion. There are further examples in Appendix C, if the aforementioned patterns aren't enough.
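To give a flavour of that kind of day-job validation, here is a minimal sketch of a US ZIP code check. The pattern is a common textbook form (five digits, optionally followed by a hyphen and four more) and is an assumption on my part, not the pattern the book builds; the class name ZipCodes is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical validator, not the book's example.
final class ZipCodes {
    // Five digits, optionally followed by "-" and four more digits (ZIP+4).
    private static final Pattern ZIP = Pattern.compile("\\d{5}(?:-\\d{4})?");

    // True only if the whole string is a ZIP or ZIP+4 code.
    static boolean isZip(String s) {
        return s != null && ZIP.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isZip("84101"));
        System.out.println(isZip("84101-1234"));
        System.out.println(isZip("84101-12"));
    }
}
```

A check like this only confirms the shape of the input; it says nothing about whether the code is actually assigned.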
The writing and progression of material are good. The examples are very well thought out and explained. Many of the examples are built from first principles. Mr. Habibi seems to want to not only teach you how to use regular expressions, but also how to design them. He does this by working up from an understanding of the data until he has a working regex.
While it doesn't make any promises about being an encyclopedia of regex patterns, this book does contain enough of the normal business patterns to be a useful initial reference work, before turning to the Internet to search for patterns.
If you want an encyclopedic reference work on regex, then buy Jeffrey Friedl's Mastering Regular Expressions, which is published by O'Reilly. This is not that book, preferring to stick with the practical usage of regex.
This is a great starter book for developers who are new to using regular expressions in Java.
You can purchase Java Regular Expressions from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
When speed matters (Score:4, Informative)
However, like many things in computer science, speed gains come at a price. In this case, the regular expression language supported is not quite as rich as the JDK implementation.
Re:When speed matters (Score:4, Insightful)
(I know, let the flames commence!)
Comment removed (Score:5, Informative)
Re:When speed matters (Score:2)
I've got a dual Athlon MP2000+, and Azureus still is horribly slow compared to everything else I run on it.
Re:When speed matters (Score:2)
Re:When speed matters (Score:2)
Thanks for your insight.
Re:When speed matters (Score:2)
Re:When speed matters (Score:1, Informative)
And, to promote object-oriented programming, the printf functionality is all located in a final class, so you can't inherit the printf functionality in any other class! Instead you have to wrap another object! Yay object oriented design!
One of my favorite features coming up in Java 6 is the support for sc
Re:When speed matters (Score:5, Informative)
Here is the class I assume the parent is referencing: Formatter class [sun.com].
Essentially, you don't have C-style varargs; the JRE silently creates an array for you when you pass the arguments. This doesn't waste "gobs of heap space" like the parent says; it uses the same amount as it would using the stack. Remember, these are objects, and Java never passes objects by value -- always by reference. So each argument wastes one machine word (usually 32 bits). Whoop de fucking doo. And, since it uses references, the only allocation/deallocation is the temporary array. And in 1.5, if not previous versions, this is very very fast. With a JIT compiler you'll hardly notice it. I do agree that the decision to make the class "final" is shitty, but honestly, I don't see how subclassing it would be a huge advantage. It would be like subclassing the java.lang.String class. Sure, you could add some nifty stuff, but it's not a big deal.
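For anyone who wants to see the array for themselves, here is a minimal sketch. VarargsDemo and its two helper methods are made-up names for illustration; String.format is the standard library call.

```java
// Hypothetical demo: inside a varargs method the arguments are a plain array.
final class VarargsDemo {
    // Number of arguments actually passed.
    static int count(Object... args) {
        return args.length;
    }

    // Runtime type of the parameter, to show that it really is an array.
    static String typeOf(Object... args) {
        return args.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(count("a", 1, 2.0));                   // 3
        System.out.println(count(new Object[] { "a", "b" }));     // 2, an explicit array works too
        System.out.println(typeOf("a"));                          // Object[]
        System.out.println(String.format("%s has %d items", "cart", 3));
    }
}
```

The array holds references only, so the temporary allocation is one slot per argument plus the array header (primitives such as 1 and 2.0 are boxed first).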
As a person who earns his living off of J2EE, I know its strengths and weaknesses. I am not a fanboy, however. I am more than willing to give Java hell when it deserves it. I think string handling in general is not as well-organized or easy to use as it could be, but it is certainly capable. I rarely use sprintf() style string formatting anyway, even in C++. I find it much easier to use iostreams, which are typesafe and almost as fast as sprintf(). In Java I just use string concatenation, and the formatting classes when I need it. It isn't perfect, but it works well enough and sure isn't slow.
Regular Expression? (Score:2, Funny)
Starbucks Employee: That'll be an hour's wages please.
Me: Thanks!
That's when you get to see my java regular expression.
Generally it will be me wincing in pain because I just burned my tongue. Sometimes, if it's cooled enough, you'll hear a quiet "MmmMmmm" in the style of Family Guy's Herbert.
Re:Regular Expression? (Score:2)
Recursion? (Score:2)
Re:Recursion? (Score:5, Interesting)
When this is run on some text like the following happens: we see our opening paren, so all is well. Then we see some things which are not parens (lambda ) and all is still well. Now we see (, which definitely is a paren. Our first alternative fails, we try the second alternative. Now it's finally time to interpolate what's inside the double-secret operator, which just happens to be $paren. And what does $paren tell us to match? First, an open paren - ooh, we seem to have one of those handy. Then some things which are not parens, such as x, and then we can finish this part of the match by matching a close paren. This polishes off the sub-expression, so we can go back to looking for more things that aren't parens, and so on.
Re:Recursion? (Score:4, Informative)
Re:Recursion? (Score:2)
Re:Recursion? (Score:2, Funny)
KFG
Re:Recursion? (Score:2, Informative)
By definition, Regular Expressions are limited to regular languages [wikipedia.org], thus can be expressed by Finite Automata [wikipedia.org]. This prohibits them from supporting recursion, but generally makes them easy to optimize.
Re:Recursion? (Score:4, Informative)
Context-free languages may have recursion. They use a state machine (pushdown automaton) and a stack to recognize their languages.
http://en.wikipedia.org/wiki/Context-free_languag
This also contains links to other families of language and info on the automaton that can recognize them.
Welcome to Theory of Computing!
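Since java.util.regex has no recursive construct like the Perl one discussed upthread, the usual fallback in Java is to keep the stack yourself. Here is a minimal sketch of a balanced-bracket check with an explicit stack; the class name BalancedBrackets and the choice of bracket characters are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: the stack a pushdown automaton would use, written by hand.
final class BalancedBrackets {
    // True if every ( and [ is closed by the matching bracket in the right order.
    static boolean isBalanced(String text) {
        Deque<Character> stack = new ArrayDeque<>();
        for (char c : text.toCharArray()) {
            if (c == '(' || c == '[') {
                stack.push(c);                       // remember the opener
            } else if (c == ')' || c == ']') {
                if (stack.isEmpty()) {
                    return false;                    // closer with nothing open
                }
                char open = stack.pop();
                if ((c == ')' && open != '(') || (c == ']' && open != '[')) {
                    return false;                    // wrong kind of closer
                }
            }
            // every other character is ignored
        }
        return stack.isEmpty();                      // nothing may stay open
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("(lambda (x) (f x))"));
        System.out.println(isBalanced("(a [b) c]"));
    }
}
```

The unbounded stack is exactly what a finite automaton lacks, which is why a true regular expression cannot do this for arbitrary nesting depth.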
Re:Recursion? (Score:2)
Theory of Computing (Score:2)
Re:Recursion? (Score:2)
Perhaps you should have read that page more closely. Or maybe taken a class in theory of computation.
Regular grammars are *not* the same thing as regular *languages*, which are what is under discussion here.
First off, it is true that regular *grammars* can express context-free *languages*. Of course, this also means that they can express regular la
Re:Recursion? (Score:2)
Wrong way round (Score:2, Interesting)
Er, no. It is only for trivial string handling that the regex approach is useful.
For non-trivial string handling (particularly if you feel like giving the authors of erroneous strings helpful error messages!!) I'll write a proper lexical analys
Re:Wrong way round (Score:2)
You can outfit a regexp functor with error message handling, or exceptions, and if your project is embedded (certainly not trivial) or performance-dependent, I'm not sure that I'd write a lex/parser "every time". I guess it boils down to this: "trivial string handling" is semantic nonsense.
Re:Wrong way round (Score:3, Informative)
Re:Wrong way round (Score:4, Insightful)
Re:Wrong way round (Score:2)
However, if you're just doing vanilla text parsing with data that's not overly complex, regexs are an absolute godsend, and are far easier to use than a full lexer/parser package.
Re:Wrong way round (Score:2)
Re:Wrong way round (Score:2)
Not many companies allow 3rd party libraries? (Score:3, Funny)
Re:Not many companies allow 3rd party libraries? (Score:3, Informative)
Re:Not many companies allow 3rd party libraries? (Score:2)
My main complaint (Score:5, Informative)
My main complaint about java regexps is that all the backslashes have to be quoted with a backslash, making them completely unreadable compared to a language that supports regular expressions natively, like perl (no, a standard library is not technically native support). "\d" becomes "\\d" and so forth. Does anyone know a simple way around this? We just started using java regexp's at work, so the extra backslashes don't bother most people, but they are extremely annoying to those of us with a lot of perl experience.
P.S. How many slashdotters thought they'd be rolling in their graves by the time they heard an example of where perl is more readable than java?
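For readers who have not run into this yet, a minimal sketch of the doubling. It does not make the backslashes go away; it only shows what the compiler does with them, plus Pattern.quote (added in Java 5, so not available in 1.4) for the case where the text should be matched literally. EscapeDemo and containsLiteral are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo of string-literal escaping versus regex escaping.
final class EscapeDemo {
    // The source literal "\\d+" is the three-character regex \d+
    static final String DIGITS = "\\d+";

    // Pattern.quote makes the engine treat needle as literal text,
    // so regex metacharacters in it need no escaping of their own.
    static boolean containsLiteral(String haystack, String needle) {
        return Pattern.compile(Pattern.quote(needle)).matcher(haystack).find();
    }

    public static void main(String[] args) {
        System.out.println(DIGITS.length());                     // 3
        System.out.println("room 42".replaceAll(DIGITS, "N"));   // room N
        System.out.println("a+b".matches("a+b"));                // false: + is a quantifier here
        System.out.println(containsLiteral("a+b", "a+b"));       // true
        System.out.println(containsLiteral("C:\\temp\\a.b", "\\temp\\a.b"));
    }
}
```

Pattern.quote only removes the second layer of escaping (the regex engine's); the first layer (the Java compiler's) is still there, which is the part the complaint is about.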
Re:My main complaint (Score:5, Funny)
I'm still amazed to find 'readable' and 'regular expressions' in the same context.
Re:My main complaint (Score:4, Interesting)
In general, C#'s regular expression package is very nice, except for the whole "groups" and "captures" thing.
Re:My main complaint (Score:4, Informative)
Actually, Python's literal strings are NOT """. Triple quotes (""") are for multiline strings (' and " only accept one-line strings or backslash line-continuation escapes). Literal Python strings are raw strings, created by prefixing any string (be it ', " or """) with the "r" character (as in r"this is a raw string", but "this is not").
Re:My main complaint (Score:2, Insightful)
effing java.
Re:My main complaint (Score:1)
Re:My main complaint (Score:5, Informative)
Backslashes in a .properties file have to be escaped with (guess what?) a backslash.
So it, unfortunately, solves nothing.
If you don't mind XML, you can use the XML properties format, but you're still adding a lot of extra code just so you don't have to deal with escape characters. There's, unfortunately, no good solution in Java. (There are no raw strings in Java.)
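A minimal sketch of the .properties behaviour, for anyone who wants to check it. The key name zip and the class PropertiesEscapes are made up for illustration; the file content is fed in through a StringReader instead of a real file (Properties.load(Reader) needs Java 6 or later).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical demo: how java.util.Properties treats backslashes in values.
final class PropertiesEscapes {
    // Parses fileText as the contents of a .properties file and returns one value.
    static String load(String fileText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // File line:  zip=\\d{5}   (doubled in the file, as it would be in Java source)
        System.out.println(load("zip=\\\\d{5}\n", "zip"));
        // File line:  zip=\d{5}    (single backslash: silently dropped by the parser)
        System.out.println(load("zip=\\d{5}\n", "zip"));
    }
}
```

So the regex still has to be written with doubled backslashes; the doubling just moves from the source file to the properties file.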
Re:My main complaint (Score:3, Informative)
For example, the regex defined here: becomes simply "foo.bar.regex", just like a standard properties file.
Re:My main complaint (Score:2)
1. Beating something that's already dead.
2. Using an Apache-licensed software package and creating an external file dependency to solve the fact that your language doesn't support raw strings.
Re:My main complaint (Score:2)
Re:My main complaint (Score:2)
You're asking about Java regexps, but similar problems extend to other languages where the syntax, features and usage are different enough so that anyone with a basis in Perl is similarly annoyed, if not dumbfounded by the awkwardness and limitations. Any systems administrator will tell you
Re:My main complaint (Score:2)
It may be helpful, I haven't tried it. Would be particularly interesting to see if it'll correctly convert, say "\t" into "\\t" instead of a TAB. If it does, then you could use it to wrap the strings for the regexp pattern methods.
Re:My main complaint (Score:2)
Re:My main complaint (Score:2, Interesting)
Pattern foo = Pattern.compile("c:/foo/bar".replace('/','\\'));
or just put the above in a library method that does it automatically:
Pattern foo = PatternUtils.compile("c:/foo/bar");
which is handy if other replacements are made by that library method also:
Pattern foo = PatternUtils.compile("({number}):{number}:({ident
Re:My main complaint (Score:2)
http://eclipse-plugins.2y.net/eclipse/rating_details_plugin.jsp?plugin_id=964 [2y.net]
A good idea is to include the regular expressions in a comment as well. Creating and testing a regular expression takes most of the time anyway. If you really hate the escaped regular expressions, just put them in a resource (e.g.
Re:My main complaint (Score:2)
I've done this with C on Windows when I had one library that borked whenever you tried to use / in pathnames.
Pick unicode characters for your special strings, e.g. . Next, map some handy keystroke to that in your editor. Then write a script to replace that with a standard Java string. Since it's not standard java, give it a special extension and add the script and extension to your makefile or ant or whatever you use.
Perl (Score:2)
Having said that I really don't see why you have to devote a complete book on regex. A small tutorial does just fine.
Re:Perl (Score:2)
> have to devote a complete book on regex.
> A small tutorial does just fine
I think it depends on how deep you want to go into regular expressions. Mastering Regular Expressions [oreilly.com] by Jeffrey Friedl is almost 500 pages but is an excellent treatment of the subject - by the time you're done reading it you'll feel comfy even with such madness as negative lookbehind.
Microsoft and regex (Score:4, Interesting)
Back when my only experience was development on Windows I was very frustrated with the lack of good string handling in Microsoft languages (VB, T-SQL). If you didn't find a third-party library you had to write a lot of expensive code to do fancy string searches. Try writing recursion in VB6 without bringing your computer to a screeching halt.
Then when I switched to linux and open source I was shocked to learn that something as useful as regex had already been around for many years. Most of the Windows developers I knew never even heard of it. It was tricky to learn but has paid off many times over in utility.
Every developer is better off for knowing it. Even if they never use regex, the thought process in understanding it is quite interesting and educational.
Re: (Score:2)
What? (Score:3, Interesting)
Who's boneheaded enough to do this? I want to know so I can avoid buying anything from them, because their products are going to be overpriced by at least 50% due to the wasted effort.
I can understand restricting third-party libraries to those of a certain license, like BSD or LGPL, but a blanket ban without any exceptions for something as essential as regular expressions? That's just stupid.
One of the biggest advantages of Java is the enormous number of high-quality third-party libraries available.
Is this just something the submitter dreamed up to fill space, or do companies actually do this?
Re:What? (Score:2)
... that make up for the lack of high-quality useful first-party packages.
Re:What? (Score:2)
Re:What? (Score:3, Funny)
Re:What? (Score:2)
It's DLL Hell [wikipedia.org] all over again. Every time you use a third-party library, the user has to make sure it's installed. And in the classpath, unless they installed it as roo
Re:What? (Score:2)
Guess what? Any ClassLoader is required to query its parent before it attempts to resolve a class. And no, you can't get around it (for security reasons which become obvious if you think about it).
If a particular version is already in the class path at JVM startup, you can't override it.
Re:What? (Score:2)
Re:What? (Score:2)
But the government systems to which I referred are not J2EE environments (yet).
I also agree that J2EE and manifests are not difficult to use, but that doesn't seem to be the prevailing opinion among most of the other developers I meet.
Wha-wha-what? (Score:2)
Who are these companies and what can possibly be their justification for such a blanket policy. I can understand for some ultra-high security/uptime systems with incredibly strict standards and processes who would need to put third party code through an extensive and expensive audit. But for the rest of us? No jUnit? log4j? Is Boost allowed? Good lord, I can't imagine programming in such a world.
I hope I never work for one of these firms.
Re:Wha-wha-what? (Score:2)
Re:Wha-wha-what? (Score:2)
Taft
Fear! (Score:2)
Re:Fear! (Score:2)
Re:Wha-wha-what? (Score:2, Insightful)
Who are these companies and what can possibly be their justification for such a blanket policy.
Actually there are a number of firms that contain multitudes of red tape that disable their employees from getting anything done without the barest of tools. I have witnessed major separations of "church and state" with these larger companies. This includes the company that did not allow the developers access to the servers, resulting
"not many companies allow ... 3rd party libraries" (Score:2)
regex coach (Score:4, Informative)
Re:regex coach (Score:2, Informative)
CL-PPCRE [weitz.de] - portable Perl-compatible regular expressions for Common Lisp
A library which typically outperforms Perl's own regex engine.
RegEx not so maintainable... (Score:3, Interesting)
As I get older, my code has gotten more and more straightforward, because I consider the maintenance cycle of code to be more than 95% of the puzzle. And these days, I have more than one security analyst who is not a senior software engineer poking around my code.
RegEx's are not-so-readable and not-very-maintainable programming abstracts that should be avoided whenever possible. I prefer using string manipulation abstraction classes (such as my own version of StringTokenizer). They are not as fast and furious as other methods like lexical analysis, and the code is more bloated, but the code is Straight Forward And Easy To Read. There is a power in code of this nature, and my clients have thanked me more than once for not focusing on writing "cool code" but on writing "clean and simple" code. I just tried to paste in a few ugly regex samples, but slashdot blocked me, calling them "junk characters". I agree!
For example, take XPATH, this is a clean and simple way to address XML objects. Sure, there is an additional level of abstraction, but you can look at an XPATH query, even from a layman's point of view, and have a clear understanding as to what it is doing.
Re:RegEx not so maintainable... (Score:2)
If a regex isn't quickly comprehensible to you, either a) the regex is badly written, or b) you need more practice with regex's.
Seriously, it's very rare for me to come across a regex I'm unable to comprehend. And for more complex ones, Perl certainly allows you to intersperse the regex with comments (I don't recall if Java allows this, though it does support a significant subset of Perl re
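On the open question above: Java does have an equivalent, the Pattern.COMMENTS flag, which ignores whitespace in the pattern and treats # to end of line as a comment. A minimal sketch follows; the phone-number pattern and the class name CommentedRegex are assumptions made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of a commented regex using Pattern.COMMENTS.
final class CommentedRegex {
    private static final Pattern PHONE = Pattern.compile(
            "\\(?(\\d{3})\\)?  # area code, parentheses optional\n"
          + "[-.\\s]?          # optional separator\n"
          + "(\\d{3})          # exchange\n"
          + "[-.\\s]?          # optional separator\n"
          + "(\\d{4})          # line number",
            Pattern.COMMENTS);

    // Returns the area code, or null if the whole input is not a phone number.
    static String areaCode(String input) {
        Matcher m = PHONE.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(areaCode("(801) 555-1234"));
        System.out.println(areaCode("801.555.1234"));
        System.out.println(areaCode("555-1234"));
    }
}
```

Because literal whitespace is ignored in this mode, a space that should actually be matched has to be written as an escape such as \s.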
Re:RegEx not so maintainable... (Score:2, Insightful)
java.util.regex speed sucks (Score:2)
Re:java.util.regex speed sucks (Score:2)
Re:java.util.regex speed sucks (Score:2)
3rd party libraries (Score:2)
Why regular expressions... (Score:2)
Re:Why regular expressions... (Score:2)
It's not difficult. It's impossible. Perhaps you should start off by using the right tool for the right job.
So what you want... (Score:2)
You could always build your own regular expression compiler. It's not unheard of. But I submit that the "language" is small enough that it's not worth it.
Re:So what you want... (Score:2)
Re:Why regular expressions... (Score:2)
It's in fact impossible in true regular expressions since it requires you to maintain a stack.
> Yet there have been languages that have advanced string matching capabilities around since the 60's (start looking at Snobol -- which is still alive -- and some of its descendants).
Advanced matching is coming in Perl6 (which is runnable right now, http://www.pugscode.org./ [www.pugscode.org] Along
Re:Why regular expressions... (Score:2)
Yes -- the point was that a regular expression doesn't handle such things as a searching for balanced parentheses. However even old Snobol had the facility for dealing with balanced parentheses without getting into full grammars and parsers
Rapid Java Regex Prototyping (Score:2)
Re:Rapid Java Regex Prototyping (Score:3, Insightful)
Re:Rapid Java Regex Prototyping (Score:2)
Nope. But I develop spam filter rules all the live long day. These sometimes demand 10 or so very hairy regexes (zero-width assertions and all) all fire in conjunction, then they have to be tweaked slightly to work whenever the spam mutates slightly. You have no idea how convenient it is to have a tool like Pattern Sandbox that will light up the matches when you incrementally tweak a rule expression so y
Topical plug: Regex Powertoy (Score:2)
Great things about the Java 1.4+ regex support, from my perspective, include that (1) it's nearly as full-featured as Perl's regexes (and thus far better than Javascript's); and (2) it's usable in web browsers and via embedded applets.
Those were both key to helping me create Regex Powertoy [powertoy.org], an interactive visual regex tester, much like others mentioned in this discussion -- but fully implemented in a browser. It's in JavaScript and DHTML, with a Java applet for the full-featured and step-controlled regex m
255 pages about Java regex? (Score:2)
and now, to celebrate... (Score:2)
And now to celebrate this new-found ability to manipulate strings easily:
s/trench,/trench/;
Ah, I knew that would make me feel better.
Re:and now, to celebrate... (Score:2)
http://java.sun.com/j2se/1.5.0/docs/api/java/util
Re:and now, to celebrate... (Score:2)
Actually I meant like correcting the punctuation error with the newfound power of regular expressions. The fact that I used a sed-like (or perl-like) expression was just incidental and was only because that's the syntax I knew off the top of my head.
Third party packages (Score:2)
not many companies allow the use of 3rd party libraries
I assume the review author hasn't worked for many companies then. I have yet to find any company that doesn't use third party packages. Logging, XML parsing and unit testing are just the first three things that spring to mind when I consider what might require a third party package. As for the "DLL hell" that someone alleges in a post to this thread, it's virtually nonexistent. You ship the third party packages with your application (as a single JAR
Re:Java sucks (Score:1, Flamebait)
Re:Java sucks (Score:2, Informative)
"Some String".replaceAll("Java", "Bloated piece of shit")
And FYI PatternSyntaxException is a runtime exception so no need to catch it and rethrow as a RuntimeException.
so to write it your way:
String theTruth(String s){
return Pattern.compile("Java").matcher(s).replaceAll("Bloated piece of shit");
}
Re:Java sucks (Score:4, Funny)
Oh, I think you're hardly being fair to Java - your example was artificially bloated. I can easily do this in one line in Java:
Runtime.getRuntime( ).exec( "perl -e 'sub theTruth($) { shift; $_ =~ s/Java/Not so bad now/; return $_; }" );
I think you owe Java an apology.
Communicating to an external process... (Score:2)
You fork, then dup2 the child's STDIN to the "far end" of the former pipe,
then you dup2 the child's STDOUT onto the "far end" of the latter pipe.
Finally, you exec() in your child.
You hold onto the two near ends and use them as separate Input/Output streams for control.
You're going to need to:
1) Catch SIGPIPE for when the spawned process closes its reading end of the pipe.
2) Catch SIGCHLD so you know when the proces
Re:Java sucks (Score:5, Informative)
Try: Still not as compact but at least there aren't any tildes in there. I wonder if there would be a more compact way to do it. This seems terribly heavy weight for such a simple example. Oh, wait! There is! So now we compare: To: So the Java code ends up being a handful of characters longer and much easier to read. I'm not saying that Java is the ideal Regex language, but your example sucked.
Re:Java sucks (Score:2)
First, take out that ($) prototype. Perl doesn't use them that way. In Perl, a prototype is not for the same purpose as they are in other languages. They're for type coercion between scalar and array contexts; in this case you're saying "if they give me an array like a stupid git, please coerce it into a scalar context for me, thanks." If they pass a 26-element array, coercing it to a scalar context ends up giving you a numeri
Re:Java sucks (Score:2, Insightful)
You could have said also that the Fire Department sucks because they are not good at catching burglars, or that the Police Department is full of losers b
Boost? No thanks (Score:2)
Re:Boost? No thanks (Score:2)
developed on Unix. If other OSes don't support various portions of it then that's a failing on their part, but on OSes that do support it there's no reason to use Boost unless you really like obfuscated code.