
Java Regular Expressions

Simon P. Chappell writes "Regular expressions (regex to their friends) are an incredibly powerful addition to most programmers' personal toolkit of techniques. Programming in a language that doesn't support them can be frustrating if you need to do any amount of non-trivial string handling. Java was just such a language until the release of the 1.4.x series. Sure, there were libraries like ORO that would provide regex support, but it wasn't built in and not many companies allow the use of 3rd party libraries. With version 1.4.x, the corporate Java developer in the trenches received the power of regular expression pattern matching." Read the rest of Simon's review.
Java Regular Expressions
author: Mehran Habibi
pages: 255 (7 page index)
publisher: Apress
rating: 8/10
reviewer: Simon P. Chappell
ISBN: 1590591070
summary: A great starter for using regular expressions in Java


The book seems targeted towards those who have a solid level of Java programming skills, but who have not yet used the java.util.regex package. I see two types of Java programmers who might not have used the regex package: those who do not know about regular expressions, and those who know them but have not yet used them within Java. This book should satisfy both sets of users. The first group will benefit from the general introduction to regular expressions and the gentle introduction to using them within Java. The latter group will benefit from the more advanced material in the book.

The book is nicely structured and progresses easily through its subject matter. The first chapter is an introduction to regular expressions. While this is most obviously for the readers new to the subject, it will be useful for those more experienced, because not all regex engines are created equal and this chapter lays out the particular dialect of regular expressions used by the Java 1.4.x regex engine. The second chapter introduces the object model used by java.util.regex. This gives detailed explanations of the Pattern and Matcher objects as well as the new regular expression methods added to the standard String class.
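As a rough illustration of the object model described above (not an example taken from the book), the sketch below compiles a Pattern once, walks the matches with a Matcher, and then calls the regex-aware methods on String. The class name RegexBasics and the sample strings are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexBasics {
    // A compiled Pattern is immutable, so it can be shared and reused.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    /** Collects every run of ASCII letters found in the input. */
    static List<String> words(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(input);   // a Matcher holds state for one input
        while (m.find()) {                 // advance to the next match
            found.add(m.group());          // text of the current match
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(words("regex, to their friends"));

        // Convenience methods on String that take a regex directly.
        System.out.println("2006-08-02".matches("\\d{4}-\\d{2}-\\d{2}"));
        System.out.println("a1b22c".replaceAll("\\d+", "#"));
        System.out.println(String.join("|", "a, b,c".split(",\\s*")));
    }
}
```

The String methods are convenient for one-off calls; an explicit Pattern is usually the better fit when the same expression is applied many times.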

The third chapter takes the reader into advanced regular expressions. While there is much that can be done using just the Pattern and Matcher objects, the path to the full power of regex travels through an understanding of groups (and subgroups) and qualifiers. Regex groups are hard to explain until you've seen them in action, whereupon you may find yourself wondering how you ever managed without them. Mr. Habibi does an excellent job, both explaining them and introducing us to the unusual noncapturing subgroups. (I'd never heard of these before.) Qualifiers are the other side of the same coin. While it's one thing to define a group, whether it's expected and whether it's to be captured, it's equally important to be able to describe the expected occurrence of those groups using qualifiers.
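To make groups concrete, here is a small sketch, not drawn from the book, that mixes capturing groups, a noncapturing group, and the repetition operators the review calls qualifiers (more commonly known as quantifiers). The phone-number format and the class name GroupsDemo are assumptions chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupsDemo {
    // (?:...)  groups without capturing, so it does not get a group number.
    // (...)    captures; groups are numbered by their opening parenthesis.
    // ? {3} {4} say how often the preceding element may occur.
    private static final Pattern PHONE =
        Pattern.compile("(?:\\+1[- ])?\\(?(\\d{3})\\)?[- ](\\d{3})-(\\d{4})");

    /** Returns the area code, or null if the whole input is not a phone number. */
    static String areaCode(String input) {
        Matcher m = PHONE.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    /** Number of capturing groups in the pattern (the noncapturing one is not counted). */
    static int capturingGroups() {
        return PHONE.matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(areaCode("+1 (555) 123-4567"));
        System.out.println(areaCode("555-123-4567"));
        System.out.println(capturingGroups());
    }
}
```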

Chapter four tackles the interesting challenges of using regex in an object-oriented language. Mr. Habibi describes the general principles of use of regex as similar to those used with SQL through the JDBC interface. These principles are the optimising of connections, batching reads and writes, storing patterns externally, Just In Time compilation of patterns and remembering that not every piece of String handling code needs to be written as a regex. All very useful advice.
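One possible reading of "Just In Time compilation of patterns" is to compile each expression the first time it is needed and reuse the compiled object afterwards. The sketch below shows that idea under that assumption; PatternCache is a hypothetical helper, not an API from the book or from the JDK.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

public class PatternCache {
    // Maps the regex source text to its compiled form.
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    /** Compiles the regex on first use and returns the cached Pattern afterwards. */
    static Pattern get(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    /** Whole-input match using the cached Pattern. */
    static boolean matches(String regex, String input) {
        return get(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("\\d+", "2006"));
        System.out.println(get("\\d+") == get("\\d+")); // same compiled object
    }
}
```

The regex strings themselves could come from a configuration file, which would also cover the "storing patterns externally" advice.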

Chapter five is the big examples chapter. All of the examples are intended to be practical; the kind of thing you might have to address at the day job. With examples covering Zip codes, telephone numbers, dates, searching text files and even validating an EDI document, he seems to have delivered on that assertion. There are further examples in Appendix C, if the aforementioned patterns aren't enough.
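For a flavour of this kind of day-job pattern, here is a minimal sketch of a US Zip code check. The pattern is an assumption written for this illustration (five digits with an optional four-digit extension); it is not the pattern used in the book.

```java
import java.util.regex.Pattern;

public class ZipCodes {
    // Five digits, optionally followed by a hyphen and four more digits.
    private static final Pattern ZIP = Pattern.compile("\\d{5}(?:-\\d{4})?");

    /** True only if the whole string has the shape of a Zip or Zip+4 code. */
    static boolean isZip(String s) {
        return ZIP.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isZip("12345"));
        System.out.println(isZip("12345-6789"));
        System.out.println(isZip("12345-678"));
    }
}
```

A check like this validates only the format; whether a code actually exists is a lookup problem, not a regex problem.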

The writing and progression of material are good. The examples are very well thought out and explained. Many of the examples are built from first principles. Mr. Habibi seems to want to not only teach you how to use regular expressions, but also how to design them. He does this by working up from an understanding of the data until he has a working regex.

While it doesn't make any promises about being an encyclopedia of regex patterns, this book does contain enough of the normal business patterns to be a useful initial reference work, before turning to the Internet to search for patterns.

If you want an encyclopedic reference work on regex, then buy Jeffrey Friedl's Mastering Regular Expressions, which is published by O'Reilly. This is not that book, preferring to stick with the practical usage of regex.

This is a great starter book for developers who are new to using regular expressions in Java.


You can purchase Java Regular Expressions from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

  • When speed matters (Score:4, Informative)

    by SIGALRM ( 784769 ) on Wednesday August 02, 2006 @04:22PM (#15834717) Journal
    there were libraries like ORO that would provide regex support, but it wasn't built in and not many companies allow the use of 3rd party libraries
    For those who can utilize third-party libs, consider evaluating this DFA/NFA automaton [brics.dk], a regexp package that is significantly faster than java.util.regex.

    However, like many things in computer science, speed gains come at a price. In this case, the regular expression language supported is not quite as rich as the JDK implementation.
  • My main complaint (Score:5, Informative)

    by kbielefe ( 606566 ) <karl.bielefeldt@ ... om minus painter> on Wednesday August 02, 2006 @04:39PM (#15834825)

    My main complaint about java regexps is that all the backslashes have to be quoted with a backslash, making them completely unreadable compared to a language that supports regular expressions natively, like perl (no, a standard library is not technically native support). "\d" becomes "\\d" and so forth. Does anyone know a simple way around this? We just started using java regexp's at work, so the extra backslashes don't bother most people, but they are extremely annoying to those of us with a lot of perl experience.

    P.S. How many slashdotters thought they'd be rolling in their graves by the time they heard an example of where perl is more readable than java?
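The doubling described above can be seen in a short sketch. It also shows Pattern.quote, which helps only when the text to match is meant literally; it does not remove the need to double backslashes in ordinary patterns. The class name and sample strings are invented for illustration.

```java
import java.util.regex.Pattern;

public class BackslashDemo {
    // The Java literal "\\d+" is the three-character regex \d+
    static final String DIGITS = "\\d+";

    // Matching one literal backslash needs the regex \\ and so four in source.
    static final String HAS_BACKSLASH = ".*\\\\.*";

    /** Searches for needle as plain text, with no regex metacharacters honoured. */
    static boolean containsLiteral(String haystack, String needle) {
        return Pattern.compile(Pattern.quote(needle)).matcher(haystack).find();
    }

    public static void main(String[] args) {
        System.out.println(DIGITS.length());                    // characters in the regex
        System.out.println("42".matches(DIGITS));
        System.out.println("C:\\temp".matches(HAS_BACKSLASH));
        System.out.println(containsLiteral("price: $5.00 (USD)", "$5.00"));
    }
}
```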

  • Re:Recursion? (Score:4, Informative)

    by addaon ( 41825 ) <addaon+slashdot.gmail@com> on Wednesday August 02, 2006 @04:54PM (#15834929)
    Of course, things like those presented are not regular expressions, no matter how loose perl might be with the term.
  • Re:Recursion? (Score:2, Informative)

    by Reverend528 ( 585549 ) on Wednesday August 02, 2006 @04:55PM (#15834940) Homepage
    I tried to do a bit of recursion in regexes once, like ((\d+)\.)+, but that didn't work.

    By definition, Regular Expressions are limited to regular languages [wikipedia.org], thus can be expressed by Finite Automata [wikipedia.org]. This prohibits them from supporting recursion, but generally makes them easy to optimize.
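The quoted post does not say how ((\d+)\.)+ failed. One plausible reading, assumed here, is that a repeated capturing group keeps only its last iteration. The sketch below shows that behaviour in java.util.regex and a common workaround of looping with find(); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroups {
    /** With a repeated group, group(2) holds only the final repetition. */
    static String lastCaptured(String input) {
        Matcher m = Pattern.compile("((\\d+)\\.)+").matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    /** Workaround: match one element at a time and collect each capture. */
    static List<String> allNumbers(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("(\\d+)\\.").matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lastCaptured("1.2.3."));
        System.out.println(allNumbers("1.2.3."));
    }
}
```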

  • Re:My main complaint (Score:4, Informative)

    by masklinn ( 823351 ) <slashdot.org@mCO ... t minus language> on Wednesday August 02, 2006 @04:59PM (#15834961)

    Actually, Python's literal strings are NOT """. The """ form is for multiline strings (' and " only accept one-line strings, or lines continued with a backslash escape). Literal Python strings are raw strings, created by prefixing any string (be it ', " or """) with the "r" character (as in r"this is a raw string", but "this is not").

  • by Anonymous Coward on Wednesday August 02, 2006 @05:00PM (#15834970)
    And var-args! But not in that sane way that just adds more data to the stack, instead it wastes gobs of heap space and requires allocation/deallocation because making real var-args might involve thought!

    And, to promote object-oriented programming, the printf functionality is all located in a final class, so you can't inherit the printf functionality in any other class! Instead you have to wrap another object! Yay object oriented design!

    One of my favorite features coming up in Java 6 is the support for scripting languages. It's getting added in exactly the same way regular expressions were: as an external library. Now, instead of having to waste 200MB on a JRE, you'll get to waste 300MB! Yay, Java!
  • Re:My main complaint (Score:5, Informative)

    by _xeno_ ( 155264 ) on Wednesday August 02, 2006 @05:02PM (#15834986) Homepage Journal

    Backslashes in a .properties file have to be escaped with (guess what?) a backslash.

    So it, unfortunately, solves nothing.

    If you don't mind XML, you can use the XML properties format, but you're still adding a lot of extra code just so you don't have to deal with escape characters. There's, unfortunately, no good solution in Java. (There are no raw strings in Java.)
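The point about .properties files can be checked with a short sketch. It feeds the file text in through a StringReader so that it stays self-contained; the key name "digits" and the class name are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PropsDemo {
    /** Parses fileText as a .properties file and returns the value for key. */
    static String load(String fileText, String key) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(fileText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty(key);
    }

    public static void main(String[] args) {
        // File text:  digits=\\d+   (backslash doubled inside the file)
        System.out.println(load("digits=\\\\d+", "digits"));
        // File text:  digits=\d+    (single backslash is dropped by the parser)
        System.out.println(load("digits=\\d+", "digits"));
    }
}
```

In the first case the loaded value is the intended regex \d+; in the second the backslash is silently discarded and the value becomes d+.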

  • by Canthros ( 5769 ) on Wednesday August 02, 2006 @05:03PM (#15834997)
    It does, however, simplify the legal mess involved.
  • Re:Recursion? (Score:4, Informative)

    by Anonymous Coward on Wednesday August 02, 2006 @05:08PM (#15835036)
    Regular expressions are only for regular languages. They are the simplest type of language and use a simple state machine (automaton) to do their language recognition.
    Context free languages may have recursion. They use a state machine (pushdown automaton) and a stack to recognize their languages.
    http://en.wikipedia.org/wiki/Context-free_language [wikipedia.org]
    This also contains links to other families of language and info on the automaton that can recognize them.
    Welcome to Theory of Computing!
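A classic language that is context-free but not regular is the set of strings with balanced parentheses. As a minimal sketch of the pushdown idea, the recognizer below tracks nesting depth; with only one bracket type a counter can stand in for the stack, which is a simplification assumed here. The class name is hypothetical.

```java
public class Balanced {
    /** True if every '(' is closed by a later ')' and nothing closes too early. */
    static boolean isBalanced(String s) {
        int depth = 0;                       // plays the role of the stack height
        for (char c : s.toCharArray()) {
            if (c == '(') {
                depth++;                     // push
            } else if (c == ')') {
                if (--depth < 0) {
                    return false;            // pop on an empty stack
                }
            }
        }
        return depth == 0;                   // accept only with an empty stack
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("(()())"));
        System.out.println(isBalanced("(()"));
        System.out.println(isBalanced(")("));
    }
}
```

The depth is unbounded, which is exactly what a finite automaton cannot track.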
  • Re:Wrong way round (Score:3, Informative)

    by smallfries ( 601545 ) on Wednesday August 02, 2006 @05:13PM (#15835080) Homepage
    I'm not sure if you got the parent's point (apologies if you did). By trivial string handling he's talking about recursive structures, and the erroneous strings he's mentioning are probably programs as input to a compiler. The 'non-trivial' strings are the class of strings that you would need a full grammar in order to parse, rather than a reg-exp. But yeah, not every time - horses for courses and all that.
  • regex coach (Score:4, Informative)

    by mgkimsal2 ( 200677 ) on Wednesday August 02, 2006 @05:14PM (#15835087) Homepage
    I spoke about the "regex coach" tool from http://weitz.de/regex-coach/ [weitz.de] on my podcast (shameless plug!) http://webdevradio.com/ [webdevradio.com] - it's a great tool for helping visually walk through the regex creation process, especially for complex needs.
  • by Anonymous Coward on Wednesday August 02, 2006 @05:20PM (#15835122)
    Save yourself $14.80 by buying the book here: Java Regular Expressions [amazon.com]. And if you use the "secret" A9.com discount [amazon.com], you can save an extra 1.57%! That's a total savings of $15.20, or 38.58%!
  • Re:Java sucks (Score:2, Informative)

    by vingilot ( 218702 ) on Wednesday August 02, 2006 @05:29PM (#15835191)
    Come on:
    "Some String".replaceAll("Java", "Bloated piece of shit")

    And FYI PatternSyntaxException is a runtime exception so no need to catch it and rethrow as a RuntimeException.

    so to write it your way:

    String theTruth(String s){
            return Pattern.compile("Java").matcher(s).replaceAll("Bloated piece of shit");
    }

  • Re:Java sucks (Score:5, Informative)

    by Derkec ( 463377 ) on Wednesday August 02, 2006 @05:38PM (#15835271)
    You don't have to throw anything there, you should just have one clear return in your method. You also probably shouldn't be compiling your pattern every time.

    Try:
    // Compiled once when the class loads. PatternSyntaxException is unchecked, so no try/catch is needed.
    private static final Pattern pattern = Pattern.compile("Java");
     
    public String theTruth(String string) {
      Matcher matcher = pattern.matcher(string);
      return matcher.replaceAll("something I don't know jack shit about");
    }
    Still not as compact but at least there aren't any tildes in there. I wonder if there would be a more compact way to do it. This seems terribly heavy weight for such a simple example. Oh, wait! There is!
    public String theTruth(String string) {
      return string.replaceAll("Java", "this is really easy");
    }
    So now we compare:
    public String theTruth(String s) { return s.replaceAll("Java", "this is easy"); }
    To:
    sub theTruth($) { my $s = shift; $s =~ s/Java/Bloated piece of shit/g; return $s; }
    So the Java code ends up being a handful of characters longer and much easier to read. I'm not saying that Java is the ideal Regex language, but your example sucked.
  • Re:regex coach (Score:2, Informative)

    by sickofthisshit ( 881043 ) on Wednesday August 02, 2006 @06:47PM (#15835707) Journal
    This tool, by the way, was written in Common Lisp, using Edi's own library

    CL-PPCRE [weitz.de] - portable Perl-compatible regular expressions for Common Lisp

    A library which typically outperforms Perl's own regex engine.
  • by The Snowman ( 116231 ) * on Wednesday August 02, 2006 @07:30PM (#15835983)

    Here is the class I assume the parent is referencing: Formatter class [sun.com].

    Essentially, you don't have C-style varargs; an array is silently created for you when you pass the arguments. This doesn't waste "gobs of heap space" like the parent says; it uses the same amount as it would using the stack. Remember, these are objects, and Java never passes objects by value -- always by reference. So each argument wastes one machine word (usually 32 bits). Whoop de fucking doo. And, since it uses references, the only allocation/deallocation is the temporary array. And in 1.5 if not previous versions, this is very very fast. With a JIT compiler you'll hardly notice it. I do agree that the decision to make the class "final" is shitty, but honestly, I don't see how subclassing it would be a huge advantage. It would be like subclassing the java.lang.String class. Sure, you could add some nifty stuff, but it's not a big deal.
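The array behind a varargs call can be observed directly. In this sketch (class and method names invented for illustration), the compiler packs the arguments into an Object[] at the call site, and an explicit array can be passed in its place.

```java
public class VarargsDemo {
    /** Inside the method, the varargs parameter is an ordinary array. */
    static int count(Object... args) {
        return args.length;
    }

    /** Reports the runtime type of the varargs parameter. */
    static String describe(Object... args) {
        return args.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(count("a", "b", 3));
        System.out.println(describe(1, 2));

        // String.format is itself a varargs method, so both calls are equivalent.
        System.out.println(String.format("%s scored %d", "Ann", 42));
        System.out.println(String.format("%s scored %d", new Object[] {"Ann", 42}));
    }
}
```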

    As a person who earns his living off of J2EE, I know its strengths and weaknesses. I am not a fanboy, however. I am more than willing to give Java hell when it deserves it. I think string handling in general is not as well-organized or easy to use as it could be, but it is certainly capable. I rarely use sprintf() style string formatting anyway, even in C++. I find it much easier to use iostreams, which are typesafe and almost as fast as sprintf(). In Java I just use string concatenation, and the formatting classes when I need it. It isn't perfect, but it works well enough and sure isn't slow.

  • Comment removed (Score:5, Informative)

    by account_deleted ( 4530225 ) on Wednesday August 02, 2006 @07:35PM (#15836016)
    Comment removed based on user account deletion
  • Re:My main complaint (Score:3, Informative)

    by Chris Pimlott ( 16212 ) on Wednesday August 02, 2006 @11:24PM (#15837077)
    If you use Jakarta Commons-Configuration [apache.org], there's basically no extra code to use XML configuration files.

    For example, the regex defined here:
    <foo>
        <bar>
            <regex>...</regex>
        </bar>
    </foo>
    becomes simply "foo.bar.regex", just like a standard properties file.
