PHP

blazin's Journal: Regex gurus, please read

I am in need of some help. I'm trying to write a regular expression to parse a chunk of text, but I am not having much luck, mainly because a lot of the text may be optional.

The text may look like this:

Message . . . . : Object Q000006003 in COLOSYS02 type *USRQ deleted.

It could look like this:

From user . . . . . . . . . : QSYS
  Message . . . . : Subsystem is ending controlled.

It could even look like this:

Message . . . . : Job ended abnormally.
  Cause . . . . . : A SIGTERM signal was received for the job. The action for the signal was to terminate the job.

Or like this:

Message . . . . : Job 818753/ONEWORLD/JDENET_K ended on 06/22/05 at 15:53:39; 18 seconds used; end code 30 .
  Cause . . . . . : Job 818753/ONEWORLD/JDENET_K completed on 06/22/05 at 15:53:39 after it used 18 seconds processing unit time. The job had ending code 30. The job ended after 1 routing steps with a secondary ending code of 0. The job ending codes and their meanings are as follows: 0 - The job completed normally. 10 - The job completed normally during controlled ending or controlled subsystem ending. 20 - The job exceeded end severity (ENDSEV job attribute). 30 - The job ended abnormally. 40 - The job ended before becoming active. 50 - The job ended while the job was active. 60 - The subsystem ended abnormally while the job was active. 70 - The system ended abnormally while the job was active. 80 - The job ended (ENDJOBABN command). 90 - The job was forced to end after the time limit ended (ENDJOBABN command). Recovery . . . : For more information, see the Work Management topic in the Information Center, http://www.ibm.com/eserver/iseries/infocenter.

The formatting is not exactly as shown, since Slashdot is !helpfully reformatting some parts of the ecode tag. The main issue is that I need to parse out each of the fields (From user, Message, Cause, Recovery, etc.). Any of these fields may or may not show up in a particular message. This particular message type seems to have the fewest fields to parse. Others that I need to handle will have many more, with all sorts of gotchas to look out for.

I'd really like to use regex for this since it seems to make things a lot simpler. The other option, which seems to involve lots of strpos and substr calls, is much uglier (uglier than regexes, ha).

I've been playing with optional groups, non-greedy matching, etc., but I'm not having a lot of luck. Any help would be much appreciated.

Oh yeah, this is PHP, so I don't have sed, Perl, or similar tools available.

Thank you.
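
For anyone reading along, here is a rough sketch of one way the optional fields could be handled in PHP without optional groups: split the text on the label markers themselves. The function name `parse_fields` is made up for illustration, and the label pattern is an assumption based only on the samples above (a capitalized word, optionally followed by lowercase words, then a run of " ." and a colon). Real messages may need a looser pattern. It also assumes a reasonably modern PHP for the type declarations and short array syntax.

```php
<?php
// Hypothetical helper: splits one message into label => value pairs.
function parse_fields(string $text): array
{
    // Assumed label shape: a capitalized word, optional lowercase words
    // ("From user"), then two or more " ." and a colon.
    $label = '/([A-Z][a-z]+(?: [a-z]+)*)(?: \.){2,} ?:\s*/';

    // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // [text before first label, label, value, label, value, ...]
    $parts = preg_split($label, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $fields = [];
    for ($i = 1; $i + 1 < count($parts); $i += 2) {
        $fields[$parts[$i]] = trim($parts[$i + 1]);
    }
    return $fields;
}

$sample = "Message . . . . : Job ended abnormally.\n"
        . "  Cause . . . . . : A SIGTERM signal was received for the job.";
print_r(parse_fields($sample));
```

Because the split happens wherever a label appears, a field that is missing simply never shows up as a key, and a label in the middle of a line (such as Recovery inside the Cause text) starts a new field without needing a newline.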

This discussion has been archived. No new comments can be posted.

  • It's (subject) "dots and colon" (reason), right?
    Then do a:

    $line =~ /(.*) "dots and colon escaped out" (.*)/;
    $subject = $1;
    $reason = $2;

    Of course, things get tricky when you handle multi-line values, but it's nothing horribly complex (if the above regexp returns nothing and you have a subject from the previous line, append the whole line to the reason).


    Too many junk characters filter sucks. This is why I had to type in 'dots and colon' everywhere, so sub in the actual dots and colon instead of my

    • The only problem I see with that is that (.*) will match all the previous stuff. There are cases where the subject is only one word, and other cases where it is more than one word. "Recovery" will probably occur somewhere inside of the "Cause" section without any kind of newline or anything.

      Also, /(.*) "dots and colons escaped out"(different numbers of dots and colons) (.*)/ would match the entire message and wouldn't quit when another subject . . . : reason comes up.

      Thanks for the help though. It does give me some
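
The greedy (.*) concern raised in this thread can be illustrated with a small PHP sketch. It swaps the greedy reason for a lazy (.*?) and adds a lookahead that stops at the next "label . . . :" marker or at the end of the text. The function name `match_fields` and the label shape are assumptions drawn from the sample messages in the post, not something confirmed to cover every message type.

```php
<?php
// Hypothetical helper: lazy match plus lookahead instead of a greedy (.*).
function match_fields(string $text): array
{
    $label = '[A-Z][a-z]+(?: [a-z]+)*';   // assumed label shape
    $dots  = '(?: \.){2,} ?:';            // " . . . :" with any number of dots

    // (.*?) grows one character at a time and stops as soon as the
    // lookahead sees the next label marker or the end of the text.
    // The /s flag lets "." cross newlines for multi-line values.
    $pattern = "/($label)$dots\s*(.*?)(?=\s*(?:$label$dots|\z))/s";
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $fields = [];
    foreach ($matches as $m) {
        $fields[$m[1]] = $m[2];   // $m[1] is the label, $m[2] the value
    }
    return $fields;
}

$sample = "Cause . . . . . : The job ended (ENDJOBABN command). "
        . "Recovery . . . : For more information.";
print_r(match_fields($sample));
```

The lookahead does not consume the next label, so each match picks up where the previous value ended, and the number of dots after a label does not matter.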
