
Comment Re:Here's my challenge... (Score 1) 252

by shotglasses on Tuesday June 24, 2003 @02:51PM (#6286945) Attached to: Mastering Regular Expressions

If you have data where the multi-line matching isn't working, can you reformat your data in some way to get around the problem?

I have spent a lot of time parsing poorly written HTML pages, and I find that if I read the whole file into a string and then replace every whitespace character with a space, all of the multi-line problems (and many others) go away, because the data is now only one line...
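Here is a minimal sketch of that idea in Python (my choice of language for illustration; nothing above depends on it). The function name `flatten_whitespace`, the sample HTML, and the link pattern are all made up for the example, and a regex like this is only a rough tool for messy HTML, not a real parser.

```python
import re

def flatten_whitespace(text):
    """Replace every whitespace character (newline, tab, ...) with a space."""
    return re.sub(r"\s", " ", text)

# Hypothetical sample: a link whose tag and text are broken across lines.
html = '<a\n   href="page.html">Some\nlink text</a>'

# Hypothetical pattern; "." does not match a newline by default.
link = re.compile(r'<a\s+href="([^"]*)">(.*?)</a>')

# On the raw text the pattern fails, because the link text spans two lines.
before = link.search(html)

# On the flattened text the same pattern matches without any special flags.
flat = flatten_whitespace(html)
after = link.search(flat)
```

If you would rather have runs of whitespace collapse into one space, use `\s+` in `flatten_whitespace` instead of `\s`.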

This works with HTML because the "format" of the data is embedded in the tags, not in the physical layout. I have used a similar approach when parsing logfiles that try to be "user friendly" and wrap long lines, so that each line of the file may or may not be a complete record. To get one record per line, join all the lines together and split the result on the "timestamp" field; now you have a bunch of single-line records to work with. If there isn't a timestamp, is there another way to determine the beginning (or end) of a record?
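A rough sketch of the logfile version, again in Python. The timestamp format (`2003-06-24 14:51:07`), the function name `rejoin_records`, and the sample log are assumptions for the example; swap in whatever marks the start of a record in your own files.

```python
import re

# Assumed timestamp format, e.g. "2003-06-24 14:51:07" (hypothetical).
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def rejoin_records(raw):
    """Join wrapped lines, then start a new record at every timestamp."""
    # Step 1: glue all physical lines together into one long line.
    joined = " ".join(line.strip() for line in raw.splitlines())
    # Step 2: put a newline in front of each timestamp.
    marked = TIMESTAMP.sub(lambda m: "\n" + m.group(0), joined)
    # Step 3: split on those newlines and drop empty pieces.
    return [rec.strip() for rec in marked.split("\n") if rec.strip()]

# Hypothetical wrapped log: the first record continues on a second line.
raw_log = (
    "2003-06-24 14:51:07 ERROR could not open\n"
    "    file /tmp/data.txt\n"
    "2003-06-24 14:51:09 INFO retrying\n"
)

records = rejoin_records(raw_log)
```

One caveat: if a timestamp can also appear inside a message, this will split that record in two, so you may need a stricter pattern. Joining with a space also assumes the wrap happened between words.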

Obviously, you cannot always reformat the data file itself. If you cannot change the actual files, make a copy and modify the copy.

There might not be an easy way, but there should be a way -- you just have to keep working on it!

Mark
