Comment Re:Here's my challenge... (Score 1) 252
I have spent a lot of time parsing poorly written HTML pages, and I find that if I read the whole file into a string and then replace every whitespace character with a space, all of the multi-line problems (and many others) go away, because your data is now only one line...
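Here is a rough Python sketch of that idea. The function name and the sample HTML are made up for illustration, and it collapses each run of whitespace into a single space rather than swapping characters one for one, which is a small variation on what I described:

```python
import re

def collapse_whitespace(text: str) -> str:
    """Turn every run of spaces, tabs, and newlines into a single space."""
    # \s matches any whitespace character, including "\n", "\r", and "\t".
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample: one tag is broken across two lines.
html = "<p>\n  Hello,\n\tworld\n</p>\n<a\n href='x'>link</a>"
one_line = collapse_whitespace(html)
print(one_line)
```

After this, a pattern that looks for a whole tag no longer has to care where the original line breaks fell. Note that this would also flatten content where whitespace matters, such as the inside of a pre block.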
This works with HTML because the "format" of the data is embedded in the tags, not in the physical layout. I have used a similar approach when parsing logfiles that try to be "user friendly" and wrap long lines, so that each line of the file may or may not be a complete record. To get one record per line, join all the lines together and split the result on the "timestamp" field; now you have a bunch of single-line records to work with. If there isn't a timestamp, is there another way to determine the beginning (or end) of a record?
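For the logfile case with a timestamp, something like the following Python sketch could work. The timestamp layout (YYYY-MM-DD HH:MM:SS), the function name, and the sample log are all assumptions for illustration; adjust the pattern to whatever your logs actually use:

```python
import re

# Assumed timestamp layout, e.g. "2024-01-15 10:32:07". The lookahead makes
# the split happen just before each timestamp, so the timestamp stays
# attached to its record.
TIMESTAMP = re.compile(r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def split_records(raw: str) -> list[str]:
    """Join wrapped lines into one string, then cut it at each timestamp."""
    joined = re.sub(r"\s+", " ", raw).strip()
    # Splitting on a zero-width match needs Python 3.7 or later.
    return [part.strip() for part in TIMESTAMP.split(joined) if part.strip()]

# Hypothetical log where the first record was wrapped onto a second line.
raw = (
    "2024-01-15 10:32:07 ERROR disk quota exceeded for\n"
    "    user alice on /home\n"
    "2024-01-15 10:32:09 INFO retry scheduled\n"
)
for record in split_records(raw):
    print(record)
```

One caveat: if a message body happens to contain something that looks like a timestamp, this will split there too, so the pattern should be as specific as the log format allows.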
Obviously, you cannot always reformat the original data file. If you cannot change the actual files, make a copy and modify the copy.
There might not be an easy way, but there should be a way -- you just have to keep working on it!
Mark