Comment Re:That's copyright for you (Score 1) 292
All the embedded links are relative, actually. The browser shows them as file:// URLs because they're in a local file. They do appear to be session-specific, and don't work for me, either, now that my session has timed out. I'd probably have to download the pages all over again to get updated links.
Here is the script:
#!/bin/bash
BASE="http://web.lexisnexis.com"
URL="... first page
N=1
while true; do
___FNAME="$(printf "gacode%03d.html" $N)"
___wget -T5 -t3 --no-cookies --header "$(<cookie-header.txt)" -O "$FNAME" "$URL" || break
___NEXT="$(xmllint --html --xpath 'string(//a[img/@title="Next"]/@href)' "$FNAME" 2>/dev/null)"
___[ -z "$NEXT" ] && { echo "No next URL." 1>&2; break; }
___N=$((N+1))
___URL="$BASE$NEXT"
done
(Leading spaces were replaced with underscores to preserve layout.) The file "cookie-header.txt" needs to contain the contents of the header, including the "Cookie:" prefix, as transmitted by your browser. You can get this by using Wireshark and the "Follow TCP Stream" function, among other methods.
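If you want to catch a malformed header file before wget sends it, something like the following might help. It is only a rough sketch: the function name check_cookie_header is made up for illustration and is not part of the script above, and it assumes the file should hold exactly one non-empty line beginning with "Cookie: ".

```shell
# Hypothetical helper, not part of the original script: returns 0 if the
# given file looks like a single "Cookie: ..." header line, 1 otherwise.
check_cookie_header() {
    file="$1"

    # The file has to exist and be readable.
    if [ ! -r "$file" ]; then
        echo "Cannot read $file" 1>&2
        return 1
    fi

    # Assumption: exactly one non-empty line (one header, no extra lines).
    if [ "$(grep -c . "$file")" -ne 1 ]; then
        echo "$file should contain exactly one non-empty line" 1>&2
        return 1
    fi

    # That line must start with the "Cookie:" prefix.
    if ! grep -q '^Cookie: ' "$file"; then
        echo "$file does not start with \"Cookie: \"" 1>&2
        return 1
    fi

    return 0
}
```

Calling check_cookie_header cookie-header.txt || exit 1 just before the loop would stop the run early instead of fetching a login or error page for every N.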