Writing Software to Collect Click Stream Stats? 16
AntiPasto asks: "I am working with a small business that wants to evaluate their "click streams[?]". I've investigated openstats and even commercial products like Funnel Web as being turn-key solutions, but they don't offer the sort of authenticated-user page-view detail that we're looking for. I've since decided to start a mod_usertrack implementation, but it looks like we need to write our own stuff to process this. Anyone have any experiences with tracking a user's visit?"
Semester Project... (Score:4, Informative)
Read a line in.
Look up the IP address and time in the session information.
IF that IP does not exist, make a new session.
IF that IP does exist:
Is the time we just read within (you define) minutes of the current end time in the database for the last session?
Yes: Set end time of this session to the time just read in.
No: create a new session (same IP, but session ID is just one greater than the last)
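The steps above can be sketched in Python. This is only a sketch of the described loop, not the poster's actual code: the 30-minute gap and the `(ip, timestamp)` input shape are assumptions.

```python
from datetime import timedelta

SESSION_GAP = timedelta(minutes=30)  # the "(you define) minutes" threshold

def sessionize(hits):
    """hits: iterable of (ip, timestamp) pairs in log order.
    Returns {(ip, session_id): (start_time, end_time)}."""
    sessions = {}   # (ip, session_id) -> (start, end)
    latest = {}     # ip -> id of that IP's most recent session
    for ip, ts in hits:
        if ip not in latest:
            # IP does not exist yet: make a new session, ID 0
            latest[ip] = 0
            sessions[(ip, 0)] = (ts, ts)
        else:
            sid = latest[ip]
            start, end = sessions[(ip, sid)]
            if ts - end <= SESSION_GAP:
                # within the window: extend the current session's end time
                sessions[(ip, sid)] = (start, ts)
            else:
                # too old: new session, ID one greater than the last
                sid += 1
                latest[ip] = sid
                sessions[(ip, sid)] = (ts, ts)
    return sessions
```

A single page load leaves one row whose start and end times are equal, exactly as described below.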
For example, if someone loads just one page from your site, they would have a row in the session information table with their IP, a session ID of 0, and identical start and end times. If they load a page again (so there is another line in the apache log) within the time you set, then the end time of that session is just set to the time of the most recent page load in that session. We have another table called PageSession where we list the IP, session ID, and page ID for all pages accessed in the session. Note: we split html, htm, txt, php, etc. content and other content (jpg, gif, mpg, etc.) into different tables so we can query just html info or just picture info.
Other than IP we don't "authenticate" the user. We put in place a means of trying to weed out dial-up users vs. static IP users, but it is by no means well implemented as of now, since it relies on knowing the domain names of dial-up ISPs, or on looking for keywords like "dial" in a hostname from a known ISP that offers several types of connection. That aside, I don't know what means of authentication you have, but I don't see why it couldn't be tied into (or used in place of) our use of IP to denote a specific user.
Our initial goal for the project was just to look at date and time info about the sessions, with "page x, y, and not z" (hehe, "not z") reports available as an extension of our design. Part of the problem with looking for a correlation between pages visited on your site during a session (and referral URL stuff, too) is that simple data mining algorithms usually have a threshold in there to look for "interesting" relationships. You would supply an expected percentage for the number of sessions involving certain items. For example, if your main page links only to a sub page then you'd expect a high degree of relationship between those, but if a page buried deep down has nothing linking to it, it might be interesting to see how a person got there. This threshold is a tricky thing, but the info you get about unexpected things is amazing.
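The threshold idea can be shown with a toy sketch. Treating each session as the set of pages it touched, you measure how often two pages co-occur and flag pairs that stray from your expected percentage; the function names and the 10% tolerance are made up for illustration.

```python
def pair_support(sessions, page_a, page_b):
    """Fraction of sessions that contain both pages.
    sessions: list of sets of page IDs."""
    both = sum(1 for pages in sessions if page_a in pages and page_b in pages)
    return both / len(sessions)

def interesting(sessions, page_a, page_b, expected):
    """Flag a pair whose actual co-occurrence strays far from the
    percentage you expected -- the tunable "interesting" threshold."""
    actual = pair_support(sessions, page_a, page_b)
    return abs(actual - expected) > 0.10
```

With this, a main-page/sub-page pair you expected to co-occur 90% of the time but that only co-occurs 50% of the time gets flagged, which is exactly the kind of surprise worth investigating.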
I recall an anecdote about supermarkets looking for unexpected item sales: they found that a higher than expected percentage of people were buying beer and diapers on a Friday. It is suspected that men were doing the shopping at this time, had to get some things for the family, and had their own priorities... Supermarkets have it easy since people tend to buy all of their items in one transaction; the web is a gimme-it-now, one-at-a-time type thing, so defining a session is also an art, and I'd suspect it varies greatly depending on your site's content, target audience, bandwidth, etc.
Well, I guess I should end my rambling as tomorrow evening I have to give a presentation on this project and end my semester (finals are next week). Hope there is some nugget of info in there that helps!
Re:Semester Project... (Score:2)
One thing jumps out at me from that - proxy servers. We've got 10 or so users sharing an IP behind a proxy here. Your technique won't differentiate between us...
What about just setting an ID cookie with a short timeout but making every page bump the expiry up? OK, you've got a job identifying images that way, but it handles proxy servers and you can still use IP addresses for them.
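A sliding-expiry cookie like that can be sketched as follows. This is an illustration, not the poster's code: the cookie name `sid` and the 15-minute TTL are assumptions, and every page response would re-emit this header to bump the expiry.

```python
import time
import uuid

SESSION_TTL = 15 * 60  # seconds; short timeout, refreshed on every page view

def session_cookie_header(existing_id=None):
    """(Re)issue the session cookie with a bumped expiry, so it only
    dies after SESSION_TTL seconds of inactivity. Returns the session
    ID and the Set-Cookie header value."""
    sid = existing_id or uuid.uuid4().hex  # new visitor gets a fresh ID
    expires = time.strftime("%a, %d-%b-%Y %H:%M:%S GMT",
                            time.gmtime(time.time() + SESSION_TTL))
    return sid, "sid=%s; Expires=%s; Path=/" % (sid, expires)
```

Image requests usually skip whatever layer sets this header, which is the "job identifying images" caveat, but proxied users each carry their own cookie.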
Re:Semester Project... (Score:2)
Here's another way we looked at it. You have 10 people sharing the proxy; suppose 2 or 3 hit my site at the same time. They would be lumped together into a single session. Fine: then sessions do not mean a specific user hit the site, just a specific IP, and that's all we really want. We know that an IP could mean 10 people seeing things at a presentation, 10 behind a proxy, or 10 who dial in, click, hang up, then someone else dials into the same IP, clicks, hangs up... We know there's going to be overlap and error, but in the grand scheme of things I'd think it'd all come out in the wash.
We just wanted as simple and non-invasive a means of gathering information as possible. We see cookies as invasive (not that I have a problem with them, but since some people do, we have to treat them as such), and they rely on the user being as honest as possible.
This "problem" of proxies and such is actually eliminated entirely under the conditions the "Ask Slashdot" question describes: if users are authenticated, the login identifies them regardless of which IP or proxy they come from.
Proxies drastically change number of IPs (Score:1)
You should take proxies more into account... many large companies shove tens of thousands of users behind a handful (or one!) of IPs. Some colleges do as well. And, oh yeah - AOL. 10 million people or whatever, and they all use a handful of IPs. Don't forget the cable modem companies, the DSL companies, and all the little ISPs that encourage (or force) users to use their proxies.
If you rely simply on IP, not only would your sessions not make any sense in any kind of "he went here, then there, then there" kind of sense, but you'd vastly underestimate the number of users/sessions.
Re:Proxies drastically change number of IPs (Score:2)
Another job of the data miner is to look for these types of anomalous trends and account for them. If you see odd jumps around from things that look like proxies, add a rule marking them as proxies. Will this catch them all? Heck no. But even if you miss some, you can still get some bearing on the popularity of some pages relative to others on your site.
Underestimating the session count, I think, is a good thing. The goal of session identification (at least for me) was to reduce the reliance on those big "hit count" numbers. But to each his own...
Re:Proxies drastically change number of IPs (Score:2)
Is my way an end all - no. Is it better than cookies - oh the fun on a
I see cookies as a pass/fail type thing: either you get good data or you don't. My way, I think, has a little grey area. There are the obvious few clicks here and there from someone with a dedicated IP, and there are trickier cases, but with some good coding I feel there are ways to clean up the data and make better judgements than the first impression of "he's stupid to be relying on just IPs, there's nothing one can get from that nonsense." I think a more appropriate statement covering many means of web traffic analysis would be: "He's silly to be relying on a single means of data interpretation, and even sillier if he thinks there exists a solution that does not have a scenario circumventing it."
Depends on the implementation (Score:2)
On the other hand, if you use a single driver script/servlet/JSP and dynamically produce content based on form variables, then your driver must handle the reporting, because the server log isn't going to report anything except the base URL for every request. In this case, your driver needs to log what is happening before it serves up the appropriate content.
Name lookups can be tricky (Score:1)
One issue that I have noticed is that unless you are serving scripts, hostnames in the logs are becoming less relevant due to caching and proxies. And if you do track hostnames, then getting a 100% (OK, 99%) accurate list is hard: if you wait a few days, weeks or months to analyse your logs, some of the IPs may have changed owners. If you let the HTTP daemon do the lookup, then you suffer a drop in performance, and the very first lookup or two often fails anyway.
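One hedged middle ground is to resolve the IPs yourself shortly after logging, with a cache so each address is looked up only once, falling back to the raw IP when resolution fails. A minimal sketch:

```python
import socket

_cache = {}

def hostname(ip):
    """Reverse-resolve an IP once and cache the result. Falls back to
    the raw IP string when the lookup fails (common for stale or
    unresolvable addresses), so log processing never stalls on it."""
    if ip not in _cache:
        try:
            _cache[ip] = socket.gethostbyaddr(ip)[0]
        except OSError:  # covers herror/gaierror: no PTR record, bad address
            _cache[ip] = ip
    return _cache[ip]
```

Running this soon after the hits are logged narrows the window in which an IP can change owners, without making the web server block on DNS per request.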
If you use Apache (I have no experience with the others), then it is easy to pipe the log output directly to a script.
You can even make it tab-delimited or whatever it takes to be easier to parse. Myself, I'm about to experiment with logrotate [rt.com], rotatelogs [apache.org], cronolog [ford-mason.co.uk] and mod_mylog [sourceforge.net]. mod_mylog puts the log output straight into your RDBMS and even claims to cache records if the RDBMS is temporarily unavailable.
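A piped log handler can be as small as a script reading stdin. This sketch assumes a hypothetical tab-delimited LogFormat of host, time, request, and status; the field layout and script path are illustrative, not from the original post.

```python
import sys

def parse_line(line):
    """Split one tab-delimited log line into a dict. Assumes a
    hypothetical LogFormat "%h\\t%t\\t%r\\t%>s" named "tabbed"."""
    host, ts, request, status = line.rstrip("\n").split("\t")
    return {"host": host, "time": ts, "request": request, "status": status}

def main():
    # In httpd.conf: CustomLog "|/usr/local/bin/logpipe.py" tabbed
    for line in sys.stdin:
        record = parse_line(line)
        # ...insert `record` into your RDBMS or session tables here...
        print(record["host"], record["status"])

if __name__ == "__main__":
    main()
```

Because Apache keeps the pipe open for the life of the server, the script should loop forever and never exit on a single bad line.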
Try this... (Score:2, Informative)
There's an article about phpopentracker at trafficmanager (http://www.trafficmanager.co.uk/reviews/article.
There are also discussion forums for this kind of thing at http://forums.trafficmanager.co.uk
Don't re-invent the wheel (Score:2, Informative)
Ralph Kimball's the guy here (Score:3, Informative)
He's done a massive amount of work on data warehousing and has a book on the subject of looking at clickstreams with respect to data warehouse techniques.
Check out www.rkimball.com, and also read up on data warehousing, if you are going to really get into all this.
who's who & what are they doing. (Score:2)
granted, i don't get that many hits (2-3,000 a month) and a substantial number are from people i know, but it sure is fun to do.
Setting a cookie will let you pinpoint that a given instance of netscape is viewing your site. without cookies you can tell that a given ip is viewing your site, but then proxies will get you in trouble and may get you some nice email [blackant.net] (that one deals with those pesky nipr.mil people). using ip, timestamps and useragent together can get you more accurate pinpointing, but still not exact.
I like to set an invisible image at the bottom of my website with a special id on it. This image gets changed at onunload which, for those who use javascript and unload the page, will tell me how long they viewed any given page. [blackant.net] (some of this info is also presented at the bottom of every page).
If you really wanted to, and could afford a fully dynamic site, you could have every single link called with a ?sessionid at the end of it (like http://www.example.com/?1234) and have this reset if the referer wasn't from your site (this would cover people copying and pasting the link to someone else), and then parse through your logs afterwards. but that could get annoying.
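The referer-reset rule could look something like this sketch. The site URL, ID scheme, and function names are all made up for illustration; the point is only the decision of when to keep vs. reissue the ID.

```python
import random

SITE = "http://www.example.com"

def link_session_id(referer, current_id):
    """Keep the session ID while the referer stays on-site; issue a
    fresh ID when the visitor arrives from elsewhere or with no
    referer (e.g. a link pasted to someone else)."""
    if referer and referer.startswith(SITE) and current_id is not None:
        return current_id
    return random.randrange(10**8)  # hypothetical fresh-ID scheme

def tag(url, sid):
    """Append the session ID to a link, as in http://www.example.com/?1234."""
    return "%s?%d" % (url, sid)
```

Parsing the logs afterwards, each distinct sessionid value then groups one visitor's clicks even behind a shared proxy IP.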
as far as the logs go, a friend of mine has apache log directly to mysql, which facilitates his parsing. as another poster mentioned, sniffing traffic can help alleviate strain on your webserver - a nice openbsd bridging firewall will do the trick. (checking your firewall logs is handy in other ways - i have some hidden "easter eggs" on my site which appear to be exploits on my box - i check who gets to those pages and then who tries to connect to port XX [blackant.net] and see which ip's match up - nice little stats)
One time i wondered what would happen if i matched the ips from my mail headers to the ips from my weblogs [blackant.net]. It turns out that around 1% of the unique ips in my mail headers also appeared in the weblogs, which means that with pretty fair certainty i knew who was browsing my site. But this mainly works due to the personal nature of my site.
On the other hand, if you've access to lastlogs, query logs, and weblogs you can really start identifying local users of your website. i work at a local college and can learn a lot about a particular viewer by seeing if the same ip is logged in to a given server, or by looking at the query log and seeing what else they've done dns lookups for. Add in a messaging system [blackant.net] and you can freak people out. (i also use this to freak out people who search images.google.com for 'breast' and get to my site (i'm a photographer too)).
one other thing i find useful is to keep track of who requests robots.txt. this can be an indication that someone is a robot or proxy. it also helps me present special information to search engines, allowing them to know of (and thus index) a new page the next time they get to my site (i put in a couple of special links if you access robots.txt).
a friend of mine runs a journal/bbs website [afrodiary.com] and was wondering about tracking his users when they create different accounts. We are thinking about implementing something similar to my spam identification [blackant.net] to identify similar writing styles and possibly the same people in different accounts.
Once you've gathered up some data you'll want to look at it in a nice way. you could use excel or you could create some really nice webmaps [blackant.net] (that site also has links to similar mapping projects).
finally, a word of advice: if you put up a page of your referer logs, include it as disallowed in your robots.txt, or you will get a lot of strangely referred people.
oh, and keep in mind no one method will be 100% accurate, but a combination of methods can get you close.
and there was an article not too long ago about MIT (i think) doing studies into how people view webpages - that is, if the mouse is over to the side of the screen then the person is most likely reading the page, if the mouse is in the middle of the page then probably not, etc.
maybe forcing every viewer into a frameset and then tracking changes in the subframes is a viable option. associate a frame change to a hidden image change with an encoded identifier.
ok, thats it for now.