Log Mining CSE 454 Eytan Adar November 28, 2007
So far
Building massive services
– Crawling data
– Processing data (mining, machine learning, extractions, etc.)
– Indexing data
– Serving data
Now what?
Behavior
Hopefully at this point you actually have users
Users interact, use, and add content
As an (information) side effect
– They leave traces behind
We would like to make use of this
– Understand our demographics
– Improve the service
Logging Web Activity (Review)
Most servers support “common logfile format” or “extended logfile format”
18.1.13.12 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Apache lets you customize the format
Every HTTP event is recorded
– Page requested
– Remote host
– Browser type
– Referring page
– Time of day
Cookies
Other instrumented information can be passed in URLs (e.g. URL rewriting)
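As a rough illustration of reading one such log line, here is a minimal Python sketch. The regular expression, the field names, and the function `parse_clf_line` are assumptions made for this example, not anything the slides prescribe, and the request path is written with an underscore (`/apache_pb.gif`) on the assumption that the space in the sample line is an extraction artifact.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Timestamps look like 10/Oct/2000:13:55:36 -0700
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    # A well-formed request is "METHOD PATH PROTOCOL"; otherwise keep it whole.
    parts = fields["request"].split()
    if len(parts) == 3:
        fields["method"], fields["path"], fields["protocol"] = parts
    return fields


example = ('18.1.13.12 - frank [10/Oct/2000:13:55:36 -0700] '
           '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_clf_line(example)
```

Once every line is a dict like `record`, the simple statistics (use over time, number of users, what they look at) reduce to counting over these fields.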
Simple Stats (task 1)
Building a basic analytics site
– Use over time
– How many users
– Where they are
– Where they come from
– What they look at
Leveraging the Data
More advanced modeling of behavior
Improve user interface (designer perspective)
– What are people looking at?
– What/where are they clicking on?
– Where do they enter a site? Leave?
– Repeated behaviors?
– Is today different than yesterday? Did my redesign have an impact?
– How are my ads doing? Where should I put them?
User Tracing
Trace a user through a website
Commercial vendors
– SPSS
– SAS
– WebTrends
– ClickTracks
– (see http://www.kdnuggets.com/software/web-mining.html for a ton more)
Clickdensity Maps
User Tracing Trace users through a website VisitorVille
User Tracing Trace a user through a website WebQuilt, Berkeley (Proxy based solution)
User Tracing Trace a user through a website Ed Chi, Xerox PARC
User Tracing
Tracking users is tricky. Why?
What’s a user?
– Proxies, multiple accounts, cookies lost, robots
What’s a session?
– Back button, caching
– The “bathroom” problem
– Are they doing something new?
– Entering and leaving
– De-heading
– Are they really done?
– Bookmarks
User Tracing
Tricks of the trade
– Cookies help
– Force cache flushing
– JavaScript (“bugs”)
– Time-based session delimiters
  Fixed (30 minutes)
  Adaptive (calculated from inter-arrival times)
– Referral logs
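The fixed time-based delimiter can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event tuple layout `(user_id, timestamp_seconds, url)` and the function name `split_sessions` are invented for the example, and a user is assumed to be already identified (for instance by a cookie).

```python
from collections import defaultdict

SESSION_TIMEOUT_SECONDS = 30 * 60  # the fixed 30-minute delimiter


def split_sessions(events, timeout=SESSION_TIMEOUT_SECONDS):
    """Group (user_id, timestamp_seconds, url) events into per-user sessions."""
    by_user = defaultdict(list)
    for user_id, timestamp, url in events:
        by_user[user_id].append((timestamp, url))

    sessions = defaultdict(list)
    for user_id, hits in by_user.items():
        hits.sort()  # order each user's hits by time
        current = []
        last_time = None
        for timestamp, url in hits:
            # A gap longer than the timeout closes the current session.
            if last_time is not None and timestamp - last_time > timeout:
                sessions[user_id].append(current)
                current = []
            current.append(url)
            last_time = timestamp
        if current:
            sessions[user_id].append(current)
    return dict(sessions)


# Hypothetical events: u1 pauses for just over 30 minutes before /c.html.
events = [
    ("u1", 0, "/a.html"),
    ("u1", 600, "/b.html"),
    ("u1", 600 + 1801, "/c.html"),
    ("u2", 100, "/a.html"),
]
sessions = split_sessions(events)
```

An adaptive variant would replace the constant `timeout` with a value estimated from each user's inter-arrival times.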
Leveraging the Data
More advanced modeling of behavior
Improve user interface (designer perspective)
Automatically modify the service
– Guidance (good next place to go)
– Personalization (you would like …)
– Better index (try this query …)
– Security
General Techniques
Association rules
– If (1.html & 2.html → 3.html)
– Standard ML algorithms
Repeated patterns
– 1.html → 2.html → 3.html is a common path
– Statistics, motifs, and “sequence” alignment
Clustering
– Digraph, users to pages
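Two of these ideas can be sketched with simple counting. The Python below is a toy illustration, not a standard association-rule miner: `count_paths` tallies consecutive page sequences, and `rule_confidence` estimates how often sessions that contain the antecedent pages also contain the consequent. Both function names and the sample sessions are assumptions made for the example.

```python
from collections import Counter


def count_paths(sessions, length=3):
    """Count every run of `length` consecutive pages across all sessions."""
    counts = Counter()
    for pages in sessions:
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return counts


def rule_confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing all antecedent pages that also contain the consequent."""
    antecedent = set(antecedent)
    matching = [s for s in sessions if antecedent <= set(s)]
    if not matching:
        return 0.0
    return sum(consequent in s for s in matching) / len(matching)


# Hypothetical sessions, each a list of pages in visit order.
sessions = [
    ["1.html", "2.html", "3.html"],
    ["1.html", "2.html", "3.html", "4.html"],
    ["2.html", "1.html", "5.html"],
]
path_counts = count_paths(sessions)
top_path, top_count = path_counts.most_common(1)[0]
confidence = rule_confidence(sessions, ["1.html", "2.html"], "3.html")
```

Here the path 1.html, 2.html, 3.html occurs in two sessions, and the rule "1.html & 2.html implies 3.html" holds in two of the three sessions that contain both antecedent pages.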
PageRank Behavior
Implicit links, find where people go
[Figure: graph of pages P1, P2, P3, P4]
Calculate ranking based not only on real links but also implicit ones
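One way to picture this is to add observed page-to-page transitions as weighted edges next to the real hyperlinks and run an ordinary PageRank power iteration on the combined graph. The sketch below assumes exactly that; the weighting scheme (`implicit_weight` times the transition count), the function names, and the sample links are illustrative assumptions, not the method of any particular system.

```python
def combine_links(real_links, observed_transitions, implicit_weight=1.0):
    """Merge real links (weight 1 each) with observed transitions (weighted by count)."""
    edges = {}
    for link in real_links:
        edges[link] = edges.get(link, 0.0) + 1.0
    for transition, count in observed_transitions.items():
        edges[transition] = edges.get(transition, 0.0) + implicit_weight * count
    return edges


def pagerank(edges, damping=0.85, iterations=50):
    """edges maps (source, target) -> weight; returns page -> score."""
    pages = sorted({page for edge in edges for page in edge})
    n = len(pages)
    out_weight = {page: 0.0 for page in pages}
    for (source, _), weight in edges.items():
        out_weight[source] += weight

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing edges is spread evenly.
        dangling = sum(rank[p] for p in pages if out_weight[p] == 0)
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in pages}
        for (source, target), weight in edges.items():
            if weight > 0:
                share = weight / out_weight[source]
                new_rank[target] += damping * rank[source] * share
        rank = new_rank
    return rank


# Hypothetical data: four real links plus two transitions seen in the logs.
real_links = [("P1", "P2"), ("P2", "P3"), ("P3", "P1"), ("P1", "P4")]
observed = {("P4", "P3"): 5, ("P2", "P4"): 3}
ranks = pagerank(combine_links(real_links, observed))
```

Setting `implicit_weight` to 0 recovers a ranking based on real links alone, which makes it easy to compare the two.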
Guiding Many Users
Suggesting where to go based on previous trails/footprints
WebWatcher
Guiding Users
Suggesting where to go based on previous trails/footprints
Do things dynamically
– (if a.html → b.html, suggest c.html)
MINPATH (Anderson et al.)
– Mobile users don’t want to “surf”
– Learn the paths
– Suggest shortcut links
Personalized site maps (Toolan/Kushmerick)
Query Logs
Specific (important) type of web log analysis
Users are presented with SERPs (Search Engine Result Pages)
Behavior logged as:
[user info] date “query” #results result-clicked clickthrough
Same analysis issues
What’s a user?
– Somewhat easier (lots of instrumentation)
What’s a session?
– Users are hitting the back button a lot
– Users also re-search a lot
  Maybe even 40%
But we also want search sessions
– “Seattle Basketball” → “Sonics” → “Seattle SuperSonics”
New Analysis Issues
Query refinement – tracking sessions is harder

Type                   Example
Capitalization         Air France → air france
Word order             New York Department of State → Department of State New York
Stop words             Atlas of Missouri → Missouri Atlas
Word swaps             American Embassy → American Consulate
Abbreviations          British Airways → BA
Misspellings           Yahoo → Yahho
Extra words/phrases    Six Flags → Six Flags New York
Reformulations         United Nations Secretary General → Kofi Annan
Synonyms               Practical Jokes → Pranks
Generating Sessions
Some of it is easy
– Normalize
– Re-order
– Drop stopwords
Hard part for us is the same thing that’s hard for users
– Spelling mistakes, better queries, etc.
– Advantage of scale
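The easy steps can be sketched as a single canonicalization function in Python. The stopword list and the function name `normalize_query` are assumptions for illustration; a real system would use a fuller list and probably smarter tokenization.

```python
# Assumed, deliberately tiny stopword list for the example.
STOPWORDS = {"of", "the", "a", "an", "in", "for", "and"}


def normalize_query(query):
    """Lowercase, drop stopwords, and sort the remaining words."""
    words = query.lower().split()
    kept = [word for word in words if word not in STOPWORDS]
    if not kept:
        # Keep queries made only of stopwords rather than emptying them.
        kept = words
    return " ".join(sorted(kept))


same_case = normalize_query("Air France") == normalize_query("air france")
same_order = (normalize_query("New York Department of State")
              == normalize_query("Department of State New York"))
same_stop = normalize_query("Atlas of Missouri") == normalize_query("Missouri Atlas")
```

Two queries that normalize to the same string can be treated as the same query when stitching a search session together. Misspellings and reformulations are untouched by this step.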
Spelling Mistakes
People frequently make the same typos
– Yahho
Many correct themselves
– Yahoo
Task: find pairs of queries that always come one after the other
– Yahho → Yahoo
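A minimal Python sketch of that task: count how often one query is immediately followed by another within a session, and keep the pairs that occur often and account for most occurrences of the first query. The thresholds `min_count` and `min_fraction`, the function name, and the sample sessions are assumptions chosen for the example; "always" is relaxed to "usually" so that noisy logs still yield candidates.

```python
from collections import Counter


def correction_candidates(sessions, min_count=2, min_fraction=0.5):
    """Return (first, second, count) for queries that are usually followed by another."""
    pair_counts = Counter()
    query_counts = Counter()
    for queries in sessions:
        query_counts.update(queries)
        for first, second in zip(queries, queries[1:]):
            if first != second:
                pair_counts[(first, second)] += 1

    candidates = []
    for (first, second), count in pair_counts.items():
        # Keep pairs that are frequent and dominate the first query's follow-ups.
        if count >= min_count and count / query_counts[first] >= min_fraction:
            candidates.append((first, second, count))
    return sorted(candidates, key=lambda item: -item[2])


# Hypothetical search sessions, each a list of queries in order.
sessions = [
    ["yahho", "yahoo"],
    ["yahho", "yahoo", "yahoo mail"],
    ["yahho", "yahoo"],
    ["yahoo", "weather"],
]
candidates = correction_candidates(sessions)
```

With enough traffic, the same counts can also surface reformulations and synonyms, not only typos.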
Synonyms/Reformulations
[Figure: queries Q1 and Q2 (“UN Sec. General”, “Kofi”) connected to results R1–R4: http://foo.bar., http://a.b.c., http://1.2.3., http://6.7.8.]
Synonyms/Reformulations
Combine ideas
Suggest queries
[Figure: queries Q1, Q2, Q3, Q4]
Improving Query Results
What’s a good result?
Heuristic of the last click
– Users fail in lots of ways
– Usually succeed in one
– Find the most popular, last clicked result in a session
Problem: most popular last click is almost always result #1
– Can “test the waters” by occasionally swapping results
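The last-click heuristic is straightforward to sketch in Python. The input shape (a query paired with its clicked URLs in click order), the function name, and the example.com URLs are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def best_results_by_last_click(sessions):
    """Map each query to the URL that was most often the last click in a session."""
    last_clicks = defaultdict(Counter)
    for query, clicks in sessions:
        if clicks:  # sessions without clicks carry no signal here
            last_clicks[query][clicks[-1]] += 1
    return {query: counts.most_common(1)[0][0]
            for query, counts in last_clicks.items()}


# Hypothetical sessions for one query.
sessions = [
    ("sonics", ["http://example.com/a", "http://example.com/b"]),
    ("sonics", ["http://example.com/b"]),
    ("sonics", ["http://example.com/a"]),
    ("sonics", []),
]
best = best_results_by_last_click(sessions)
```

This sketch does nothing about position bias. Correcting for it needs the occasional result swapping mentioned above, so that lower-ranked results get a chance to be clicked.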
Automatic Improvements
The DirectHit algorithm
[Figure: query Q1 with results R1–R4 (http://foo.bar., http://a.b.c., http://1.2.3., http://6.7.8.) and clickthrough shares of 85%, 5%, 2%, 1%]
Automatic Improvements
The DirectHit algorithm
[Figure: query Q1 with results R1–R4 (http://foo.bar., http://a.b.c., http://1.2.3., http://6.7.8.) and clickthrough shares of 2%, 5%, 85%, 1%; the result with the 85% share is moved to the top]
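The "move to top" idea can be illustrated by re-sorting results on their observed click share. This is only a sketch in the spirit of the slide, not the actual DirectHit algorithm; the mapping of percentages to R1 through R4 and the function name are assumptions.

```python
def rerank_by_clicks(results, click_share):
    """Order results by observed click share, highest first (ties keep original order)."""
    return sorted(results, key=lambda result: click_share.get(result, 0.0),
                  reverse=True)


# Assumed reading of the figure: R3 attracts most clicks despite ranking third.
results = ["R1", "R2", "R3", "R4"]
click_share = {"R1": 0.02, "R2": 0.05, "R3": 0.85, "R4": 0.01}
reranked = rerank_by_clicks(results, click_share)
```

Because the sort key is raw click share, a ranker like this is easy to manipulate with automated clicks, so it needs robot and abuse filtering in front of it.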
Security/Spam Issues
Issues in taking into account what users do
– Robots
– Malicious users
These are aggressively removed
– Too many queries, too quickly
Personalization helps
– Limits impact to one person/small group
Automatic Improvements
Personalization
[Figure: User 1, User 2, and User 3 each send Q1 to the search engine, which returns results]
Automatic Improvements
Personalization
[Figure: User 1, User 2, and User 3 each send Q1 to the search engine; each user has a user model and receives their own results]
Personalization
Learning the user model
– User tells us what they’re interested in (categories of pages or specific pages)
– We infer what they’re interested in
  Pages with Apple AND Farm
  Pages with Apple AND Computer
– User model “boosts” certain word scores
  Remember TF-IDF?
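One possible reading of "boosting" word scores is to add the user model's terms, with weights, to a plain TF-IDF score. The Python sketch below assumes that reading; the scoring formula, the function names, and the tiny corpus are illustrative assumptions, not a description of any production ranker.

```python
import math
from collections import Counter


def document_frequencies(corpus):
    """Count, for each term, how many documents contain it."""
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    return doc_freq


def personalized_score(query_terms, doc, doc_freq, num_docs, user_model=None):
    """TF-IDF score where user-model terms contribute with their boost as weight."""
    weights = Counter()
    for term in query_terms:
        weights[term] += 1.0
    for term, boost in (user_model or {}).items():
        weights[term] += boost

    term_counts = Counter(doc)
    score = 0.0
    for term, weight in weights.items():
        if term_counts[term] and doc_freq[term]:
            idf = math.log(num_docs / doc_freq[term])
            score += weight * term_counts[term] * idf
    return score


# Hypothetical corpus of tokenized pages.
farm_doc = ["apple", "farm", "orchard", "apple"]
computer_doc = ["apple", "computer", "laptop"]
corpus = [farm_doc, computer_doc, ["banana", "farm"], ["computer", "keyboard"]]
doc_freq = document_frequencies(corpus)

query = ["apple"]
model = {"computer": 2.0}  # assumed user model: this user cares about computers
plain = [personalized_score(query, d, doc_freq, len(corpus))
         for d in (farm_doc, computer_doc)]
boosted = [personalized_score(query, d, doc_freq, len(corpus), model)
           for d in (farm_doc, computer_doc)]
```

Without the model, the farm page scores higher for "apple" because the word appears twice; with the model, the computer page overtakes it.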
Questions?