CMU SCS Peta-Graph Mining Christos Faloutsos Appel, Ana Chau,
25 Slides1.08 MB
CMU SCS Peta-Graph Mining Christos Faloutsos Appel, Ana Chau, Polo Leskovec, Jure Kang, U Prakash, Aditya Shringarpure, Suyash Tsourakakis, Charalampos Yahoo/Hadoop, 2008 #1
CMU SCS Our goal: One-stop solution for mining huge graphs Yahoo/Hadoop, 2008 2
CMU SCS Outline Centralized Hadoop Degree Distribution old old Pagerank old old Diameter/ANF old X Communities old X Triangles X todo Visualization X todo Datasets: (a) Synthetic (‘Kronecker’, 300M nodes, 1B edges) Yahoo/Hadoop, (b) NetFlix (20K movies, 500K2008 users, 100M edges) 3
CMU SCS Degree Distributions - NetFlix count 100 machines - 8min Yahoo/Hadoop, 2008 Movie in-degree 4
CMU SCS Degree Distributions - NetFlix count Theoretically expected 100 machines - 8min Yahoo/Hadoop, 2008 Movie in-degree 5
CMU SCS Degree Distributions - NetFlix count 100 machines - 8min Yahoo/Hadoop, 2008 User out-degree 6
CMU SCS Degree Distributions - NetFlix count Theoretically expected Sharp drop below 100 ratings 100 machines - 8min User out-degree Yahoo/Hadoop, 2008 7
CMU SCS Degree Distributions - Kronecker count degree 100 machines - 6h Yahoo/Hadoop, 2008 Nodes:259M - Edges: 1B 8
CMU SCS Degree Distributions - timings Time (sec) 24 tasks 48 tasks 1 task Yahoo/Hadoop, Edge 2008 file size (MB) 9
CMU SCS Outline Centralized Hadoop Degree Distribution old old Pagerank old old Diameter/ANF old X Communities old X Triangles X todo Visualization X todo Datasets: (a) Synthetic (‘Kronecker’, 300M nodes, 1B edges) Yahoo/Hadoop, (b) NetFlix (20K movies, 500K2008 users, 100M edges) 10
CMU SCS Diameter Diameter of a graph Maximum shortest path Normally, O(N**2) ANF : Approximate Neighborhood function’ [Palmer 02]: O(E) Goal : calculate neighborhood function Neighborhood N(h) : number of pairs of nodes within distance h Yahoo/Hadoop, 2008 11
CMU SCS Time (min) Diameter 1 node 48 nodes 28 nodes Edge file (MB) For large jobs, parallelization Yahoo/Hadoop, 2008 helps 12
CMU SCS Diameter / Hop Plot (Netflix) # of reachable pairs within h hops h: # of hops Yahoo/Hadoop, 2008 13
CMU SCS Diameter / Hop Plot (Netflix) # of reachable pairs within h hops Diameter: 3 h: # of hops Yahoo/Hadoop, 2008 14
CMU SCS Outline Centralized Hadoop Degree Distribution old old Pagerank old old Diameter/ANF old X Communities old X Triangles X todo Visualization X todo Datasets: (a) Synthetic (‘Kronecker’, 300M nodes, 1B edges) Yahoo/Hadoop, (b) NetFlix (20K movies, 500K2008 users, 100M edges) 15
CMU SCS Community detection Cross associations [Chakrabarti ’04] Yahoo/Hadoop, 2008 16
CMU SCS Community detection Yahoo/Hadoop, 2008 17
CMU SCS Outline Centralized Hadoop Degree Distribution old old Pagerank old old Diameter/ANF old X Communities old X Triangles X todo Visualization X todo Datasets: (a) Synthetic (‘Kronecker’, 300M nodes, 1B edges) Yahoo/Hadoop, (b) NetFlix (20K movies, 500K2008 users, 100M edges) 18
CMU SCS Triangles ‘friends of friends are friends’ Yahoo/Hadoop, 2008 19
CMU SCS Triangles ‘friends of friends are friends’ Yahoo/Hadoop, 2008 20
CMU SCS Triangles ‘friends of friends are friends’ Naïve algo: 3-way join (slow) [Tsourakakis’08]: # triangles sum of cubes of eigenvalues Thus, super-fast computation of #triangles (100x - 25,000x faster than naïve; 95% accuracy Yahoo/Hadoop, 2008 21
CMU SCS Triangles Easy to implement on hadoop: it only needs eigenvalues (to do, with Lanczos) Yahoo/Hadoop, 2008 22
CMU SCS Outline Centralized Hadoop Degree Distribution old old Pagerank old old Diameter/ANF old X Communities old X Triangles X todo Visualization X todo Datasets: (a) Synthetic (‘Kronecker’, 300M nodes, 1B edges) Yahoo/Hadoop, (b) NetFlix (20K movies, 500K2008 users, 100M edges) 23
CMU SCS Visualization Principled visualization of large graphs Yahoo/Hadoop, 2008 (show few most important’ edges) 24
CMU SCS Summary Centralized Hadoop Degree Distribution old old Pagerank old old Diameter/ANF old X Communities old X Triangles X todo Visualization X todo Goal: one-stop solution for mining huge graphs Yahoo/Hadoop, 2008 25