Big Data Sources – Web, Social media and Text Analytics
Piet Daas, Olav ten Bosch, Ali Hürriyetoglu, Dick Windmeijer
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Eurostat
ESTP Big Data training course nr. 3

Overview
- Hands-on (learning by doing)
- Learn how to:
  - Collect ‘data’ from Web pages and Social media
  - Process ‘data’
  - Analyse ‘data’
- Learn how to extract information from textual data: text mining, text analytics, Natural Language Processing

Overview
Day 1
- Introduction
- Social media and official statistics
- Exercise: Create ‘keys’ for Twitter API access
- Exercise: Connect to Twitter API
- Exercise: Get user, profile and tweets (in your own language)

Overview (2)
Day 2
- Web scraping explained
- Exercise: Use web robots
- Web scraping tips and tricks
- Exercise: Learn how to collect data from websites
- Feedback
Day 3
- Text mining and topic identification of tweets
- Exercise: Analyse tweets: identify topics
- Sentiment analysis
- Exercise: Analyse tweets: sentiment & more
- Natural Language Processing Demonstration
- Exercise: Extra time for more advanced analysis

Overview (3)
Day 4
- Text mining of web pages
- Exercise: Analyse document: content
- Exercise: Analyse web sites: content & topics
- Overview of the course & dealing with private data
- Exercise: Time to redo exercises/extra work
- Feedback
- Wrapping up, removing data

Why analyse text?
- Texts are a source of information not commonly used in official statistics
- Potential applications, all done automatically:
  - Classify answers to open questions
  - Code descriptions of jobs, education and products
  - Identify the activity code of companies from web site text
  - Identify products in detail from descriptions on web sites
  - Classify cause of death from medical reports
  - Sentiment analysis of messages
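
To make the first application concrete, here is a minimal sketch of automatically coding answers to an open question. It is a toy keyword-matching approach, not the method taught in the course; the categories, keywords and example answers are all invented for illustration.

```python
# Hypothetical categories and keywords, invented for illustration only.
KEYWORDS = {
    "health": {"doctor", "nurse", "hospital"},
    "education": {"teacher", "school", "lecturer"},
    "transport": {"driver", "bus", "train"},
}


def classify(answer):
    """Return the category sharing the most keywords with the answer, or 'unknown'."""
    # Naive whitespace split: punctuation attached to a word is not removed.
    words = set(answer.lower().split())
    scores = {category: len(words & kws) for category, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


# Hypothetical answers to the open question "What is your job?"
answers = ["I work as a nurse in a hospital", "Bus driver", "Freelance designer"]
labels = [classify(answer) for answer in answers]

for answer, label in zip(answers, labels):
    print(f"{label:10s} {answer}")
```

Answers that match no keyword are labelled 'unknown' rather than forced into a category, so they can be reviewed by hand. A real application would need proper tokenisation and a much larger, validated keyword list or a trained classifier.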

Why analyse text? (2)
- It is therefore important to learn how to extract information from textual data
- This training course focuses on this topic
  - The goal is to learn the basics through a hands-on approach
  - It is a starting point for more advanced studies
  - Key steps are: collection, processing and analysis
- Obtain insight into methods and approaches that can be applied to extract information from texts

Examples of interesting books
- Manning, Schütze (1999) Foundations of Statistical Natural Language Processing, MIT Press
- Feldman, Sanger (2007) The Text Mining Handbook, Cambridge Univ. Press
- Kao, Poteet (2007) Natural Language Processing and Text Mining, Springer
- Manning, Raghavan, Schütze (2008) Introduction to Information Retrieval, Cambridge Univ. Press
- Weiss, Indurkhya, Zhang (2010) Fundamentals of Predictive Text Mining, Springer
- Aggarwal, Zhai (2012) Mining Text Data, Springer
- Miner, Elder, Fast, Hill, Nisbet, Delen (2012) Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications, Elsevier

Practical tips
- Use our laptops
- Dual boot Windows / Linux
- Need to collect your own data!
- Connect to WiFi (CBS-Public)
- Web robots: via browser plugin (Windows)
- Twitter data: either in R or in Python (Linux)
- Python Notebooks will be distributed

R-packages for text analytics
- tm: Text Mining Package
  A framework for text mining applications within R
- NLP: Natural Language Processing Infrastructure
  Basic classes and methods for Natural Language Processing
- SnowballC: Snowball ‘stemmers’
  An R interface to the C libstemmer library. Currently supported languages are Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish.
- stringr: Wrappers for Common String Operations
  A consistent, simple and easy to use set of wrappers for string operations
- wordcloud: Word Clouds
  For pretty word clouds
- RColorBrewer: ColorBrewer Palettes
  Provides color schemes for maps (and other graphics)
- twitteR: R Based Twitter Client
  Provides an interface to the Twitter web API
More info: https://cran.r-project.org/package=<package name>

Text analytics libraries for Python
- NLTK: Natural Language Toolkit
  Collection of NLP tools
- TextBlob
  Built on top of NLTK, especially useful for beginners
- Pattern
  Web mining module for Python and more
- Gensim
  For topic modeling and similarity detection
- spaCy
  Fast NLP implementation
- Pyparsing
  For parsing text
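
Before reaching for any of these libraries, it can help to see the basic processing and analysis steps in plain Python. The sketch below tokenises a few messages, drops stop words and counts word frequencies using only the standard library. The messages and the stop word list are invented for illustration; the libraries above offer far more complete tokenisers and stop word lists.

```python
import re
from collections import Counter

# Hypothetical example messages; in practice these would be collected tweets.
messages = [
    "Big data is a new source for official statistics",
    "Official statistics can use social media data",
    "Text mining turns text into data",
]

# A tiny hand-picked stop word list (an assumption made for this sketch).
STOP_WORDS = {"a", "is", "for", "can", "into", "the", "and"}


def tokenize(text):
    """Lowercase the text and return its words (runs of Unicode letters)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def word_frequencies(texts):
    """Count how often each non-stop word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(token for token in tokenize(text) if token not in STOP_WORDS)
    return counts


frequencies = word_frequencies(messages)
for word, count in frequencies.most_common(4):
    print(word, count)
```

The regular expression matches runs of letters in any script, so accented characters in non-English tweets stay inside their words. Counts like these are the usual input for a word cloud or a first look at topics.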

Essential step for Twitter studies

Create keys for Twitter API access
- Make sure you have a Twitter account
  If not, go to https://twitter.com/signup
- Log in and visit https://apps.twitter.com/app/new
- Fill in a name, description and web site, and agree to the terms
- Copy all four keys and tokens, paste them into a text file and save it! (Don’t share them)
- You will need them during this course!
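
One possible way to keep the four saved values out of your scripts is to store them in a small file and load them when needed. The sketch below assumes a JSON file; the file name and the four field names are hypothetical labels chosen for this example, so adapt them to the labels shown on the Twitter application page. It makes no call to Twitter; it only reads the file, checks that nothing is missing and prints masked values.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical field names for the four keys and tokens (an assumption).
REQUIRED_KEYS = (
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_token_secret",
)


def load_credentials(path):
    """Read a JSON credentials file and check that all four values are present."""
    credentials = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not credentials.get(key)]
    if missing:
        raise ValueError("Missing credentials: " + ", ".join(missing))
    return {key: credentials[key] for key in REQUIRED_KEYS}


def mask(value, visible=4):
    """Hide all but the last few characters so secrets are not printed in full."""
    return "*" * max(len(value) - visible, 0) + value[-visible:]


# Demonstration with placeholder values written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "twitter_keys.json"  # hypothetical file name
    placeholders = {key: "PLACEHOLDER-" + key for key in REQUIRED_KEYS}
    demo_file.write_text(json.dumps(placeholders), encoding="utf-8")
    creds = load_credentials(demo_file)

for name, value in creds.items():
    print(name, mask(value))
```

Keeping the file outside any shared folder or version-controlled directory follows the advice above not to share the keys.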