Text Mining & Natural Language Processing Ali Hürriyetoglu, Piet Daas THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Eurostat
Outline
- Introduction
- Background
- Basic steps
- Use cases
- Machine learning for text mining
Introduction
What can you do with text mining?
- Named entity recognition
- Sentiment analysis
- Topic detection
- Information extraction
- Trend detection
- Clustering similar documents
- Automatic summarisation
Ingredients of text mining
Text analytics is a function of:
- The amount and type of text you have
- The task you want to achieve
- The precision and recall you want to get
- The time you can spend
Text types
- Semi-structured language use: addresses, phone numbers, named entities, etc.
- Standard text: news articles, books, etc.
- User-generated text: social media, comments
Background
Text
Text is a rich combination of symbols that lead to a structure, which has a context-dependent interpretation.
- Symbols: character, word, punctuation, digit, emoticon
- Structure: tokens, links, user names, hashtags, noun, verb, named entity, emoticon, phrases, codes, etc.
- Context: writer, genre, platform, social environment, time, geographic location, etc.
- Interpretation: sense, meaning
Symbols
- Letters: A B Ç X
- Digits: 1 5 3 2
- Punctuation: . , ! ?
- Emoticons:
- Special characters: # &
Structure
- Tokens: any space-separated symbol sequence (for European languages)
- Numbers: 6, 123
- Web-specific tokens: user names, hashtags, URLs
- Abbreviations: vs., etc.
- Syntactic interpretation: noun, verb, adjective
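
The web-specific tokens listed above can often be picked out with simple patterns. The following is a minimal sketch, not a platform-accurate tokenizer: the three regular expressions and the helper name `web_tokens` are assumptions made for illustration, and real platforms accept more characters than these patterns allow (an e-mail address, for instance, would be partly matched as a mention).

```python
import re

# Illustrative patterns only (assumptions); real platforms allow more characters.
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+")

def web_tokens(text):
    """Return the user names, hashtags, and URLs found in a text."""
    urls = URL.findall(text)
    # Blank out URLs first so '#fragment' parts are not counted as hashtags.
    rest = URL.sub(" ", text)
    return {
        "mentions": MENTION.findall(rest),
        "hashtags": HASHTAG.findall(rest),
        "urls": urls,
    }

sample = "@eurostat new figures out #statistics #EU https://example.org/data#latest"
print(web_tokens(sample))
```

Removing the URLs before looking for hashtags is a design choice: it keeps URL fragments from inflating hashtag counts.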
Context
Anything about the use of a token may have a significant effect:
- The person who uses it
- The aim of the phrase
- The time and place of the language use
- Preceding and following expressions
Interpretation
Tokens and phrases may have one or more interpretations.
- Ambiguity: the lexical meaning may differ
- Named entities: the same entity name may refer to different real-world entities
- Genre: orders, compliments, statements, instructions, etc.
- Usernames: the same username is interpreted differently on different platforms
Basic steps
Basic steps and tools
You need some combination of:
- Language identification
- Sentence splitting
- Tokenization
- Lemmatization
- Anaphora resolution
- Regular expressions
- POS tagging
- Named entity recognition
- Parsing methodology, e.g. Pyparsing
- Language resources: stop words, a sentiment lexicon, multi-word expressions, ontology, etc.
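
Three of these steps (sentence splitting, tokenization, and stop-word filtering) can be sketched with regular expressions alone. This is a rough illustration under stated assumptions: the splitting rule, the token pattern, the tiny `STOP_WORDS` set, and all function names are invented for this example. The naive splitter will wrongly break after abbreviations such as "vs.", which is one reason dedicated tools exist.

```python
import re

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "and", "to", "it"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Lowercase, then keep letter runs (with inner apostrophes) and digit runs."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*|\d+", sentence.lower())

def content_tokens(text):
    """One list of non-stop-word tokens per sentence."""
    return [
        [tok for tok in tokenize(sentence) if tok not in STOP_WORDS]
        for sentence in split_sentences(text)
    ]

print(content_tokens("Prices are rising. It is the third increase in 2016!"))
```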
Use cases
Named entities
Problem: You want to know which named entities occur in a text. You do not have much time or many resources. An approximate result is sufficient.
Solution: Find and count all proper-cased token sequences:
    ([A-Z][a-z]+(\s[A-Z][a-z]+)+)
Example output:
    ('Sherlock Holmes', 90), ('United States', 71), ('New York', 54), ('New England', 46), ('Baker Street', 29), ...
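
The counting step can be sketched as follows. The pattern assumes a `+` quantifier after each character class and after the repeated group, so that it matches two or more capitalised words in a row; the sample text and the function name `count_named_entities` are made up for illustration. Non-capturing groups are used here because `re.findall` would otherwise return tuples of groups rather than whole matches.

```python
import re
from collections import Counter

# Two or more capitalised words in a row, e.g. "Sherlock Holmes".
PROPER_SEQ = re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)+")

def count_named_entities(text):
    """Count every proper-cased token sequence in the text."""
    return Counter(m.group(0) for m in PROPER_SEQ.finditer(text))

sample = ("Sherlock Holmes lived in Baker Street. "
          "Sherlock Holmes once visited New York.")
print(count_named_entities(sample).most_common())
```

This approach is approximate by design: it misses single-word names and will also pick up capitalised sentence openings that happen to be followed by a name.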
Street names
Problem: You have a set of crime reports. You wonder which street names are mentioned most often.
Solution: Write a more specific regular expression:
    [A-Z][a-z]+ [sS]treet
Example output:
    ('Baker Street', 29), ('Leadenhall Street', 5), ('Fresno Street', 2), ('Fenchurch Street', 2), ('Bow Street', 2), ('Oxford Street', 2), ...
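
A short sketch of the street-name count, assuming a `+` quantifier after the lowercase class. The word boundaries (`\b`) are an addition not shown on the slide, and the sample sentence and function name are invented for illustration.

```python
import re
from collections import Counter

# One capitalised word followed by "street" or "Street".
# The \b word boundaries are an assumption added for this sketch.
STREET = re.compile(r"\b[A-Z][a-z]+ [sS]treet\b")

def count_streets(text):
    """Count street-name mentions of the form '<Name> Street'."""
    return Counter(STREET.findall(text))

sample = "He walked from Baker Street to Oxford Street and back to Baker Street."
print(count_streets(sample).most_common())
```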
Detect economic indicators
Problem: You want to detect and track price changes. You want to be precise. You know what you are looking for and can spend some time specifying it.
Solution: Parse the text with Pyparsing*
    from pyparsing import Literal, Word, alphas, oneOf
    # Words that signal a change, and words that signal an economic quantity
    action = oneOf(["lower", "increase", "decrease"], caseless=True)
    econ = oneOf(["prices", "expense", "cost", "price"], caseless=True)
    item = Word(alphas)
    # e.g. "lower fuel prices"
    economy_grammar = action("action") + item("item") + econ
    # e.g. "price of fuel increase"
    economy_grammar2 = econ + Literal("of") + item + action
*For R, use the tm package.
Sentiment Analysis
Problem: You want to understand how people feel about a certain issue or entity.
Solution 1: Create a sentiment lexicon or use an available one. Count the number of occurrences of the lexicon entries in the text.
Solution 2: Perform a detailed syntactic and semantic analysis.
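
Solution 1 can be sketched in a few lines. The two word sets below are a hypothetical mini-lexicon invented for this example, and the score (positive hits minus negative hits) is just one possible count-based definition. The sketch ignores negation, so "not good" still counts as positive.

```python
import re

# Hypothetical mini-lexicon for illustration; a real lexicon is far larger.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "poor", "sad", "hate", "terrible"}

def sentiment_score(text):
    """Count-based score: positive lexicon hits minus negative lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(tok in POSITIVE for tok in tokens)
    negative_hits = sum(tok in NEGATIVE for tok in tokens)
    return positive_hits - negative_hits

for text in ["I love this great service", "Terrible, bad and sad", "It is a table"]:
    print(text, "->", sentiment_score(text))
```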
Word clouds
Problem: You have a text and want quick insight into what it mostly contains.
Solution: Word cloud, streamgraph, t-SNE
https://github.com/amueller/word_cloud/blob/master/examples/constitution.png
Track co-evolution of language use
https://blog.twitter.com/2010/the-2010-world-cup-a-global-conversation
Topic modelling
Problem: You need a detailed analysis of the topics in a text collection (corpus).
Solution: Topic modelling
http://alexperrier.github.io/jekyll/update/2015/09/04/topic-modeling-of-twitter-followers.html
Machine learning
Machine Learning
You can attempt to solve almost any text mining task with machine learning approaches. The outcome will depend on:
- Feature extraction and selection
- The amount of labeled data, in the case of supervised learning
- The time you have to analyze the output, in the case of unsupervised learning
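
To make the supervised case concrete, here is a minimal sketch of a text classifier: bag-of-words feature extraction followed by multinomial Naive Bayes with add-one smoothing. Naive Bayes is chosen here only as a simple, well-known example; the slides do not name a specific algorithm. The four training texts, their labels, and all class and function names are invented for illustration, and a real task would need far more labeled data.

```python
import math
import re
from collections import Counter, defaultdict

def features(text):
    """Feature extraction: a bag of lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(features(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in features(text):
                if word in self.vocab:             # skip unseen words
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy training data (an assumption for this sketch).
train_texts = ["prices increase again", "costs rise sharply",
               "great match today", "the team won the match"]
train_labels = ["economy", "economy", "sport", "sport"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("prices rise"))
```

With so little training data the model only recognises words it has already seen, which illustrates why the amount of labeled data matters for the outcome.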
Thanks for listening! Any questions or comments?
Exercises
6) Search for key terms on Twitter and collect n tweets (n 200)
7) Determine the most frequent hashtags, links, and mentions
8) Create a word cloud of these tweets
9) Topic detection from tweets (either a user's tweets or a key-term search result)
10) Sentiment analysis: create your own list of 10 positive and 10 negative words and calculate a count-based score
11) Look for an online classifier (for the language of your tweets), get an access key, and test it (watch the rate limit), e.g. MonkeyLearn
12) Study emoticons as an example of basic emotions
Additional exercises
Additional tasks:
13) Detect place names, person names, organisation names, numbers, and dates; determine geolocation/temporal characteristics; find similar tweets
14) Apply the t-distributed stochastic neighbour embedding (t-SNE) visualization technique to tweets