Text mining and machine learning: examples from life Evgeny Klochikhin, PhD American Institutes for Research Tech Talk - DCDataFest 2015
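Rule #1: TEXT IS NOT NUMBERS
Example: The down is falling down. (The first "down" is a noun, soft feathers; the second is an adverb of direction. The same string carries two meanings.)
2015 Evgeny Klochikhin, PhD American Institutes for Research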
Rule #2: METHOD DEPENDS ON APPLICATION
Use cases:
- Text categorization
- Validation of record linkage
- Knowledge discovery
- Document clustering and classification
Use case #1: Text categorization
- Where do the categories come from?
- Do we have a definite number of classes, or do we let the machine decide?
- Are there any additional variables (e.g. metadata)?
Choices: topic modeling, information retrieval, machine classification
Use case #2: Knowledge discovery
- Do we know what knowledge we want to discover?
- Is there a ‘gold standard’ data set, or ground truth?
Choices: information retrieval/NLP, active learning, machine classification
Rule #3: MAKE SURE SOFTWARE IS ROBUST
Examples:
- Topic modeling: Mallet vs gensim
- Explicit Semantic Analysis: EasyESA vs esalib2
Rule #4: NOTHING IS FULLY AUTOMATED
Humans should always be involved (curate, validate, ground truth)
Examples:
- General corpora: Mechanical Turk and CrowdFlower
- Scientific corpora: expert curators
Implementation: usual steps
- Data collection
- Data organization
- Data cleaning
- Pre-processing: remove common stop words, tokenize, TF-IDF
- Apply method
- Post-processing: validation and evaluation
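The pre-processing step listed above (stop-word removal, tokenization, TF-IDF) can be sketched in plain Python. This is a minimal illustration, not the speaker's pipeline: the stop-word list, the sample documents, and the weighting formula (raw count times log(N / df)) are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "of", "a", "an", "and", "to", "in", "for", "will", "be"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(text):
    """Tokenize the text and drop stop words."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = raw count of the term in the document
    idf = log(N / df), where df is the number of documents containing the term
    """
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights


# Hypothetical mini-corpus, loosely echoing the abstracts used in this talk.
docs = [
    "The goal of this project is atomic force microscopy",
    "Edible films for food safety and food quality",
    "Food safety and the control of pathogen contamination",
]
weights = tf_idf(docs)
for doc_weights in weights:
    print(sorted(doc_weights.items(), key=lambda kv: -kv[1])[:3])
```

Terms that appear in every document get a weight of zero under this formula, which is why libraries often use smoothed variants.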
TOPIC MODELING
What is text: ‘bag-of-words’
Vector space representation of text: every word has its unique ID (e.g., ‘microscopy’ 0, ‘afm’ 1, ‘topography’ 2, ‘nanoscale’ 3, etc.) and the number of occurrences within the document.

Award 0814615: Systems Approach to Dynamic Atomic Force Microscopy
Abstract: The goal of this project is to establish a framework for model based simultaneous topography and parameter estimation in the amplitude modulation atomic force microscopy (AFM). Parametric models of tip-sample interaction that are amenable to real-time identification will be developed. Harmonic balance and power balance tools will be incorporated towards the estimation of the model parameters. The amplitude and phase dynamics based on the model will be developed, which will be used to validate the model with experimental data and subsequently used for control design purposes. These methods will be used to study yeast cells. A framework for non-parametric reconstruction of tip-sample interaction potential will be researched. Limitations on how well amplitude modulated AFM can decipher different sample interactions will be studied.

[Figure: bar chart of word counts for this abstract; x-axis: word IDs (0 to 55), y-axis: # of instances (0 to 5)]
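A minimal sketch of the bag-of-words idea described above: assign each distinct word an integer ID, then store a document as (word ID, count) pairs. The token list is hypothetical; the function names are illustrative and do not come from any particular library.

```python
from collections import Counter


def build_vocabulary(tokens):
    """Assign each distinct word an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def bag_of_words(tokens, vocab):
    """Return sorted (word_id, count) pairs for one document."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


# Hypothetical token stream; word order is discarded by the representation.
tokens = ["microscopy", "afm", "topography", "nanoscale", "afm", "microscopy", "afm"]
vocab = build_vocabulary(tokens)
bow = bag_of_words(tokens, vocab)
print(vocab)
print(bow)
```

Because only IDs and counts are kept, two documents with the same words in a different order get the same representation.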
What is topic modeling (D. Newman)
The topic model is an algorithm that automatically learns topics (themes) from a collection of documents
– It works by observing words that tend to co-appear in documents, for example gene and dna, or climate and warming
– The topic model assumes each document exhibits multiple topics
– The topic model learns topics directly from the text
Each topic is displayed by showing its top-20 words, for example:
– dark matter cosmological cosmology universe dark energy lensing survey CMB redshift cosmic mass galaxy scale galaxies gravitational measurement power spectrum parameter observation structure
– This is a topic about Dark Matter, Dark Energy and Cosmology
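The raw signal a topic model relies on, words that tend to co-appear in the same documents, can be illustrated with a simple document-level co-occurrence count. This is not a topic model (no topics are learned and no probabilities are estimated); it only shows the kind of evidence such a model works from. The documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """For each unordered word pair, count the documents that contain both words."""
    pairs = Counter()
    for tokens in docs:
        # Use the set of distinct words so a pair counts once per document.
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical tokenized documents.
docs = [
    ["gene", "dna", "sequence"],
    ["gene", "dna", "protein"],
    ["climate", "warming", "ocean"],
    ["climate", "warming", "gene"],
]
pairs = cooccurrence_counts(docs)
print(pairs.most_common(2))
```

Here "gene"/"dna" and "climate"/"warming" each share two documents, while a cross-theme pair such as "climate"/"gene" shares only one.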
Examples

Abstract excerpt: Engineering for food safety and quality
The food industry is one of the most conservative among industries in the United States; it is experiencing, like never before, the need for change, for innovation. Consumers are much more demanding and better educated in terms of food quality and nutritional aspects, regulatory agencies are searching for technologies that offer better products with greater safety
Top-3 topics (probability score):
- pathogen foodborne safety farm contamination control intervention food-borne borne reduce (0.32)
- poultry campylobacter jejuni chicken salmonella broiler egg colonization avian vaccine (0.32)
- symptom abdominal treatment vomiting cramp protect patient dos vaccine testing (0.16)

Abstract excerpt: Edible coatings to improve food quality and food safety and minimize packaging cost
An edible film resembles plastic film wrap but is formed from renewable edible protein (e.g., milk protein) and/or polysaccharide (e.g., cornstarch). Edible films can be used as food wraps or formed into pouches for foods, thus reducing use of synthetic plastic films. Edible films can also be formed directly on the surfaces of the food as coatings to protect or enhance the food in some manner, becoming part of the food and remaining on the food through consumption.
Top-3 topics (probability score):
- produce fresh outbreak coli contamination pathogen spinach lettuce salmonella o157 (0.53)
- mycotoxin aflatoxin fungi fungal grain aspergillus feed flavus toxin fusarium (0.15)
- detection rapid phase method detect pathogen assay sensor sensitive biosensor (0.09)
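Given a document's topic distribution, reporting its top-3 topics as in the examples above is a simple ranking step. The sketch below assumes the distribution is available as a dict of topic ID to probability. The topic IDs are hypothetical, and only the values 0.53, 0.15 and 0.09 echo the second example; the remaining values are filler so the distribution sums to 1.

```python
def top_topics(doc_topics, k=3):
    """Return the k (topic_id, probability) pairs with the highest probability."""
    ranked = sorted(doc_topics.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical topic IDs; the three largest values echo the second example.
doc_topics = {17: 0.53, 4: 0.15, 42: 0.09, 8: 0.08, 23: 0.08, 5: 0.07}

for topic_id, prob in top_topics(doc_topics):
    print(f"topic {topic_id}: {prob:.2f}")
```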
Software
MALLET - http://mallet.cs.umass.edu/
Sample steps:
– Import documents:
    bin/mallet import-dir --input /data/topic-input --output topic-input.mallet \
        --keep-sequence --remove-stopwords
– Build the model:
    bin/mallet train-topics --input topic-input.mallet \
        --num-topics 100 --output-state topic-state.gz
– Infer topics:
    bin/mallet infer-topics --inferencer-filename [FILENAME]
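When the import and training steps need to be scripted, one option is to assemble the MALLET command lines in Python and hand them to `subprocess`. The sketch below only builds and prints the commands; the helper names are hypothetical, the flags are written in the hyphenated form used in the MALLET documentation (`--keep-sequence`), and actually running them assumes MALLET is installed with `bin/mallet` reachable from the working directory.

```python
import shlex


def import_dir_cmd(input_dir, output_file, mallet="bin/mallet"):
    """Build the argument list for importing a directory of text files."""
    return [mallet, "import-dir", "--input", input_dir, "--output", output_file,
            "--keep-sequence", "--remove-stopwords"]


def train_topics_cmd(input_file, num_topics, state_file, mallet="bin/mallet"):
    """Build the argument list for training a topic model."""
    return [mallet, "train-topics", "--input", input_file,
            "--num-topics", str(num_topics), "--output-state", state_file]


commands = [
    import_dir_cmd("/data/topic-input", "topic-input.mallet"),
    train_topics_cmd("topic-input.mallet", 100, "topic-state.gz"),
]

for cmd in commands:
    # Print a shell-safe rendering; to execute, one could call
    # subprocess.run(cmd, check=True) instead (requires MALLET).
    print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a path contains spaces.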