7/26/2015
Successful Data Mining Methods for NLP
Jiawei Han (UIUC), Heng Ji (RPI), Yizhou Sun (NEU)
Slides: http://hanj.cs.illinois.edu/slides/dmnlp15.pptx and http://nlp.cs.rpi.edu/paper/dmnlp15.pptx
Introduction: Where do NLP and DM Meet?
Slightly Different Research Philosophies
NLP: deep understanding of individual words, phrases, and sentences ("micro-level"); focus on unstructured text data
Data Mining (DM): high-level (statistical) understanding, discovery, and synthesis of the most salient information ("macro-level"); historically more on structured and semi-structured data
[Figure: network of entities related to "Health Care Bill" - NewsNet (Tao et al., 2014)]
DM Solution: Data to Networks to Knowledge (D2N2K)
Advantages of NLP: construct graphs/networks with fine-grained semantics from unstructured texts; use large-scale annotations for real-world data
Advantages of DM: deep understanding through structured/correlation inference
Use a structured representation (e.g., graph, network) as a bridge to capture interactions between NLP and DM
Example: Heterogeneous Information Networks [Han et al., 2010; Sun et al., 2012]
Data → Networks → Knowledge
A Promising Direction: Integrating DM and NLP
Major theme of this tutorial: applying novel DM methods to solve traditional NLP problems; integrating DM and NLP to transform Data to Networks to Knowledge
Road map of this tutorial:
Effective network construction by leveraging information redundancy
  Theme I: Phrase Mining and Topic Modeling from Large Corpora
  Theme II: Entity Extraction and Linking by Relational Graph Construction
Mining knowledge from structured networks
  Theme III: Search and Mining Structured Graphs and Heterogeneous Networks
Looking forward to the future
Theme I: Phrase Mining and Topic Modeling from Large Corpora
Why Phrase Mining?
Phrase: Minimal, unambiguous semantic unit; basic building block for information network and knowledge base
Unigrams vs. phrases
  Unigrams (single words) are ambiguous. Example: "United": United States? United Airlines? United Parcel Service?
  Phrase: a natural, meaningful, unambiguous semantic unit. Example: "United States" vs. "United Airlines"
Mining semantically meaningful phrases: transform text data from word granularity to phrase granularity; enhance the power and efficiency of manipulating unstructured data using database technology
Mining Phrases: Why Not Just Use NLP Methods?
Phrase mining originated from the NLP community ("chunking")
  Modeled as a sequence labeling problem (B-NP, I-NP, O, …)
  Needs annotation and training: annotate hundreds of POS-tagged documents as training data; train a supervised model based on part-of-speech features
  Recent trend: use distributional features based on web n-grams (Bergsma et al., 2010)
  State-of-the-art performance: ~95% accuracy, ~88% phrase-level F-score
Limitations
  High annotation cost; not scalable to a new language, domain, or genre
  May not fit domain-specific, dynamic, emerging applications: scientific domains, query logs, or social media, e.g., Yelp, Twitter
  Uses only local features; no ranking, no links to topics
Data Mining Approaches for Phrase Mining
General principle: corpus-based; fully exploit information redundancy and data-driven criteria to determine phrase boundaries and salience; use local evidence to adjust corpus-level data statistics
Phrase Mining and Topic Modeling from Large Corpora
  Strategy 1: Simultaneously inferring phrases and topics - bigram topical model [Wallach'06], topical n-gram model [Wang et al.'07], phrase-discovering topic model [Lindsey et al.'12]
  Strategy 2: Post topic modeling phrase construction - label topics [Mei et al.'07], TurboTopic [Blei & Lafferty'09], KERT [Danilevsky et al.'14]
  Strategy 3: First phrase mining, then topic modeling - ToPMine [El-Kishky et al., VLDB'15]
Integration of phrase mining with document segmentation - SegPhrase [Liu et al., SIGMOD'15]
Strategy 1: Simultaneously Inferring Phrases and Topics
Bigram Topic Model [Wallach'06]: probabilistic generative model that conditions on the previous word and topic when drawing the next word
Topical N-Grams (TNG) [Wang et al.'07]: probabilistic model that generates words in textual order; creates n-grams by concatenating successive bigrams (a generalization of the Bigram Topic Model)
Phrase-Discovering LDA (PDLDA) [Lindsey et al.'12]: viewing each sentence as a time series of words, PDLDA posits that the generative parameter (topic) changes periodically; each word is drawn based on the previous m words (context) and the current phrase topic
High model complexity: tends to overfit; high inference cost: slow
Strategy 2: Post Topic Modeling Phrase Construction
TurboTopics [Blei & Lafferty’09] – Phrase construction as a post‐processing step to Latent Dirichlet Allocation
Perform Latent Dirichlet Allocation on corpus to assign each token a topic label
Merge adjacent unigrams with the same topic label by a distribution‐free permutation test on arbitrary‐length back‐off model
End recursive merging when all significant adjacent unigrams have been merged
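The merging step above can be sketched in a few lines (a simplification: TurboTopics guards each merge with a permutation significance test, which is omitted here; only the merging mechanics on LDA-labeled tokens are shown).

```python
# Sketch of TurboTopics-style post-processing: after LDA labels each token
# with a topic, greedily merge adjacent tokens that share a topic label.
# (The real method runs a distribution-free significance test per merge.)

def merge_same_topic(tokens, topic_labels):
    """Greedily merge adjacent tokens that share a topic label."""
    merged = []
    for word, topic in zip(tokens, topic_labels):
        if merged and merged[-1][1] == topic:
            merged[-1] = (merged[-1][0] + " " + word, topic)
        else:
            merged.append((word, topic))
    return merged

# the slide's example: phase/11 transition/11 game/153 theory/127
print(merge_same_topic(["phase", "transition", "game", "theory"],
                       [11, 11, 153, 127]))
# → [('phase transition', 11), ('game', 153), ('theory', 127)]
```

Note that "game" and "theory" stay separate because their topic labels differ, which is exactly the failure mode Strategy 3 later addresses.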
KERT [Danilevsky et al.’14] – Phrase construction as a post‐processing step to Latent Dirichlet Allocation
Perform frequent pattern mining on each topic
Perform phrase ranking based on four different criteria
Example of TurboTopics
Perform LDA on the corpus to assign each token a topic label, e.g., "… phase/11 transition/11 … game/153 theory/127 …"; then merge adjacent unigrams with the same topic label
Framework of KERT
1. Run bag-of-words model inference and assign a topic label to each token
2. Extract candidate keyphrases within each topic (frequent pattern mining)
3. Rank the keyphrases in each topic
Comparability property: directly compare phrases of mixed lengths
[Table: top-ranked keyphrases for the topic "learning", comparing kpRel [Zhao et al. 11], KERT with one ranking criterion removed (-popularity, -discriminativeness, -concordance), and full KERT [Danilevsky et al. 14]. Full KERT surfaces multi-word phrases such as "support vector machines", "reinforcement learning", "feature selection", "conditional random fields", and "decision trees", while the ablated variants and kpRel mix in generic unigrams such as "learning", "classification", "selection", "effective".]
Strategy 3: First Phrase Mining then Topic Modeling
ToPMine [El‐Kishky et al. VLDB’15]
First phrase construction, then topic mining
Contrast with KERT: topic modeling, then phrase mining
The ToPMine Framework:
Perform frequent contiguous pattern mining to extract candidate phrases and their counts
Perform agglomerative merging of adjacent unigrams as guided by a significance score—This segments each document into a “bag‐of‐phrases”
The newly formed bags-of-phrases are passed as input to PhraseLDA, an extension of LDA that constrains all words in a phrase to share the same latent topic
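The first two steps of the framework above can be sketched as follows (a simplification of ToPMine, not the paper's exact algorithm; the helper names and the threshold are mine): count contiguous n-grams, then agglomeratively merge the adjacent pair with the highest significance score until no pair clears the threshold.

```python
from collections import Counter
from math import sqrt

def ngram_counts(docs, max_n=3):
    # frequent contiguous pattern mining, here as plain n-gram counting
    counts = Counter()
    for doc in docs:
        for n in range(1, max_n + 1):
            for i in range(len(doc) - n + 1):
                counts[tuple(doc[i:i + n])] += 1
    return counts

def significance(left, right, counts, total):
    # alpha(P1,P2) ~ (f(P1+P2) - mu0(P1,P2)) / sqrt(f(P1+P2)),
    # mu0 being the expected count if the two sub-phrases were independent
    f12 = counts[left + right]
    if f12 == 0:
        return float("-inf")
    mu0 = counts[left] * counts[right] / total
    return (f12 - mu0) / sqrt(f12)

def to_bag_of_phrases(doc, counts, total, threshold=1.0):
    # agglomerative merging of adjacent units, best-scoring pair first
    parts = [(w,) for w in doc]
    while len(parts) > 1:
        scores = [significance(parts[i], parts[i + 1], counts, total)
                  for i in range(len(parts) - 1)]
        i = max(range(len(scores)), key=scores.__getitem__)
        if scores[i] < threshold:
            break
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return [" ".join(p) for p in parts]
```

On a toy corpus where "support vector machine" recurs, the three tokens merge into one phrase while incidental neighbors stay separate.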
Why First Phrase Mining then Topic Modeling?
With Strategy 2, tokens in the same phrase may be assigned to different topics
  Ex: "knowledge discovery using least squares support vector machine classifiers …" - "knowledge discovery" and "support vector machine" should receive coherent topic labels
Solution: switch the order of phrase mining and topic model inference
  [knowledge discovery] using [least squares] [support vector machine] [classifiers] …
Techniques Phrase mining and document segmentation Topic model inference with phrase constraint
Phrase Mining: Frequent Pattern Mining + Statistical Analysis
Quality phrases scored by a significance score [Church et al.'91]:
  α(P1, P2) ≈ (f(P1⊕P2) − μ0(P1, P2)) / √f(P1⊕P2)
where f(P1⊕P2) is the observed frequency of the merged phrase and μ0 is its expected frequency if P1 and P2 were independent
Ex: [Markov blanket] [feature selection] for [support vector machines] … [knowledge discovery] using [least squares] [support vector machine] [classifiers] … [support vector] for [machine learning] …

Phrase                     | Raw freq. | True freq.
[support vector machine]   |    90     |    80
[vector machine]           |    95     |     0
[support vector]           |   100     |    20
Collocation Mining
Collocation: a sequence of words that co-occur more frequently than expected. Collocations are often "interesting" and, due to their non-compositionality, often convey information not carried by their constituent terms (e.g., "made an exception", "strong tea")
Many different measures are used to extract collocations from a corpus [Dunning 93, Pedersen 96]: mutual information, t-test, z-test, chi-squared test, likelihood ratio
Many of these measures can be used to guide the agglomerative phrase‐ segmentation algorithm
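One of the measures listed above, the t-test statistic, in its textbook form (the counts below are invented for illustration): compare the observed bigram probability with the one expected under independence.

```python
from math import sqrt

def t_score(c_xy, c_x, c_y, n):
    # (observed - expected) / sqrt(s^2 / N), approximating s^2 by p(x,y),
    # the usual simplification for rare bigrams
    p_xy, p_x, p_y = c_xy / n, c_x / n, c_y / n
    return (p_xy - p_x * p_y) / sqrt(p_xy / n)

# "strong tea": say the bigram occurs 30 times in a 1M-token corpus,
# "strong" 4000 times and "tea" 2000 times
print(round(t_score(30, 4000, 2000, 1_000_000), 2))   # → 4.02
```

A t-score above 2.576 rejects independence at the 0.5% level, so this pair would count as a collocation under the t-test.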
ToPMine: PhraseLDA (Constrained Topic Modeling)
The generative model for PhraseLDA is the same as LDA
Difference: the model incorporates constraints obtained from the "bag-of-phrases" input
The chain graph shows that all words in a phrase are constrained to take the same topic value
[knowledge discovery] using [least squares] [support vector machine] [classifiers] …
Topic model inference with phrase constraints
Example Topical Phrases: A Comparison

PDLDA [Lindsey et al. 12], Strategy 1 (3.72 hours):
  Topic 1: social networks, web search, time series, search engine, management system, real time, decision trees, …
  Topic 2: information retrieval, text classification, machine learning, support vector machines, information extraction, neural networks, text categorization, …

ToPMine [El-Kishky et al. 14], Strategy 3 (67 seconds):
  Topic 1: information retrieval, social networks, web search, search engine, information extraction, question answering, web pages, …
  Topic 2: feature selection, machine learning, semi supervised, large scale, support vector machines, active learning, face recognition, …
ToPMine: Experiments on DBLP Abstracts
ToPMine: Topics on Associate Press News (1989)
ToPMine: Experiments on Yelp Reviews
Efficiency: Running Time of Different Strategies
Running time: strategy 3 > strategy 2 > strategy 1 (“>” means outperforms)
Strategy 1: generate bag-of-words, then generate the token sequence
Strategy 2: after bag-of-words model inference, visualize topics with n-grams
Strategy 3: before bag-of-words model inference, mine phrases and impose them on the bag-of-words model
Coherence of Topics: Comparison of Strategies
Coherence measured by z‐score: strategy 3 > strategy 2 > strategy 1
Strategy 1: generate bag-of-words, then generate the token sequence
Strategy 2: after bag-of-words model inference, visualize topics with n-grams
Strategy 3: before bag-of-words model inference, mine phrases and impose them on the bag-of-words model
Phrase Intrusion: Comparison of Strategies
Phrase intrusion measured by average number of correct answers: strategy 3 > strategy 2 > strategy 1
Phrase Quality: Comparison of Strategies
Phrase quality measured by z-score: strategy 3 > strategy 2 > strategy 1
Mining Phrases: Why Not Use Raw-Frequency-Based Methods?
Traditional data-driven approaches: frequent pattern mining - if AB is frequent, AB is likely to be a phrase
Raw frequency does NOT reflect the quality of phrases; e.g., freq(vector machine) ≥ freq(support vector machine)
Need to rectify the frequency based on segmentation results: phrasal segmentation tells which words should be treated as a whole phrase and which remain unigrams
SegPhrase: From Raw Corpus to Quality Phrases and Segmented Corpus
[Diagram: Raw Corpus → Phrase Mining → Quality Phrases → Phrasal Segmentation → Segmented Corpus]
Document 1: Citation recommendation is an interesting but challenging research problem in data mining area.
Document 2: In this study, we investigate the problem in the context of heterogeneous information networks using data mining technique.
Document 3: Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications.
Build a candidate phrase set by frequent pattern mining: mine frequent k-grams, where k is typically small (e.g., 6 in our experiments)
Popularity measured by the raw frequency of words and phrases mined from the corpus
SegPhrase: The Overall Framework
ClassPhrase: frequent pattern mining, feature extraction, classification
SegPhrase: phrasal segmentation and phrase quality estimation
SegPhrase+: one more round to enhance mined phrase quality
What Kind of Phrases Are of "High Quality"? Judging the quality of phrases:
  Popularity: "information retrieval" vs. "cross-language information retrieval"
  Concordance: "powerful tea" vs. "strong tea"; "active learning" vs. "learning classification"
  Informativeness: "this paper" (frequent but not discriminative, not informative)
  Completeness: "vector machine" vs. "support vector machine"
Feature Extraction: Concordance
Partition a phrase into two parts to check whether their co-occurrence is significantly higher than pure chance
Ex: "support vector machine" vs. "this paper demonstrates"
Pointwise mutual information:
Pointwise KL divergence: the additional factor p(v) multiplied with pointwise mutual information leads to less bias towards rarely occurring phrases
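A sketch of the two concordance features above (the probabilities are toy values, and the exact feature definitions in SegPhrase differ in detail): pointwise KL divergence multiplies PMI by p(phrase), down-weighting phrases that are merely rare.

```python
from math import log

def concordance_features(p_phrase, p_left, p_right):
    # PMI: log p(phrase) / (p(left) * p(right));
    # PKL: PMI weighted by p(phrase), penalizing rare phrases
    pmi = log(p_phrase / (p_left * p_right))
    pkl = p_phrase * pmi
    return pmi, pkl

# "support vector machine" split into <"support vector", "machine">
pmi, pkl = concordance_features(8e-6, 1e-5, 2e-4)
print(round(pmi, 2))   # → 8.29, i.e., the parts co-occur far above chance
```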
Feature Extraction: Informativeness
Quality phrases typically start and end with a non-stopword
  "machine learning is" vs. "machine learning"
Use the average IDF over the words in the phrase to measure informativeness
A quality phrase usually appears with higher probability in quotes, in brackets, or connected by dashes (punctuation information)
“state‐of‐the‐art”
We can also incorporate features using some NLP techniques, such as POS tagging, chunking, and semantic parsing
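The two informativeness signals above can be sketched as follows (the stopword list and IDF values are toy placeholders, not real corpus statistics): quality phrases rarely start or end with a stopword, and a high average IDF signals informative words.

```python
from math import log

# toy stopword list; a real system would use a full list
STOPWORDS = {"is", "the", "a", "of", "in", "for"}

def informativeness(phrase, idf, n_docs):
    """Return (boundary check passes, average IDF over the phrase's words)."""
    words = phrase.split()
    boundary_ok = words[0] not in STOPWORDS and words[-1] not in STOPWORDS
    # unseen words default to the maximum possible IDF
    avg_idf = sum(idf.get(w, log(n_docs)) for w in words) / len(words)
    return boundary_ok, avg_idf

idf = {"machine": 3.2, "learning": 3.0, "is": 0.1}
print(informativeness("machine learning is", idf, 1000))  # ends with a stopword
print(informativeness("machine learning", idf, 1000))
```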
Classifier: Limited Training
Labels: Whether a phrase is a quality one or not
“support vector machine”: 1
“the experiment shows”: 0
For a ~1GB corpus, only 300 labels
Random Forest as the classifier: predicted phrase quality scores lie in [0, 1]
Bootstrap many different datasets from the limited labels
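A minimal stand-in for the setup above (SegPhrase uses a real random forest; this toy version bags one-feature threshold "stumps" trained on bootstrap resamples of a handful of labels, then averages their votes into a quality score in [0, 1]).

```python
import random

def train_stump(data):
    # data: list of (feature_vector, label); pick the (feature, threshold)
    # pair with the best training accuracy
    best = None
    for j in range(len(data[0][0])):
        for x, _ in data:
            t = x[j]
            acc = sum((xi[j] >= t) == y for xi, y in data) / len(data)
            if best is None or acc > best[0]:
                best = (acc, j, t)
    _, j, t = best
    return lambda x: x[j] >= t

def quality_score(x, stumps):
    # average the ensemble's votes into a score in [0, 1]
    return sum(s(x) for s in stumps) / len(stumps)

random.seed(0)
labeled = [((0.9, 0.8), 1), ((0.8, 0.7), 1), ((0.1, 0.3), 0), ((0.2, 0.1), 0)]
stumps = [train_stump([random.choice(labeled) for _ in labeled])
          for _ in range(25)]
print(quality_score((0.95, 0.9), stumps))   # a clearly good phrase scores 1.0
```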
SegPhrase: Why Do We Need Phrasal Segmentation in Corpus? Phrasal segmentation can tell which phrase is more appropriate Ex: A standard [feature vector] [machine learning] setup is used to describe...
Rectified phrase frequency (expected influence): occurrences not chosen by the segmentation (e.g., "vector machine" inside "[feature vector] [machine learning]") are not counted towards the rectified frequency
SegPhrase: Segmentation of Phrases
Partition a sequence of words by maximizing the likelihood, considering:
  Phrase quality score: ClassPhrase assigns a quality score to each phrase
  Probability in the corpus
  Length penalty: when the penalty factor is less than 1, it favors shorter phrases
Filter out phrases with low rectified frequency: bad phrases are expected to rarely occur in the segmentation results
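The likelihood-maximizing segmentation above can be sketched with dynamic programming over split points (the quality scores and penalty weight below are toy values; the real SegPhrase objective differs in detail).

```python
from math import log

def segment(tokens, score, max_len=4):
    """Maximize the summed segment scores over all segmentations."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)   # best[i]: best score of tokens[:i]
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            s = best[j] + score(" ".join(tokens[j:i]))
            if s > best[i]:
                best[i], back[i] = s, j
    parts, i = [], n                   # recover the segmentation
    while i > 0:
        parts.append(" ".join(tokens[back[i]:i]))
        i = back[i]
    return parts[::-1]

# toy quality scores; unknown phrases get a near-zero quality
quality = {"support vector machine": 0.9, "support vector": 0.7,
           "vector machine": 0.1, "a": 0.5, "standard": 0.5,
           "support": 0.3, "vector": 0.3, "machine": 0.3, "setup": 0.5}

def score(phrase, penalty=0.4):
    # log quality plus a per-extra-word length penalty
    return log(quality.get(phrase, 1e-6)) - penalty * (len(phrase.split()) - 1)

print(segment("a standard support vector machine setup".split(), score))
# → ['a', 'standard', 'support vector machine', 'setup']
```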
SegPhrase+: Enhancing Phrasal Segmentation
SegPhrase+: one more round for enhanced phrasal segmentation
Feedback: using rectified frequencies, re-compute those features previously computed from raw frequencies
Process: classification → phrasal segmentation (SegPhrase) → classification → phrasal segmentation (SegPhrase+)
Effects on computed quality scores, e.g., "np hard in the strong sense" vs. "np hard in the strong"; "data base management system"
Performance Study: Methods to Be Compared
Other phrase mining methods to be compared:
NLP chunking-based methods: chunks as candidates, sorted by TF-IDF and C-value (K. Frantzi et al., 2000)
Unsupervised raw-frequency-based methods: ConExtr (A. Parameswaran et al., VLDB 2010), ToPMine (A. El-Kishky et al., VLDB 2015)
Supervised method: KEA, designed for single-document keyphrases (O. Medelyan & I. H. Witten, 2006)
Performance Study: Experimental Setting
Datasets:
  Dataset | #docs | #words | #labels
  DBLP    | 2.77M | 91.6M  | 300
  Yelp    | 4.75M | 145.1M | 300
Popular Wiki phrases: based on internal links; ~7K high-quality phrases
Pooling: sampled 500 * 7 Wiki-uncovered phrases; evaluated by 3 reviewers independently
Performance: Precision-Recall Curves on DBLP
Compared with other baselines: TF-IDF, C-Value, ConExtr, KEA, ToPMine, SegPhrase+
Compared with our 3 variations: TF-IDF, ClassPhrase, SegPhrase, SegPhrase+
Performance Study: Processing Efficiency
SegPhrase+'s running time is linear in the size of the corpus!
Experimental Results: Interesting Phrases Generated (from the titles and abstracts of SIGMOD)

Query: SIGMOD
Rank | SegPhrase+           | Chunking (TF-IDF & C-Value)
1    | data base            | data base
2    | database system      | database system
3    | relational database  | query processing
4    | query optimization   | query optimization
5    | query processing     | relational database
…    | …                    | …
51   | sql server           | database technology
52   | relational data      | database server
53   | data structure       | large volume
54   | join query           | performance study
55   | web service          | …
…    | …                    | …
Ranks 201-205, only in SegPhrase+: high dimensional data, location based service, xml schema, two phase locking, deep web
Ranks 201-205, only in Chunking: efficient implementation, sensor network, large collection, important issue, frequent itemset
Experimental Results: Interesting Phrases Generated (from the titles and abstracts of SIGKDD)

Query: SIGKDD
Rank | SegPhrase+              | Chunking (TF-IDF & C-Value)
1    | data mining             | data mining
2    | data set                | association rule
3    | association rule        | knowledge discovery
4    | knowledge discovery     | frequent itemset
5    | time series             | decision tree
…    | …                       | …
51   | association rule mining | search space
52   | rule set                | domain knowledge
53   | concept drift           | important problem
54   | knowledge acquisition   | concurrency control
55   | gene expression data    | conceptual graph
…    | …                       | …
Ranks 201-205, only in SegPhrase+: web content, frequent subgraph, intrusion detection, categorical attribute, user preference
Ranks 201-205, only in Chunking: optimal solution, semantic relationship, effective way, space complexity, small set
Experimental Results: Similarity Search
In response to a user's phrase query, SegPhrase+ finds high-quality, semantically similar phrases
In DBLP, query on "data mining" and "OLAP"; in Yelp, query on "blu-ray", "noodle", and "valet parking"
Recent Progress
Distant training: no need for human labeling; train using general knowledge bases, e.g., Freebase, Wikipedia
Quality estimation for unigrams: integrate phrases and unigrams in one uniform framework; demo based on DBLP abstracts
Multiple languages: beyond English corpora; extensible to mining quality phrases in multiple languages; SegPhrase+ works on Chinese, Arabic and Spanish
High-Quality Phrases Generated from Chinese Wikipedia

Rank | Phrase                         | In English
…    | …                              | …
62   | 首席_执行官                     | CEO
63   | 中间_偏右                       | Center-right
…    | …                              | …
84   | 百度_百科                       | Baidu Baike
85   | 热带_气旋                       | Tropical cyclone
86   | 中国科学院_院士                 | Fellow of the Chinese Academy of Sciences
…    | …                              | …
1001 | 十大_中文_金曲                  | Top-10 Chinese Golden Songs
1002 | 全球_资讯网                     | Global News Website
1003 | 天一阁_藏_明代_科举_录_选刊     | (a Chinese book title)
…    | …                              | …
9934 | 国家_戏剧_院                    | National Theater
9935 | 谢谢_你                         | Thank you
Top-Ranked Phrases Generated from English Gigaword
Northrop Grumman, Ashfaq Kayani, Sania Mirza, Pius Xii, Shakhtar Donetsk, Kyaw Zaw Lwin
Ratko Mladic, Abdolmalek Rigi, Rubin Kazan, Rajon Rondo, Rubel Hossain, bluefin tuna
Psv Eindhoven, Nicklas Bendtner, Ryo Ishikawa, Desmond Tutu, Landon Donovan, Jannie du Plessis
Zinedine Zidane, Uttar Pradesh, Thor Hushovd, Andhra Pradesh, Jafar_Panahi, Marouane Chamakh
Rahm Emanuel, Yakubu Aiyegbeni, Salva Kiir, Abdelhamid Abou Zeid, Blaise Compaore, Rickie Fowler
Andry Rajoelina, Merck Kgaa, Js Kabylie, Arjun Atwal, Andal Ampatuan Jnr, Reggio Calabria, Ghani Baradar
Mahela Jayawardene, Jemaah Islamiyah, quantitative easing, Nodar Kumaritashvili, Alviro Petersen
Rumiana Jeleva, Helio Castroneves, Koumei Oda, Porfirio Lobo, Anastasia Pavlyuchenkova
Thaksin Shinawatra, Evgeni_Malkin, Salvatore Sirigu, Edoardo Molinari, Yoshito Sengoku
Otago Highlanders, Umar Akmal, Shuaibu Amodu, Nadia Petrova, Jerzy Buzek, Leonid Kuchma,
Alona Bondarenko, Chosun Ilbo, Kei Nishikori, Nobunari Oda, Kumbh Mela, Santo_Domingo
Nicolae Ceausescu, Yoann Gourcuff, Petr Cech, Mirlande Manigat, Sulieman Benn, Sekouba Konate
Summary and Future Work
Strategy 1 (generate bag-of-words, then generate the token sequence): integrated complex model; phrase quality and topic inference rely on each other; slow and prone to overfitting
Strategy 2 (after bag-of-words model inference, visualize topics with n-grams): phrase quality relies on topic labels for unigrams; can be fast; generally high-quality topics and phrases
Strategy 3 (before bag-of-words model inference, mine phrases and impose them on the bag-of-words model): topic inference relies on correct segmentation of documents, but is not sensitive to it; can be fast; generally high-quality topics and phrases
Summary and Future Work (Cont’d)
SegPhrase+: A new phrase mining framework
Integrating phrase mining with phrasal segmentation
Requires only limited training or distant training
Generates high‐quality phrases, close to human judgement
Linearly scalable in time and space
Looking forward: High‐quality, scalable phrase mining
Facilitate entity recognition and typing in large corpora (See the next part of this tutorial)
Combine with linguistic‐rich patterns
Transform massive unstructured data into semi‐structured knowledge networks
References for Theme 1: Phrase Mining for Concept Extraction
D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi‐Word Expressions, arXiv:0907.1013, 2009
K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition: Exploiting On‐Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991
M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM'14
A. El‐Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB’15
K. Frantzi, S. Ananiadou, and H. Mima, Automatic Recognition of Multi‐Word Terms: the c‐value/nc‐value Method. Int. Journal on Digital Libraries, 3(2), 2000
R. V. Lindsey, W. P. Headden III, M. J. Stipicevic. A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes. EMNLP-CoNLL'12
J. Liu, J. Shang, C. Wang, X. Ren, J. Han, Mining Quality Phrases from Massive Text Corpora. SIGMOD’15
O. Medelyan and I. H. Witten, Thesaurus Based Automatic Keyphrase Indexing. IJCDL’06
Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models, KDD’07
A. Parameswaran, H. Garcia‐Molina, and A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from Large Datasets. VLDB’10
X. Wang, A. McCallum, X. Wei. Topical n‐grams: Phrase and topic discovery, with an application to information retrieval, ICDM’07
Theme II: Entity Extraction and Linking by Relational Graph Construction and Propagation
Session 1. Task and Traditional NLP Approach
Entity Extraction and Linking: identifying token spans as entity mentions in documents and labeling their types
Target types: FOOD, LOCATION, JOB_TITLE, EVENT, ORGANIZATION, …
The best BBQ I’ve tasted in Phoenix! I had the pulled pork sandwich with coleslaw and baked beans for lunch. ... The owner is very nice. …
The best BBQ:Food I’ve tasted in Phoenix:LOC ! I had the [pulled pork sandwich]:Food with coleslaw:Food and [baked beans]:Food for lunch. … The owner:JOB_TITLE is very nice. …
Plain text → text with typed entities
Enabling structured analysis of an unstructured text corpus
Traditional NLP Approach: Data Annotation
Extracted and linked entities serve a variety of uses: primitives for information extraction and knowledge base population, assisting question answering, …
Traditional named entity recognition systems are designed for major types (e.g., PER, LOC, ORG) and general domains (e.g., news)
Adapting to new domains/types requires additional steps:
  Expensive human annotation: 500 documents for entity extraction; 20,000 queries for entity linking
  Unsatisfying agreement due to various granularity levels and scopes of types
Entities obtained by entity linking techniques have limited coverage and freshness:
  >50% unlinkable entity mentions in a Web corpus [Lin et al., EMNLP'12]
  >90% in our experiment corpora: tweets, Yelp reviews, …
Traditional NLP Approach: Feature Engineering
Typical entity extraction features (Li et al., 2012):
  N-gram: unigram, bigram and trigram token sequences in the context window
  Part-of-speech: POS tags of the context words
  Gazetteers: person names, organizations, countries and cities, titles, idioms, etc.
  Word clusters: word clusters / embeddings
  Case and shape: capitalization and morphology analysis based features
  Chunking: NP and VP chunking tags
  Global features: sentence-level and document-level structure/position features
Traditional NLP Approach: Feature Engineering
Typical entity linking features (Ji et al., 2011), grouped by mention/concept attribute:
  Name - spelling match: exact string match, acronym match, alias match, string matching, …
  Name - KB link mining: name pairs mined from KB text, redirect and disambiguation pages
  Name - gazetteer: organization and geo-political entity abbreviation gazetteers
  Document surface - lexical: words in KB facts, KB text, mention name, mention text; tf.idf of words and ngrams
  Document surface - position: mention name appears early in KB text
  Document surface - genre: genre of the mention text (newswire, blog, …)
  Entity context - local context: lexical and part-of-speech tags of context words
  Entity context - type: mention concept type, subtype
  Entity context - relation/event: concepts co-occurring, attributes/relations/events with the mention
  Entity context - coreference: coreference links between the source document and the KB text
  Profiling: slot fills of the mention, concept attributes stored in the KB infobox
  Concept: ontology extracted from KB text
  Topic: topics (identity and lexical similarity) for the mention text and KB text
  KB link mining: attributes extracted from hyperlink graphs of the KB text
  Popularity - web: top KB text ranked by a search engine and its length
  Popularity - frequency: frequency in KB texts
Session 2. ClusType: Entity Recognition and Typing by Relation Phrase-Based Clustering
Entity Recognition and Typing: A Data Mining Solution
Acquire labels for a small number of instances
Construct a relational graph to connect labeled and unlabeled instances
  Construct edges based on coarse-grained data-driven statistics instead of fine-grained linguistic similarity: mention correlation, text co-occurrence, semantic relatedness based on knowledge-graph embeddings, social networks
Propagate labels across the graph
Case Study 1: Entity Extraction
Goal: recognizing entity mentions of target types with minimal/no human supervision and with no requirement that entities can be found in a KB.
Two kinds of efforts towards this goal:
Weak supervision: relies on manually selected seed entities, applying pattern-based bootstrapping or label propagation to identify more entities; both assume seeds are unambiguous and sufficiently frequent, which requires careful seed selection by a human
Distant supervision: leverages entity information in KBs to reduce human supervision
Typical Workflow of Distant Supervision
1. Detect entity mentions from text
2. Map candidate mentions to KB entities of target types
3. Use confidently mapped {mention, type} pairs to infer the types of the remaining candidate mentions
Problem Definition
Distantly-supervised entity recognition in a domain-specific corpus
Given: a corpus D; a knowledge base (e.g., Freebase); a set of target types T from the KB
Detect candidate entity mentions from corpus D
Categorize each candidate mention by target types or Not-Of-Interest (NOI), with distant supervision
Challenge I: Domain Restriction
Most existing work assumes entity mentions are already extracted by existing entity detection tools, e.g., noun phrase chunkers
  Usually trained on general-domain corpora like news articles (clean, grammatical)
  Make use of various linguistic features (e.g., semantic parsing structures)
  Do not work well on specific, dynamic, or emerging domains (e.g., tweets, Yelp reviews); e.g., "in-and-out" in a Yelp review may not be properly detected
Challenge II: Name Ambiguity
Multiple entities may share the same surface name
While Griffin is not the part of Washington's plan on Sunday's game, … → sports team
… has concern that Kabul is an ally of Washington. → U.S. government
He has office in Washington, Boston and San Francisco. → U.S. capital city
Previous methods simply output a single type (or type distribution) per surface name, instead of an exact type for each entity mention
  While Griffin is not the part of Washington's plan on Sunday's game, …
  … news from Washington indicates that the congress is going to …
  It is one of the best state parks in Washington.
All of these "Washington" mentions would receive one shared distribution over {sports team, government, state} rather than per-mention types
Challenge III: Context Sparsity
A variety of contextual clues are leveraged to find sources of shared semantics across different entities Keywords, Wiki concepts, linguistic patterns, textual relations, …
There are often many ways to describe even the same relation between two entities:
ID | Sentence | Freq
1  | The magnitude 9.0 quake caused widespread devastation in [Kesennuma city] | 12
2  | … tsunami that ravaged [northeastern Japan] last Friday | 31
3  | The resulting tsunami devastate [Japan]'s northeast | 244
Previous methods have difficulty handling entity mentions with sparse (infrequent) context
Our Solution
Domain-agnostic phrase mining algorithm: extracts candidate entity mentions with minimal linguistic assumptions (e.g., only part-of-speech tagging), addressing domain restriction (e.g., "Miami Heat")
Incorporate user interests
Non-linkable entity mention recognition and clustering
[Figure: error distribution]
Transformation-Based Graph Querying
Transformation-based graph querying produces many results; how do we suggest the "best" results to users?
Different transformations should be weighted differently
  Ex: given a single-node query "Geoffrey Hinton", nodes matching "G. Hinton" (abbreviation transformation) shall be ranked higher than nodes matching "Hinton" (last-token transformation)
How are the weights determined? They shall be learned
Evaluating a candidate match φ with a ranking function:
  Node matching feature: F_V(v, φ(v)) = Σ_i α_i f_i(v, φ(v))
  Edge matching feature: F_E(e, φ(e)) = Σ_j β_j g_j(e, φ(e))
  Matching score: P(φ(Q) | Q) ∝ exp( Σ_{v ∈ V_Q} F_V(v, φ(v)) + Σ_{e ∈ E_Q} F_E(e, φ(e)) )
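The ranking function above can be sketched directly (the feature values and weights below are toy stand-ins, not learned): sum the weighted node- and edge-matching features over the query graph, then exponentiate to get an unnormalized matching score.

```python
from math import exp

def matching_score(node_feats, edge_feats, alpha, beta):
    # F_V and F_E are weighted sums of feature functions; the score is
    # exp(F_V + F_E), an unnormalized log-linear model
    fv = sum(a * f for a, f in zip(alpha, node_feats))
    fe = sum(b * g for b, g in zip(beta, edge_feats))
    return exp(fv + fe)

# single-node query "Geoffrey Hinton": feature 0 fires for an abbreviation
# match ("G. Hinton"), feature 1 for a last-token match ("Hinton")
weights = [2.0, 0.5]
abbrev_match = matching_score([1.0, 0.0], [], weights, [])
last_token_match = matching_score([0.0, 1.0], [], weights, [])
print(abbrev_match > last_token_match)   # abbreviation match ranks higher
```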
Potential Challenges in Applying Graph Indices + Search to NLP The graphs automatically constructed from natural language texts might:
Include many more diverse types and weights
Include more ambiguity
Knowledge sparsity; hard to generalize unique nodes/substructures
Include more noise and errors
Mining Frequent (Sub)Graph Patterns
Data mining research has developed many scalable graph pattern mining methods
Given a labeled graph dataset D = {G1, G2, …, Gn}, the supporting graph set of a subgraph g is Dg = {Gi | g ⊆ Gi, Gi ∈ D}
support(g) = |Dg| / |D|; a (sub)graph g is frequent if support(g) ≥ min_sup
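A toy illustration of the support definition above. Real frequent-subgraph miners (e.g., gSpan) test genuine subgraph isomorphism; here each graph is reduced to its set of labeled undirected edges, so "g is a subgraph of Gi" is approximated by edge-set containment, just enough to show how support is computed.

```python
def edge_set(edges):
    # represent an undirected labeled edge as a frozenset of endpoint labels
    return frozenset(frozenset(e) for e in edges)

def support(g, dataset):
    # fraction of graphs in the dataset that contain all of g's edges
    return sum(1 for G in dataset if g <= G) / len(dataset)

D = [edge_set([("C", "O"), ("C", "N"), ("N", "H")]),   # graph A
     edge_set([("C", "O"), ("C", "N")]),               # graph B
     edge_set([("C", "S"), ("S", "N")])]               # graph C
g = edge_set([("C", "O"), ("C", "N")])

# g occurs in A and B: support = 2/3, frequent if min_sup <= 2/3
print(support(g, D))
```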
[Figure: chemical-structure example. A dataset of three molecular graphs (A), (B), (C); with min_sup = 2, frequent subgraph patterns are mined, e.g., pattern (1) with support 67%]
Alternative: mining frequent subgraph patterns from a single large graph or network; documents can be viewed as graph structures as well
Efficient Graph Pattern Mining Methods
Effective Graph pattern mining methods have been developed for different scenarios Mining graph “transaction” datasets
Ex. gSpan (Yan et al, ICDM’02), CloseGraph (Yan et al., KDD’03), … Mining frequent large subgraph structures in single massive network
Ex. Spider‐Mine (Zhu, et al., VLDB’11)
Graph pattern mining forms building blocks for graph classification, clustering, compression, comparison, and correlation analysis
Graph indexing and graph similarity search: gIndex (SIGMOD'05) performs graph indexing by graph pattern mining, helping precise as well as similarity-based graph query answering
Mining Collaboration Patterns in DBLP Networks
Data description: 600 top conferences, 9 major CS areas, 15,071 authors in DB/DM
Authors labeled by the number of papers published in DB/DM: Prolific (P): ≥ 50; Senior (S): 20-49; Junior (J): 10-19; Beginner (B): 5-9
[Figure: collaboration patterns found, each with matching instances in the data]
Applications of Graph Pattern Mining
Bioinformatics: gene networks, protein interactions, metabolic pathways
Chem‐informatics: Mining chemical compound structures
Social networks, web communities, tweets, …
Cell phone networks, computer networks, …
Web graphs, XML structures, semantic Web, information networks
Software engineering: Program execution flow analysis
Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
Graph indexing and graph similarity search
Logistic Regression-Based Prediction Model
Training and test pairs: author pairs described by meta-path features, labeled by whether the target relationship forms

A-P-A-P-A | A-P-V-P-A | A-P-T-P-A | A-P->P-A | A-P-A (label)
4         | 5         | 100       | 3        | Yes = 1
0         | 1         | 20        | 2        | No = 0
Logistic Regression Model
Model the probability of each relationship as P(Y = 1 | x) = e^{xβ} / (1 + e^{xβ}), where β is the coefficient vector for the features (including a constant 1)
MLE estimation: maximize the likelihood of observing all the relationships in the training data
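A toy sketch of the model above (the data, learning rate and epoch count are made up; plain stochastic gradient ascent stands in for a proper MLE solver). Each row is an author pair's meta-path feature vector; the label says whether the A-P-A (co-author) relationship forms.

```python
import math

def sigmoid(z):
    # numerically safe logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit(xs, ys, lr=0.001, epochs=4000):
    # stochastic gradient ascent on the log-likelihood
    beta = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
            beta = [b + lr * (y - p) * xi for b, xi in zip(beta, x)]
    return beta

# columns: constant 1, A-P-A-P-A, A-P-V-P-A, A-P-T-P-A, A-P->P-A
xs = [[1, 4, 5, 100, 3], [1, 0, 1, 20, 2], [1, 3, 4, 80, 2], [1, 0, 0, 10, 1]]
ys = [1, 0, 1, 0]
beta = fit(xs, ys)
predict = lambda x: sigmoid(sum(b * xi for b, xi in zip(beta, x)))
```

After training, `predict` returns P(Y = 1 | x), high for pairs with many connecting meta-path instances and low otherwise.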
When Will It Happen?
From "whether" to "when"
  "Whether": will Jim rent the movie "Avatar" on Netflix? Output: P(X = 1)
  "When": when will Jim rent "Avatar"? Output: a distribution over time!
    What is the probability Jim will rent "Avatar" within 2 months? P(Y ≤ 2)
    By when will Jim have rented "Avatar" with 90% probability? The t such that P(Y ≤ t) = 0.9
    What is the expected time it will take for Jim to rent "Avatar"? E[Y]
May provide useful information to supply-chain management
The Relationship-Building-Time Prediction Model
Solution: directly model the relationship building time P(Y = t): geometric, exponential, or Weibull distribution; use a generalized linear model
Deal with censoring (the relationship builds beyond the observed time T: right-censored)
Generalized linear model under the Weibull distribution assumption; training framework
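The three "when" questions above, answered under a toy Weibull(k, λ) model of relationship-building time (the parameters below are invented, not fitted to data). The CDF is P(Y ≤ t) = 1 − exp(−(t/λ)^k).

```python
from math import exp, log, gamma

k, lam = 1.5, 4.0                # shape, scale (months); toy values

def p_within(t):                 # P(Y <= t)
    return 1.0 - exp(-((t / lam) ** k))

def quantile(q):                 # smallest t with P(Y <= t) = q
    return lam * (-log(1.0 - q)) ** (1.0 / k)

expected_time = lam * gamma(1.0 + 1.0 / k)   # E[Y] for a Weibull

print(round(p_within(2), 3))     # probability of renting within 2 months
print(round(quantile(0.9), 2))   # month by which rental happens with prob 0.9
print(round(expected_time, 2))   # expected waiting time
```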