slides in pdf

January 11, 2018 | Author: Anonymous | Category: N/A

Share Embed

Report this link

Short Description

Download slides in pdf...

Description

7/26/2015

Successful Data Mining Methods for NLP Jiawei Han (UIUC), Heng Ji (RPI), Yizhou Sun (NEU) http://hanj.cs.illinois.edu/slides/dmnlp15.pptx[pdf] http://nlp.cs.rpi.edu/paper/dmnlp15.pptx[pdf]

1

Introduction: Where do NLP and DM Meet?

1

7/26/2015

Slightly Different Research Philosophies  NLP:  Deep understanding of individual words, phrases and sentences (“micro‐level”); focus on unstructured text data



Data Mining (DM): High‐level (statistical) understanding, discovery and synthesis of the most salient information (“Macro‐ level”); historically more on structured and semi‐structured data Related to “Health Care Bill”



NewsNet (Tao et al., 2014)

3

DM Solution: Data to Networks to Knowledge (D2N2K) Advantages of NLP  Construct graphs/networks with fine‐grained semantics from unstructured texts  Use large‐scale annotations for real‐world data  Advantages of DM: Deep understanding through structured/correlation inference  Using a structured representation (e.g., graph, network) as a bridge to capture interactions between NLP and DM  Example: Heterogeneous Information Networks [Han et al., 2010; Sun et al., 2012] 

4

Data                                                                    Networks

Knowledge

2

7/26/2015

A Promising Direction: Integrating DM and NLP Major theme of this tutorial  Applying novel DM methods to solve traditional NLP problems  Integrating DM and NLP, transforming Data to Networks to Knowledge  Road Map of this tutorial  Effective Network Construction by Leveraging Information Redundancy  Theme I: Phrase Mining and Topic Modeling from Large Corpora  Theme II: Entity Extraction and Linking by Relational Graph Construction  Mining Knowledge from Structured Networks  Theme III: Search and Mining Structured Graphs and Heterogeneous Networks  Looking forward to the Future 

5

Theme I: Phrase Mining and Topic Modeling from Large Corpora

6

3

7/26/2015

Why Phrase Mining? 

Phrase: Minimal, unambiguous semantic unit; basic building block for information network and knowledge base



Unigrams vs. phrases  Unigrams (single words) are ambiguous  Example: “United”: United States? United Airline? United Parcel Service?  Phrase:  A natural, meaningful, unambiguous semantic unit  Example:  “United States” vs. “United Airline”



Mining semantically meaningful phrases  Transform text data from word granularity to phrase granularity  Enhance the power and efficiency at manipulating unstructured data using

database technology 7

Mining Phrases: Why Not Just Use NLP Methods? 

Phrase mining:  Originated from the NLP community—“Chunking” Model it as a sequence labeling problem (B‐NP, I‐NP, O, …)





Need annotation and training Annotate hundreds of POS tagged documents as training data Train a supervised model based on part‐of‐speech features

 



Recent trend: Use distributional features based on web n‐grams (Bergsma et al., 2010) State‐of‐the‐art Performance: ~95% accuracy, ~88% phrase‐level F‐score

 



Limitations High annotation cost, not scalable to a new language, domain or genre May not fit domain‐specific, dynamic, emerging applications

   

Scientific domains, query logs, or social media, e.g., Yelp, Twitter Use only local features, no ranking, no links to topics

8

4

7/26/2015

Data Mining Approaches for Phrase Mining General principle: Corpus‐based; fully exploit information redundancy and data‐ driven criteria to determine phrase boundaries and salience; using local evidence to adjust corpus‐level data statistics  Phrase Mining and Topic Modeling from Large Corpora  Strategy 1: Simultaneously Inferring Phrases and Topics 



Strategy 2: Post Topic Modeling Phrase Construction

 

Label topic [Mei et al.’07], TurboTopic [Blei & Lafferty’09], KERT [Danilevsky, et al.’14]

Strategy 3: First Phrase Mining then Topic Modeling:

 



Bigram topical model [Wallach’06], topical n‐gram model [Wang, et al.’07], phrase discovering topic model [Lindsey, et al.’12]

ToPMine [El‐kishky, et al., VLDB’15]

Integration of Phrase Mining with Document Segmentation 

SegPhrase [Liu, et al., SIGMOD’15]

9

Strategy 1: Simultaneously Inferring Phrases and Topics Bigram Topic Model [Wallach’06]  Probabilistic generative model that conditions on previous word and topic when drawing next word  Topical N‐Grams (TNG) [Wang, et al.’07]  Probabilistic model that generates words in textual order  Create n‐grams by concatenating successive bigrams (a generalization of Bigram Topic Model)  Phrase‐Discovering LDA (PDLDA) [Lindsey, et al.’12]  Viewing each sentence as a time‐series of words, PDLDA posits that the generative parameter (topic) changes periodically  Each word is drawn based on previous m words (context) and current phrase topic  High model complexity: Tends to overfitting; High inference cost: Slow 

10

5

7/26/2015

Strategy 2: Post Topic Modeling Phrase Construction 

TurboTopics [Blei & Lafferty’09] – Phrase construction as a post‐processing step to Latent Dirichlet Allocation



Perform Latent Dirichlet Allocation on corpus to assign each token a topic label



Merge adjacent unigrams with the same topic label by a distribution‐free permutation test on arbitrary‐length back‐off model



End recursive merging when all significant adjacent unigrams have been merged

KERT [Danilevsky et al.’14] – Phrase construction as a post‐processing step to Latent Dirichlet Allocation

 

Perform frequent pattern mining on each topic



Perform phrase ranking based on four different criteria

11

Example of TurboTopics

  12



Perform LDA on corpus to assign each token a topic label E.g., … phase11 transition11 …. game153 theory127 … Then merge adjacent unigrams with same topic label

6

7/26/2015

Framework of KERT 1. Run bag‐of‐words model inference and assign topic label to each token 2. Extract candidate keyphrases within each topic

Frequent pattern mining

3. Rank the keyphrases in each topic

Comparability property:  directly compare phrases of mixed lengths kpRel [Zhao et al. 11]

learning

KERT (‐popularity)

KERT (‐discriminativeness)

KERT (‐concordance)

KERT [Danilevsky et al. 14]

effective

support vector machines

learning

learning

classification

text

feature selection

classification

support vector machines

selection

probabilistic

reinforcement learning

selection

reinforcement learning

13

models

identification

conditional random fields

feature

feature selection

algorithm

mapping

constraint satisfaction

decision

conditional random fields

features

task

decision trees

bayesian

classification

decision

planning

dimensionality reduction

trees

decision trees

:

:

:

:

:

Strategy 3: First Phrase Mining then Topic Modeling 

ToPMine [El‐Kishky et al. VLDB’15]



First phrase construction, then topic mining



Contrast with KERT: topic modeling, then phrase mining



The ToPMine Framework:



Perform frequent contiguous pattern mining to extract candidate phrases and their counts



Perform agglomerative merging of adjacent unigrams as guided by a significance score—This segments each document into a “bag‐of‐phrases”



The newly formed bag‐of‐phrases are passed as input to PhraseLDA, an extension of LDA, that constrains all words in a phrase to each sharing the same latent topic

14

7

7/26/2015

Why First Phrase Mining then Topic Modeling ? With Strategy 2, tokens in the same phrase may be assigned to different topics  Ex. knowledge discovery using least squares support vector machine classifiers…  Knowledge discovery and support vector machine should have coherent topic labels  Solution: switch the order of phrase mining and topic model inference 

[knowledge discovery] using [least squares] [support vector machine] [classifiers] …

[knowledge discovery] using [least squares] [support vector machine] [classifiers] …

Techniques  Phrase mining and document segmentation  Topic model inference with phrase constraint



15

Phrase Mining: Frequent Pattern Mining + Statistical Analysis Quality phrases

Based on significance score [Church et al.’91]:

α(P1, P2) ≈ (f(P1●P2)  ̶ µ0(P1,P2))/√ f(P1●P2) [Markov blanket] [feature selection] for [support vector machines] [knowledge discovery] using [least squares] [support vector machine] [classifiers] …[support vector] for [machine learning]…

Raw freq.

True freq.

[support vector machine]

90

80

[vector machine]

95

0

[support vector]

100

20

Phrase

16

8

7/26/2015

Collocation Mining  







Collocation: A sequence of words that occur more frequently than expected Often “interesting” and due to their non‐compositionality, often relay information not portrayed by their constituent terms (e.g., “made an exception”, “strong tea”) Many different measures used to extract collocations from a corpus [Dunning 93, Pederson 96] E.g., mutual information, t‐test, z‐test, chi‐squared test, likelihood ratio

Many of these measures can be used to guide the agglomerative phrase‐ segmentation algorithm

17

ToPMine: Phrase LDA (Constrained Topic Modeling) The generative model for PhraseLDA is the same as LDA  Difference: the model incorporates constraints obtained from the “bag‐of‐phrases” input  Chain‐graph shows that all words in a phrase are constrained to take on the same topic values 

[knowledge discovery] using [least squares] [support vector machine] [classifiers] …

Topic model inference with phrase constraints 18

9

7/26/2015

Example Topical Phrases: A Comparison social networks information retrieval web search text classification time series machine learning search engine support vector machines management system information extraction real time neural networks decision trees text categorization : : Topic 1

Topic 2

PDLDA [Lindsey et al. 12] Strategy 1 (3.72 hours)

information retrieval

feature selection

social networks web search search engine information extraction

machine learning semi supervised large scale support vector machines

question answering

active learning

web pages

face recognition

:

:

Topic 1

Topic 2

ToPMine [El‐kishky et al. 14] Strategy 3 (67 seconds)

19

ToPMine: Experiments on DBLP Abstracts

20

10

7/26/2015

ToPMine: Topics on Associate Press News (1989)

21

ToPMine: Experiments on Yelp Reviews

22

11

7/26/2015

Efficiency: Running Time of Different Strategies

Running time: strategy 3 > strategy 2 > strategy 1  (“>” means outperforms)   

Strategy 1: Generate bag‐of‐words → generate sequence of tokens Strategy 2: Post bag‐of‐words model inference, visualize topics with n‐grams Strategy 3: Prior bag‐of‐words model inference, mine phrases and impose to the bag‐of‐words model

23

Coherence of Topics: Comparison of Strategies

Coherence measured by z‐score: strategy 3 > strategy 2 > strategy 1   

Strategy 1: Generate bag‐of‐words → generate sequence of tokens Strategy 2: Post bag‐of‐words model inference, visualize topics with n‐grams Strategy 3: Prior bag‐of‐words model inference, mine phrases and impose to the bag‐of‐words model

24

12

7/26/2015

Phrase Intrusion: Comparison of Strategies

Phrase intrusion measured by average number of correct answers: strategy 3 > strategy 2 > strategy 1 25

Phrase Quality: Comparison of Strategies

Phrase quality measured by z‐score: strategy 3 > strategy 2 > strategy 1 26

13

7/26/2015

Mining Phrases: Why Not Use Raw Frequency Based Methods? Traditional data‐driven approaches



Frequent pattern mining

 

If AB is frequent, likely AB could be a phrase

Raw frequency could NOT reflect the quality of phrases

 

E.g., freq(vector machine) ≥ freq(support vector machine)



Need to rectify the frequency based on segmentation results Phrasal segmentation will tell

 

Some words should be treated as a whole phrase whereas others are still unigrams

27

SegPhrase: From Raw Corpus to Quality Phrases and Segmented Corpus Raw Corpus

Quality Phrases

Segmented Corpus

Document 1 Citation recommendation is an interesting but challenging research problem in data mining area. Document 2 In this study, we investigate the problem in the context of heterogeneous information networks using data mining technique. Document 3 Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications.

Phrase Mining Input Raw Corpus



E.g., freq(vector machine) ≥ freq(support vector machine)



Need to rectify the frequency based on segmentation results





Segmented Corpus

Raw frequency could NOT reflect the quality of phrases





28

Phrasal Segmentation

Quality Phrases

Build a candidate phrase set by frequent pattern mining Mining frequent k‐grams;  k is typically small, e.g. 6 in our experiments

Popularity measured by raw frequent words and phrases mined from the corpus

14

7/26/2015

SegPhrase: The Overall Framework   

ClassPhrase:  Frequent pattern mining, feature extraction, classification SegPhrase: Phrasal segmentation and phrase quality estimation SegPhrase+: One more round to enhance mined phrase quality

ClassPhrase

SegPhrase(+)

29

What Kind of Phrases Are of “High Quality”? Judging the quality of phrases  Popularity  “information retrieval” vs. “cross‐language information retrieval”  Concordance  “powerful tea” vs. “strong tea”  “active learning” vs. “learning classification”  Informativeness  “this paper” (frequent but not discriminative, not informative)  Completeness  “vector machine” vs. “support vector machine”



30

15

7/26/2015

Feature Extraction: Concordance Partition a phrase into two parts to check whether the co‐occurrence is significantly higher than pure random

 

support vector   machine                this paper   demonstrates



Pointwise mutual information:



Pointwise KL divergence: The additional p(v) is multiplied with pointwise mutual information, leading to less bias towards rare‐occurred phrases



31

Feature Extraction: Informativeness Deriving Informativeness



Quality phrases typically start and end with a non‐stopword

 

“machine learning is” vs. “machine learning”



Use average IDF over words in the phrase to measure the semantics



Usually, the probabilities of a quality phrase in quotes, brackets, or connected by dash should be higher (punctuation information)

 

“state‐of‐the‐art”

We can also incorporate features using some NLP techniques, such as POS tagging, chunking, and semantic parsing

32

16

7/26/2015

Classifier Limited Training



Labels: Whether a phrase is a quality one or not

 

“support vector machine”:  1



“the experiment shows”:    0

 

For ~1GB corpus, only 300 labels Random Forest as our classifier



Predicted phrase quality scores lie in [0, 1]



Bootstrap many different datasets from limited labels

33

SegPhrase: Why Do We Need Phrasal Segmentation in Corpus? Phrasal segmentation can tell which phrase is more appropriate  Ex:  A standard [feature vector] [machine learning] setup is used to describe...



Not counted towards the rectified frequency

Rectified phrase frequency (expected influence)  Example:



34

17

7/26/2015

SegPhrase: Segmentation of Phrases Partition a sequence of words by maximizing the likelihood  Considering



Phrase quality score

 

ClassPhrase assigns a quality score for each phrase



Probability in corpus



Length penalty 

 

length penalty : when

1, it favors shorter phrases

Filter out phrases with low rectified frequency Bad phrases are expected to rarely occur in the segmentation results

35

SegPhrase+: Enhancing Phrasal Segmentation SegPhrase+: One more round for enhanced phrasal segmentation  Feedback  Using rectified frequency, re‐compute those features previously computed based on raw frequency  Process  Classification  Phrasal segmentation   // SegPhrase  Classification  Phrasal segmentation // SegPhrase+  Effects on computing quality scores  np hard in the strong sense  np hard in the strong  data base management system 

36

18

7/26/2015

Performance Study: Methods to Be Compared 

Other phase mining methods:  Methods to be compared NLP chunking based methods

 

Chunks as candidates



Sorted by TF‐IDF and C‐value (K. Frantzi et al., 2000) Unsupervised raw frequency based methods

 

ConExtr (A. Parameswaran et al., VLDB 2010)



ToPMine (A. El‐Kishky et al., VLDB 2015) Supervised method

 

KEA, designed for single document keyphrases (O. Medelyan & I. H. Witten, 2006)

37

Performance Study: Experimental Setting 





Datasets Dataset

#docs

#words

#labels

DBLP

2.77M

91.6M

300

Yelp

4.75M

145.1M

300

Popular Wiki Phrases  Based on internal links  ~7K high quality phrases Pooling  Sampled 500 * 7 Wiki‐uncovered phrases  Evaluated by 3 reviewers independently

38

19

7/26/2015

Performance: Precision Recall Curves on DBLP Compare with other baselines TF‐IDF C‐Value ConExtr KEA ToPMine SegPhrase+

Compare with our 3 variations TF‐IDF ClassPhrase SegPhrase SegPhrase+

39 39

Performance Study: Processing Efficiency 

SegPhrase+ is linear to the size of corpus!

40

20

7/26/2015

Experimental Results: Interesting Phrases Generated (From the Titles and Abstracts of SIGMOD) Query

SIGMOD

Method

SegPhrase+

Chunking (TF‐IDF & C‐Value)

1

data base

data base

2

database system

database system

3

relational database

query processing

4

query optimization

query optimization

5

query processing

relational database

…

…

…

51

sql server

database technology

52

relational data

database server

53

data structure

large volume

54

join query

performance study

55

web service

…

…

Only in SegPhrase+

web service

Only in Chunking

…

201

high dimensional data

202

location based service

efficient implementation sensor network

203

xml schema

large collection

204

two phase locking

important issue

205

deep web

frequent itemset

…

…

…

41

Experimental Results: Interesting Phrases Generated (From the Titles and Abstracts of SIGKDD) Query

SIGKDD

Method

SegPhrase+

1

data mining

Chunking (TF‐IDF & C‐Value) data mining

2

data set

association rule

3

association rule

knowledge discovery

4

knowledge discovery

frequent itemset

5

time series

decision tree

…

…

…

51

association rule mining

search space

52

rule set

domain knowledge

53

concept drift

importnant problem concurrency control

54

knowledge acquisition

55

gene expression data

conceptual graph

…

…

…

201

web content

Only in SegPhrase+

optimal solution

202

frequent subgraph

semantic relationship

203

intrusion detection

effective way

204

categorical attribute

space complexity

205

user preference

small set

…

…

…

Only in Chunking

42

21

7/26/2015

Experimental Results: Similarity Search Find high‐quality similar phrases based on user’s phrase query  In response to a user’s phrase query, SegPhrase+ generates high quality, semantically similar phrases  In DBLP, query on “data mining” and “OLAP”  In Yelp, query on “blu‐ray”, “noodle”, and “valet parking”



43

Recent Progress Distant Training:  No need of human labeling



Training using general knowledge bases

   

E.g., Freebase, Wikipedia

Quality Estimation for Unigrams Integration of phrases and unigrams in one uniform framework



Demo based on DBLP abstract



Multi‐languages:  Beyond English corpus



Extensible to mining quality phrases in multiple languages



Recent progress:  SegPhrase+ works on Chinese, Arabic and Spanish

44

22

7/26/2015

High Quality Phrases Generated from Chinese Wikipedia Rank

Phrase

In English

…

…

…

62

首席_执行官

CEO

63

中间_偏右

Middle‐right

…

…

…

84

百度_百科

Baidu Pedia

85

热带_气旋

Tropical cyclone

86

中国科学院_院士

Fellow of Chinese Academy of Sciences

…

…

…

1001

十大_中文_金曲

Top‐10 Chinese Songs

1002

全球_资讯网

Global News Website

1003

天一阁_藏_明代_科举_录_选刊

A Chinese book name

…

…

…

9934

国家_戏剧_院

National Theater

9935

谢谢_你

Thank you

…

…

…

45

Top-Ranked Phrases Generated from English Gigaword 

Northrop Grumman, Ashfaq Kayani, Sania Mirza, Pius Xii, Shakhtar Donetsk, Kyaw Zaw Lwin



Ratko Mladic, Abdolmalek Rigi, Rubin Kazan, Rajon Rondo, Rubel Hossain, bluefin tuna



Psv Eindhoven, Nicklas Bendtner, Ryo Ishikawa, Desmond Tutu, Landon Donovan, Jannie du Plessis



Zinedine Zidane, Uttar Pradesh, Thor Hushovd, Andhra Pradesh, Jafar_Panahi, Marouane Chamakh



Rahm Emanuel, Yakubu Aiyegbeni, Salva Kiir, Abdelhamid Abou Zeid, Blaise Compaore, Rickie Fowler



Andry Rajoelina, Merck Kgaa, Js Kabylie, Arjun Atwal, Andal Ampatuan Jnr, Reggio Calabria, Ghani Baradar



Mahela Jayawardene, Jemaah Islamiyah, quantitative easing, Nodar Kumaritashvili, Alviro Petersen



Rumiana Jeleva, Helio Castroneves, Koumei Oda, Porfirio Lobo, Anastasia Pavlyuchenkova



Thaksin Shinawatra, Evgeni_Malkin, Salvatore Sirigu, Edoardo Molinari, Yoshito Sengoku



Otago Highlanders, Umar Akmal, Shuaibu Amodu, Nadia Petrova, Jerzy Buzek, Leonid Kuchma,



Alona Bondarenko, Chosun Ilbo, Kei Nishikori, Nobunari Oda, Kumbh Mela, Santo_Domingo



Nicolae Ceausescu, Yoann Gourcuff, Petr Cech, Mirlande Manigat, Sulieman Benn, Sekouba Konate

46

23

7/26/2015

Summary and Future Work Strategy 1: Generate bag‐of‐words → generate sequence of tokens  Integrated complex model; phrase quality and topic inference rely on each other  Slow and overfitting  Strategy 2: Post bag‐of‐words model inference, visualize topics with n‐grams  Phrase quality relies on topic labels for unigrams  Can be fast; generally high‐quality topics and phrases  Strategy 3: Prior bag‐of‐words model inference, mine phrases and impose to the bag‐of‐words model  Topic inference relies on correct segmentation of documents, but not sensitive  Can be fast; generally high‐quality topics and phrases 

47

Summary and Future Work (Cont’d) 

SegPhrase+: A new phrase mining framework



Integrating phrase mining with phrasal segmentation



Requires only limited training or distant training



Generates high‐quality phrases, close to human judgement



Linearly scalable on time and space



Looking forward:  High‐quality, scalable phrase mining



Facilitate entity recognition and typing in large corpora (See the next part of this tutorial)



Combine with linguistic‐rich patterns



Transform massive unstructured data into semi‐structured knowledge networks

48

24

7/26/2015

References for Theme 1: Phrase Mining for Concept Extraction 

D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi‐Word Expressions, arXiv:0907.1013, 2009



K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition: Exploiting On‐Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991



M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents“, SDM’14



A. El‐Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB’15



K. Frantzi, S. Ananiadou, and H. Mima, Automatic Recognition of Multi‐Word Terms: the c‐value/nc‐value Method. Int. Journal on Digital Libraries, 3(2), 2000



R. V. Lindsey, W. P. Headden, III, M. J. Stipicevic. A phrase‐discovering topic model using hierarchical pitman‐yor processes, EMNLP‐CoNLL’12.



J. Liu, J. Shang, C. Wang, X. Ren, J. Han, Mining Quality Phrases from Massive Text Corpora. SIGMOD’15



O. Medelyan and I. H. Witten, Thesaurus Based Automatic Keyphrase Indexing. IJCDL’06



Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models, KDD’07



A. Parameswaran, H. Garcia‐Molina, and A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from Large Datasets. VLDB’10



X. Wang, A. McCallum, X. Wei. Topical n‐grams: Phrase and topic discovery, with an application to information retrieval, ICDM’07

49

Theme II: Entity Extraction and Linking by Relational Graph Construction and Propagation

50

25

7/26/2015

Session 1. Task and Traditional NLP Approach

Entity Extraction and Linking Identifying token span as entity mentions in documents and labeling their types Target Types FOOD LOCATION JOB_TITLE EVENT ORGANIZATION

…

The best BBQ I’ve tasted in Phoenix! I had the pulled pork sandwich with coleslaw and baked beans for lunch. ... The owner is very nice. …

The best BBQ:Food I’ve tasted in Phoenix:LOC ! I had the [pulled pork sandwich]:Food with coleslaw:Food and [baked beans]:Food for lunch. … The owner:JOB_TITLE is very nice. …

Text with typed entities

Plain text FOOD

LOCATION

EVENT

Enabling structured analysis of unstructured text corpus 52

26

7/26/2015

Traditional NLP Approach: Data Annotation Extracting and linking entities can be used in a variety of  ways:  serve as primitives for information extraction and knowledge base population  assist question answering,…  Traditional named entity recognition systems are designed for major types (e.g., PER, LOC, ORG) and general domains (e.g., news)  Require additional steps to adapt to new domains/types  Expensive human labor on annotation  500 documents for entity extraction; 20,000 queries for entity linking  Unsatisfying agreement due to various granularity levels and scopes of types  Entities obtained by entity linking techniques have limited coverage and freshness  >50% unlinkable entity mentions in Web corpus [Lin et al., EMNLP’12]  >90% in our experiment corpora: tweets, Yelp reviews, … 

53

Traditional NLP Approach: Feature Engineering Typical Entity Extraction Features (Li et al., 2012)  N‐gram: Unigram, bigram and trigram token sequences in the context window  Part‐of‐Speech: POS tags of the context words  Gazetteers: person names, organizations, countries and cities, titles, idioms, etc.  Word clusters: word clusters / embeddings  Case and Shape: Capitalization and morphology analysis based features  Chunking: NP and VP Chunking tags  Global feature: Sentence level and document level structure/position features



54

27

7/26/2015

Traditional NLP Approach: Feature Engineering 

Typical Entity Linking Features (Ji et al., 2011) Mention/Concept Attribute Name

Document surface

Entity Context Profiling Concept Topic KB Link Mining Popularity

55

Description

Spelling match

Exact string match, acronym match, alias match, string matching…

KB link mining Name Gazetteer Lexical

Name pairs mined from KB text redirect and disambiguation pages Organization and geo‐political entity abbreviation gazetteers

Position Genre Local Context Type Relation/Event Coreference

Web Frequency

Words in KB facts, KB text, mention name, mention text. Tf.idf of words and ngrams Mention name appears early in KB text Genre of the mention text (newswire, blog, …) Lexical and part‐of‐speech tags of context words Mention concept type, subtype Concepts co‐occurred, attributes/relations/events with mention Co‐reference links between the source document and the KB text Slot fills of the mention, concept attributes stored in KB infobox Ontology extracted from KB text Topics (identity and lexical similarity) for the mention text and KB text Attributes extracted from hyperlink graphs of the KB text Top KB text ranked by search engine and its length Frequency in KB texts

Session 2. ClusType: Entity Recognition and Typing by Relation Phrase-Based Clustering

28

7/26/2015

Entity Recognition and Typing: A Data Mining Solution 

Acquire labels for a small amount of instances



Construct a relational graph to connect labeled instances and unlabeled instances



Construct edges based on coarse‐grained data‐driven statistics instead of fine‐ grained linguistic similarity



Mention correlation



Text co‐occurrence



Semantic relatedness based on knowledge graph embeddings



Social networks Label propagation across the graph



57

Case Study 1: Entity Extraction 

Goal: recognizing entity mentions of target types with minimal/no human supervision and with no requirement that entities can be found in a KB.



Two kinds of efforts towards this goal:







Weak supervision: relies on manually selected seed entities in applying pattern‐based bootstrapping methods or label propagation methods to identify more entities Both assume seeds are unambiguous and sufficiently frequent  requires careful seed selection by human Distant supervision: leverages entity information in KBs to reduce human supervision (cont.)

58

29

7/26/2015

Typical Workflow of Distant Supervision  



Detect entity mentions from text Map candidate mentions to KB entities of target types Use confidently mapped {mention, type} to infer types of remaining candidate mentions

59

Problem Definition Distantly‐supervised entity recognition in a domain‐specific corpus  Given:  a corpus D  a knowledge base (e.g., Freebase)  a set of target types (T) from a KB  Detect candidate entity mentions from corpus D  Categorize each candidate mention by target types or Not‐Of‐Interest (NOI), with distant supervision



60

30

7/26/2015

Challenge I: Domain Restriction Most existing work assume entity mentions are already extracted by existing entity detection tools, e.g., noun phrase chunkers  Usually trained on general‐domain corpora like news articles (clean, grammatical)  Make use of various linguistics features (e.g., semantic parsing structures)  Do not work well on specific, dynamic or emerging domains (e.g., tweets, Yelp reviews)  E.g., “in‐and‐out” from Yelp review may not be properly detected



61

Challenge II: Name Ambiguity 

Multiple entities may share the same surface name

While Griffin is not the part of Washington’s plan on Sunday’s game, …

Sport team

…has concern that Kabul is an ally of Washington.

U.S. government

He has office in Washington, Boston and San Francisco

U.S. capital city



Previous methods simply output a single type/type distribution for each surface name, instead of an exact type for each entity mention

While Griffin is not the part of Washington’s plan on Sunday’s game, … … news from Washington indicates that the congress is going to… It is one of the best state parks in Washington.

Washington

State

or

Washington Sport Govern State team ‐ment

…

62

31

7/26/2015

Challenge III: Context Sparsity 

A variety of contextual clues are leveraged to find sources of shared semantics across different entities Keywords, Wiki concepts, linguistic patterns, textual relations, …

 



There are often many ways to describe even the same relation between two entities ID

Sentence

Freq

1

The magnitude 9.0 quake caused widespread devastation in [Kesennuma city]

12

2

… tsunami that ravaged [northeastern Japan] last Friday

31

3

The resulting tsunami devastate [Japan]’s northeast

244

Previous methods have difficulties in handling entity mention with sparse (infrequent) context

63

Our Solution Domain‐agnostic phrase mining algorithm: Extracts candidate entity mentions with minimal linguistic assumption  address domain restriction  E.g., part‐of‐speech (POS) tagging “Miami Heat”)  Incorporate user interests  Non‐linkable entity mention recognization and clustering 

175 Error Distribution

175

Transformation-Based Graph Querying Transformation‐based graph querying produces many results  How to suggest the “best” results to users?  Different transformations, should be weighted differently 

Ex.: Given a single node query, “Geoffrey Hinton” Nodes with “G. Hinton” (Abbreviation transformation) shall be ranked higher than Nodes with “Hinton” (Last token transformation)  How to determine them? Weights shall be learned  Evaluate a Candidate Match: Ranking Function  Features FV (v,  ( v ))   i f i ( v,  (v )) i  Node matching feature:   FE ( e,  ( e))    j g j ( e,  ( e))  Edge matching feature: j  Matching Score: P ( (Q ) | Q )  exp(  FV ( v,  ( v ))   FE ( e,  ( e))) 176

vVQ

eEQ

88

7/26/2015

Potential Challenges in Applying Graph Indices + Search to NLP The graphs automatically constructed from natural language texts might:





Include many more diverse types and weights



Include more ambiguity



Knowledge sparsity; hard to generalize unique nodes/substructures



Include more noise and errors

177

Mining Frequent (Sub)Graph Patterns 

Data mining research has developed many scalable graph pattern mining methods



Given a labeled graph dataset D = {G1, G2, …, Gn), the supporting graph set of a subgraph g is Dg = {Gi | g  Gi, Gi D}.



support(g) = |Dg|/ |D| A (sub)graph g is frequent if support(g) ≥ min_sup



OH

Ex.: Chemical structures



N

S

N

O

 

 178

Alternative: Mining frequent subgraph patterns from a single large graph or network Documents can be viewed as graph structures as well

Graph Dataset

O

O

HO

O

(A)

N

O

(B)

min_sup = 2

O

N

(C) Frequent Graph Patterns O

(1) support = 67%

(2) N

O

N N

89

7/26/2015

Efficient Graph Pattern Mining Methods 

Effective Graph pattern mining methods have been developed for different scenarios Mining graph “transaction” datasets

 

Ex. gSpan (Yan et al, ICDM’02), CloseGraph (Yan et al., KDD’03), … Mining frequent large subgraph structures in single massive network

 

Ex. Spider‐Mine (Zhu, et al., VLDB’11)



Graph pattern mining forms building blocks for graph classification, clustering, compression, comparison, and correlation analysis



Graph indexing and graph similarity search gIndex (SIGMOD’05):  Graph indexing by graph pattern mining

 

Help precise as well as similarity‐based graph query search and answering

179

Mining Collaboration Patterns in DBLP Networks Data description:  600 top confs, 9 major CS areas, 15071 authors in DB/DM  Author labeled by # of papers published in DB/DM  Prolific (P): >=50, Senior (S):  20~49, Junior (J): 10~19, Beginner(B):  5~9 

Patterns found

Instance in data

Instance in data

Patterns found Instance in data Instance in data Patterns found 180

90

7/26/2015

Applications of Graph Pattern Mining Bioinformatics

 

Gene networks, protein interactions, metabolic pathways



Chem‐informatics:  Mining chemical compound structures



Social networks, web communities, tweets,  …



Cell phone networks, computer networks, …



Web graphs, XML structures, semantic Web, information networks



Software engineering: Program execution flow analysis



Building blocks for graph classification, clustering, compression, comparison, and correlation analysis



Graph indexing and graph similarity search

181

Logistic Regression-Based Prediction Model Training and test pair:  =



A‐P‐A‐P‐A

A‐P‐V‐P‐A

A‐P‐T‐P‐A

A‐P‐>P‐A

A‐P‐A

4

5

100

3

Yes = 1

0

1

20

2

No = 0

Logistic Regression Model  Model the probability for each relationship as



   

is the coefficients for each feature (including a constant 1) MLE estimation Maximize the likelihood of observing all the relationships in the training data

 182

91

7/26/2015

When Will It Happen?  

From “whether” to “when” “Whether”: Will Jim rent the movie “Avatar” in Netflix? Output: P(X=1)=?



“When”: When will Jim rent the movie “Avatar”?

Output: A distribution of time!

What is the probability Jim will rent “Avatar” within 2 months? By when Jim will rent “Avatar” with 90% probability? : 0.9  What is the expected time it will take for Jim to rent “Avatar”?  

2

May provide useful information to supply chain management

183

The Relationship Building Time Prediction Model Solution  Directly model relationship building time: P(Y=t)  Geometric distribution, Exponential distribution, Weibull distribution  Use generalized linear model  Deal with censoring (relationship builds beyond the observed time T: Right interval)



Censoring

Generalized Linear Model under Weibull Distribution Assumption

Training Framework 184

92

slides in pdf

Short Description

Description

Comments