A System For Large-Scale Bootstrapping of Synonyms in Taxonomies

Martijn Spitters, Remko Bonnema, Mihai Rotaru, Jakub Zavrel (Textkernel BV)

Taxonomies in IE Systems
Automatic Taxonomy Enrichment
Experiments
Results
◦  String similarity metrics
◦  Linguistic preprocessing
◦  Semantic similarity metrics
◦  Weighted semantic overlap
Conclusions

 

Practical IE systems require mapping of extracted strings to domain-specific taxonomies

 

Taxonomies
◦  client-specific
◦  represent partially similar, partially different categorizations of the domain
◦  two strings can be synonyms in one taxonomy, but be assigned to different concepts in another, more fine-grained taxonomy
◦  concepts need to be enriched with knowledge about typical terminology variation (synonyms)

59 cardioloog analist cardiologie hartspecialist medisch specialist cardiologie arts-assistent cardiologie arts hart- en vaatziekten Cardioloog thoracaal chirurg medisch specialist cardiothoracale chirurgie hart/long chirurg thorax chirurg -  cardiothoracale chirurgie 60 internist -  interne geneeskunde -  inwendige geneeskunde 61 dermatoloog medisch specialist dermatologie en venerologie specialist dermatologie huidarts -  huidtherapeut 62 kinderarts jeugdarts arts jeugdgezondheidszorg pediatrie specialist pediatrie Jeugdgezondheidszorgarts jeugdgezondheidszorg verpleegkundige -  arts kindergeneeskunde - (...)

0-4138 Specialist chirurg cardioloog medisch specialist gynaecoloog oncoloog -  dermatoloog -  internist 0-4139 Verplegend / verzorgend -  verpleegster -  medewerker verzorging -  verplegend personeel -  medewerker verpleging/verzorging -  medewerker groepsverzorging -  verpleeghulp -  medewerkenden verzorging -  verpleger -  verplegende -  wijkverpleegkundige -  ccu verpleegkundigen -  dialyseverpleegkundige -  verpleegkundige cardiologie -  a-verpleegkundige(n) -  geriatrisch verpleegkundige -  arboverpleegkundige -  verpleegkundige pulmonologie -  ziekenverzorgenden (...)

New synonym taxonomies are enriched manually (very labor-intensive)
Process supported with simple string-similarity-based methods
String similarity is not sufficient

Example concept:
◦  concept code: PROGR
◦  concept description: Programmeur
◦  synonyms: programmeurs, Java programmeur, programmeur C++, software designer, sr Oracle Java/J2EE developers, software engineer, webdeveloper

String-similar job titles that are not synonyms: CNC Programmeur, reisprogrammeur, programmeur/werkvoorbereider, theaterprogrammeur

 

Goal: reduce manual effort for adding synonyms to new taxonomies

 

Exploit knowledge in existing taxonomies as a set of ‘semantic contexts’ for known terminology

 

Implementation:

◦  Uses the University of Sheffield’s SimMetrics library
◦  Input:
     Set of manually enriched taxonomies
     New ‘target’ taxonomy
◦  Output:
     The target taxonomy, enriched with synonyms

 

Two main steps:
1.  Attractor selection for concepts from the new taxonomy (string-based)
2.  Link remaining known instances to attractors (semantic)

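Instance matrix (one row per known instance, one column per enriched taxonomy; "-" means the instance does not occur in that taxonomy):

Instance   | Taxonomy A | Taxonomy B | Taxonomy C
Instance 1 | code Ax    | code Bq    | code Ci
Instance 2 | code Ax    | -          | code Cj
Instance 3 | -          | -          | code Cj
:          | -          | -          | code Ck
:          | -          | code Br    | code Ck
:          | code Ay    | code Br    | -
:          | -          | -          | code Cl
:          | code Ay    | code Bs    | code Cm
:          | -          | -          | code Cm
:          | code Az    | code Bt    | -
:          | code Az    | code Bu    | code Cn
:          | -          | code Bu    | code Cn
:          | code Az    | code Bv    | -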
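The instance matrix above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the function name `build_instance_matrix`, the `MISSING` marker and the toy taxonomies are assumptions made for the example, as is the rule that an instance has at most one code per taxonomy.

```python
from collections import defaultdict

MISSING = "-"  # marks "instance does not occur in this taxonomy"


def build_instance_matrix(taxonomies):
    """Build one row of concept codes per instance string.

    taxonomies: {taxonomy_name: {concept_code: [synonym, ...]}}
    Returns (columns, matrix): the sorted taxonomy names and a dict that
    maps each lower-cased instance to a list with one code per taxonomy.
    """
    columns = sorted(taxonomies)
    matrix = defaultdict(lambda: [MISSING] * len(columns))
    for col, name in enumerate(columns):
        for code, synonyms in taxonomies[name].items():
            for synonym in synonyms:
                # Assumption: an instance has at most one code per taxonomy;
                # a later occurrence would overwrite an earlier one.
                matrix[synonym.strip().lower()][col] = code
    return columns, dict(matrix)


# Toy data (hypothetical, not taken from the real taxonomies).
taxonomies = {
    "A": {"Ax": ["programmeur", "software engineer"]},
    "B": {"Bq": ["programmeur"], "Br": ["accountant"]},
    "C": {"Ci": ["software engineer"], "Cj": ["accountant"]},
}
columns, matrix = build_instance_matrix(taxonomies)
for instance, row in sorted(matrix.items()):
    print(instance, row)
```

Each row is the 'meaning vector' of an instance: the combination of concept codes it received in the existing taxonomies.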

 

Link target taxonomy concepts to synonyms (attractors)

 

Based on string-similarity between concept description and the synonyms

 

Attractors act as ‘representatives’ in the matrix of the new concept

[Figure: concepts of the non-enriched (target) taxonomy, PGR (programmeur), ACT (accountant) and PRJ (projectmanager), linked to attractor instances in the instance matrix: programmeur, C++ programmeur, accountant, RA, Project manager]
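Attractor selection can be sketched as a thresholded best-match search between concept descriptions and known instances. The sketch below is an assumption-laden stand-in: it uses Python's `difflib.SequenceMatcher` instead of a SimMetrics metric, and the function names and the threshold of 0.85 are illustrative only.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT one of the SimMetrics metrics."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_attractors(concepts, instances, threshold=0.9):
    """Link each instance to at most one target concept.

    concepts: {concept_code: concept_description}
    instances: iterable of instance strings from the instance matrix
    An instance becomes an attractor of the best-scoring concept, but only
    if that score reaches the threshold (high precision matters here).
    """
    attractors = {code: [] for code in concepts}
    for instance in instances:
        best_code, best_score = None, threshold
        for code, description in concepts.items():
            score = string_similarity(description, instance)
            if score >= best_score:
                best_code, best_score = code, score
        if best_code is not None:
            attractors[best_code].append(instance)
    return attractors


concepts = {"PGR": "programmeur", "ACT": "accountant", "PRJ": "projectmanager"}
instances = ["programmeur", "accountant", "Project manager", "RA"]
attractors = select_attractors(concepts, instances, threshold=0.85)
print(attractors)
```

An instance such as "RA" has no string overlap with "accountant", so a purely string-based step cannot attract it; that is what the second, semantic step is for.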

 

Character-based metrics
◦  Levenshtein
◦  Q-Grams
◦  Jaro
◦  Jaro-Winkler
◦  Needleman-Wunsch

 

Token-based metrics
◦  Dice
◦  Jaccard
◦  Cosine
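To make the difference between character-based and token-based metrics concrete, here is a minimal pure-Python sketch of three of them. These are textbook formulations and may differ in normalization and padding from the SimMetrics versions used in the system; the function names are chosen for this example.

```python
from collections import Counter


def levenshtein_distance(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Edit distance normalized to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def qgrams(s, q=3):
    """Character q-grams of s, padded with '#' at both ends."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def qgram_dice(a, b, q=3):
    """Dice coefficient over the multisets of character q-grams."""
    grams_a, grams_b = Counter(qgrams(a, q)), Counter(qgrams(b, q))
    shared = sum((grams_a & grams_b).values())
    total = sum(grams_a.values()) + sum(grams_b.values())
    return 2.0 * shared / total if total else 1.0


def jaccard_tokens(a, b):
    """Jaccard coefficient over the sets of whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(levenshtein_similarity("programmeur", "programmeurs"))
print(qgram_dice("programmeur", "theaterprogrammeur"))
print(jaccard_tokens("Java programmeur", "programmeur C++"))
```

Token-based metrics ignore word order, so "Java programmeur" and "programmeur C++" share one of three distinct tokens, while character-based metrics are more sensitive to inflection and compounding.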

 

Assign remaining instances to concept attractors of the target taxonomy

 

Based on semantic similarity: compare ‘meaning vectors’ (the matrix rows)

 

Meaning: the combination of different ‘taxonomy contexts’ of an instance

[Figure: the target concepts PGR (programmeur), ACT (accountant) and PRJ (projectmanager) with their linked attractor instances, followed by the remaining instances; each row holds the instance's concept codes in the other taxonomies]

Extraction field: ‘job title’ (resumes)
Language: Dutch
21 enriched synonym taxonomies
◦  3 for development
◦  9 for testing
Only taxonomies with good concept descriptions (12/21) could be used for testing
Discard ‘Unknown/Other’ concepts

 

Leave-one-out evaluation:
◦  For each test taxonomy t:
     t_empty is the empty, non-enriched version of t
     t_gold is the manually enriched version of t
     t_enriched is the automatically enriched version of t
   1.  Create matrix M_without_t of all other taxonomies
   2.  Enrich t_empty with instances from matrix M_without_t
   3.  Compare t_enriched with t_gold

Discard from evaluation:
◦  synonyms in t_gold that are not in M_without_t
◦  synonyms in t_enriched that are not in t_gold
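The comparison step and the two discard rules can be sketched as follows. This is a minimal illustration under stated assumptions: taxonomies are plain dicts from concept code to a set of synonyms, every synonym belongs to exactly one gold concept, and the function name `micro_precision_recall` is made up for the example.

```python
def micro_precision_recall(enriched, gold, known_instances):
    """Instance-pivoted, micro-averaged precision and recall.

    enriched, gold: {concept_code: set of synonyms}
    known_instances: all instances present in the matrix built without t
    """
    # Rule 1: gold synonyms that the matrix has never seen cannot be found.
    gold_concept = {
        synonym: concept
        for concept, synonyms in gold.items()
        for synonym in synonyms
        if synonym in known_instances
    }
    correct = predicted = 0
    for concept, synonyms in enriched.items():
        for synonym in synonyms:
            # Rule 2: predictions absent from the gold taxonomy cannot be judged.
            if synonym not in gold_concept:
                continue
            predicted += 1
            correct += gold_concept[synonym] == concept
    relevant = len(gold_concept)
    precision = correct / predicted if predicted else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall


# Toy data (hypothetical).
gold = {"PGR": {"programmeur", "software engineer", "webdeveloper"},
        "ACT": {"accountant", "ra"}}
known = {"programmeur", "software engineer", "accountant", "ra", "huisarts"}
enriched = {"PGR": {"programmeur", "ra", "huisarts"}, "ACT": {"accountant"}}
precision, recall = micro_precision_recall(enriched, gold, known)
print(precision, recall)
```

In this toy case "webdeveloper" is dropped from the gold set and "huisarts" from the predictions, leaving two correct links out of three judged predictions and four reachable gold synonyms.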

[Figure: Evaluation of string-similarity metrics (instance-pivoted). Micro-averaged precision plotted against micro-averaged recall for jaccard (token), dice (token), cosine (token), q-grams, levenshtein, jaro, jaro-winkler and needleman-wunsch.]

Jaro/Jaro-Winkler best in 92-97% precision region

Q-Grams best in <85% precision region

 

Induction of Linguistic Knowledge group (ILK, Tilburg University)

 

A modular system integrating a tagger, lemmatizer, and morphological segmenter based on TiMBL and MBT

 

Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (2007) An efficient memory-based morphosyntactic tagger and parser for Dutch. In: Selected Papers of the 17th CLIN Meeting, Leuven, Belgium.

Original string | Lemma(s) | Morphological tokens
manager ingenieursbureau (electrotechniek) | manager ingenieursbureau electrotechniek | manager Ingenieur bureau electro technisch
onderhoudstimmerlieden | onderhoudstimmerman | onderhoud timmerman
poli-assistenten bloedafname | poli assistent bloedafname | poli assisteer bloed neeme
postkamermedewerker(s) | postkamermedewerker | post kamer mede werk
rechtsbijstandsjurist | rechtsbijstandsjurist | recht bijstand jurist

[Figure: Evaluation of Q-Grams metric (instance-pivoted) with linguistic preprocessing. Curves: q-grams, q-grams (lemma), q-grams (morph).]

[Figure: Evaluation of Cosine metric (instance-pivoted) with linguistic preprocessing, plotted against micro-averaged recall. Curves: cosine, cosine (lemma), cosine (morph).]

 

F-Measure not suitable for threshold optimization: high precision essential (faulty attractor potentially attracts many new errors in step 2)

 

Optimize threshold on ‘maximum coverage precision’: macro-averaged precision (concept-pivoted) where concepts without an attractor are counted as 0
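A minimal sketch of this measure, assuming attractors and gold synonyms are given as dicts from concept code to a set of strings. The function name and the toy data are hypothetical, and the discard rules of the evaluation are left out for brevity.

```python
def maximum_coverage_precision(attractors, gold):
    """Macro-averaged, concept-pivoted precision of the selected attractors.

    attractors, gold: {concept_code: set of instance strings}
    A gold concept that received no attractor at all contributes 0, so a
    threshold that is too strict is penalized for leaving concepts empty.
    """
    scores = []
    for concept, gold_synonyms in gold.items():
        selected = attractors.get(concept, set())
        if not selected:
            scores.append(0.0)  # concept without attractor counts as 0
        else:
            scores.append(len(selected & gold_synonyms) / len(selected))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical).
gold = {
    "PGR": {"programmeur", "software engineer"},
    "ACT": {"accountant", "ra"},
    "PRJ": {"project manager"},
}
attractors = {
    "PGR": {"programmeur", "theaterprogrammeur"},  # one right, one wrong
    "ACT": {"accountant"},                         # all right
}                                                  # PRJ: no attractor
score = maximum_coverage_precision(attractors, gold)
print(score)
```

The string-similarity threshold would then be chosen as the value that maximizes this score on the development taxonomies.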

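[Figure: two worked examples comparing a pair of instance vectors over taxonomies T1 ... Tn, with per-taxonomy contributions of +1 and +0.5]

\[
\mathrm{Overlap} = \frac{\sum_{i=1}^{n} \mathrm{Sim}(x_i, y_i)}{n}
\]

where Sim is defined either as

\[
\mathrm{Sim}(x_i, y_i) =
\begin{cases}
0.0 & \text{if } x_i \neq y_i \\
1.0 & \text{if } x_i = y_i
\end{cases}
\]

or as

\[
\mathrm{Sim}(x_i, y_i) =
\begin{cases}
0.0 & \text{if } x_i \neq y_i \\
1.0 & \text{if } x_i = y_i \\
0.5 & \text{if } x_i = \omega \lor y_i = \omega
\end{cases}
\]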
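The overlap definition above can be sketched in a few lines. Several points are assumptions rather than facts from the slides: ω is taken to be the empty cell ("-") of the instance matrix, the 0.5 case is given precedence over the other two cases when it applies, and the names `OMEGA`, `sim` and `overlap` are chosen for this example.

```python
OMEGA = "-"  # assumption: omega denotes an empty cell in the instance matrix


def sim(xi, yi, half_for_omega=False):
    """Per-taxonomy similarity of two concept codes."""
    # Assumption: when enabled, the 0.5 case takes precedence over the others.
    if half_for_omega and (xi == OMEGA or yi == OMEGA):
        return 0.5
    return 1.0 if xi == yi else 0.0


def overlap(x, y, half_for_omega=False):
    """Mean per-taxonomy similarity of two meaning vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must cover the same taxonomies")
    if not x:
        return 0.0
    return sum(sim(xi, yi, half_for_omega) for xi, yi in zip(x, y)) / len(x)


x = ["A", "Q", "X", "-"]
y = ["A", "Q", "-", "X"]
print(overlap(x, y))                       # two-case definition
print(overlap(x, y, half_for_omega=True))  # three-case definition
```

With these two vectors the two-case definition yields 0.5 (two matches out of four positions), while the three-case definition yields 0.75 because each position with an empty cell adds 0.5.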

[Figure: Step 1: string-similarity metrics @ optimal threshold; Step 2: default overlap metric. Curves: Q-Grams, Levenshtein, Jaro, Jaro-Winkler, Cosine.]

[Figure: Evaluation of semantic similarity metrics (instance-pivoted). Curves: Baseline (Jaro-Winkler), L1, L2, Q-Grams+Overlap, Overlap (sim), Overlap (sim+diff). Annotated values: 26.2 and 49.1.]

Sparseness: matching vectors of length 1 -> score=1.0

[Figure: Evaluation of semantic similarity metrics (instance-pivoted). Curves: Baseline (Jaro), Overlap (sim), Overlap (sim+diff). Annotated values: 25.0 and 50.1.]

Step 1 metric | Step 2 metric | Max F0.5 | Precision | Recall | Recall @ P=90% | Recall @ P=95%
Jaro-Winkler | Overlap (sim+diff) | 75.4 | 83.5 | 56.0 | 51.5 | 34.2
Jaro-Winkler | Overlap (sim) | 75.1 | 85.4 | 55.8 | 51.2 | 30.7
Jaro | Overlap (sim+diff) | 75.7 | 84.3 | 55.7 | 51.2 | 38.2
Jaro | Overlap (sim) | 74.9 | 86.0 | 54.7 | 51.2 | 34.1

  18 - 

Sparseness can be a problem:

Vrouwenarts gynaecoloog

attractor (...)

26 - 

Wijkverpleegkundige

15 -

Gezondheid / (Para)medisch / Zorg drogist bejaardenhelpenden begrafenisondernemer verpleegk. mbo v verpleegkundige mbov assistent ziekenverzorger wijkverpleegkundige gezinstherapeut klinisch chemisch analisten vrouwenarts gynaecoloog huisarts (...)

CVG_12 Medisch/verzorging verpleegkundige ugor verpleegkundige wijkverpleegkundige -  apothekersassistent -  apotheker -  gynaecoloog -  (...)

WIJKVERPLEEGKUNDIGEN WIJKVERPLEGER

 

Weighting the similarity scores by the granularity of each taxonomy in the matrix might help
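One way such a weighting could look is sketched below. The slides do not specify the weighting scheme, so everything here is a hypothetical illustration: the number of concepts in a taxonomy is used as a proxy for its granularity, the logarithmic weight is an arbitrary choice, and positions with an empty cell are simply not counted as matches.

```python
import math

MISSING = "-"  # empty cell in the instance matrix


def granularity_weights(concept_counts):
    """Hypothetical weights: taxonomies with more concepts count for more."""
    return [math.log(1 + count) for count in concept_counts]


def weighted_overlap(x, y, weights):
    """Weighted share of taxonomies in which both instances get the same code."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    total = sum(weights)
    if total == 0:
        return 0.0
    matched = sum(w for xi, yi, w in zip(x, y, weights)
                  if xi == yi and xi != MISSING)
    return matched / total


# Two instances that agree in two coarse taxonomies but differ in a fine one.
# Codes and concept counts are made up for illustration.
x = ["C1", "C7", "F203"]
y = ["C1", "C7", "F250"]
uniform = weighted_overlap(x, y, [1, 1, 1])
weighted = weighted_overlap(x, y, granularity_weights([10, 12, 400]))
print(uniform, weighted)
```

Agreement inside a very coarse concept says little about synonymy, so down-weighting coarse taxonomies lowers the score of pairs that only match there.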

[Figure: Evaluation of weighted overlap (step 1: Jaro-Winkler), plotted against micro-averaged recall. Curves: Overlap (sim), Weighted Overlap (sim).]

 

A novel method for weakly supervised, fully automatic enrichment of taxonomies:
1.  high-precision, string-similarity-based attractor selection
2.  linking of remaining synonym candidates to attractors by comparing meaning vectors

 

Results: automatic population of a new taxonomy with synonyms. Precision > 90%, recall > 55%

 

Step 2 doubles recall of step 1 at precision > 90%

 

This will save at least 55% of our manual work (probably a lot more, but this remains to be tested)
