A System for Large-Scale Bootstrapping of Synonyms in Taxonomies
Martijn Spitters, Remko Bonnema, Mihai Rotaru, Jakub Zavrel
Textkernel BV
Taxonomies in IE Systems
Automatic Taxonomy Enrichment
Experiments
Results
◦ String similarity metrics
◦ Linguistic preprocessing
◦ Semantic similarity metrics
◦ Weighted semantic overlap
Conclusions
Practical IE systems require mapping extracted strings to domain-specific taxonomies
Taxonomies
◦ are client-specific
◦ represent partially similar, partially different categorizations of the domain
◦ two strings can be synonyms in one taxonomy but be assigned to different concepts in another, more fine-grained taxonomy
◦ concepts need to be enriched with knowledge about typical terminology variation (synonyms)
59 cardioloog analist cardiologie hartspecialist medisch specialist cardiologie arts-assistent cardiologie arts hart- en vaatziekten Cardioloog thoracaal chirurg medisch specialist cardiothoracale chirurgie hart/long chirurg thorax chirurg - cardiothoracale chirurgie 60 internist - interne geneeskunde - inwendige geneeskunde 61 dermatoloog medisch specialist dermatologie en venerologie specialist dermatologie huidarts - huidtherapeut 62 kinderarts jeugdarts arts jeugdgezondheidszorg pediatrie specialist pediatrie Jeugdgezondheidszorgarts jeugdgezondheidszorg verpleegkundige - arts kindergeneeskunde - (...)
0-4138 Specialist chirurg cardioloog medisch specialist gynaecoloog oncoloog - dermatoloog - internist 0-4139 Verplegend / verzorgend - verpleegster - medewerker verzorging - verplegend personeel - medewerker verpleging/verzorging - medewerker groepsverzorging - verpleeghulp - medewerkenden verzorging - verpleger - verplegende - wijkverpleegkundige - ccu verpleegkundigen - dialyseverpleegkundige - verpleegkundige cardiologie - a-verpleegkundige(n) - geriatrisch verpleegkundige - arboverpleegkundige - verpleegkundige pulmonologie - ziekenverzorgenden (...)
New synonym taxonomies are enriched manually (very labor-intensive)
◦ The process is supported with simple string-similarity-based methods
◦ String similarity alone is not sufficient
Example taxonomy entry (figure labels: concept code, concept description, synonyms)
PROGR Programmeur
◦ programmeurs
◦ Java programmeur
◦ programmeur C++
◦ software designer
◦ sr Oracle Java/J2EE developers
◦ software engineer
◦ webdeveloper
Other strings shown in the figure: CNC Programmeur, reisprogrammeur, programmeur/werkvoorbereider, theaterprogrammeur
Goal: reduce manual effort for adding synonyms to new taxonomies
Exploit knowledge in existing taxonomies as a set of ‘semantic contexts’ for known terminology
Implementation:
◦ Uses the University of Sheffield’s SimMetrics library
◦ Input: a set of manually enriched taxonomies and a new ‘target’ taxonomy
◦ Output: the target taxonomy, enriched with synonyms
Two main steps:
1. Attractor selection for concepts from the new taxonomy (string-based)
2. Link remaining known instances to attractors (semantic)
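Instance matrix (one row per known instance, one column per enriched taxonomy; "-" means the instance does not occur in that taxonomy)
Instance | Taxonomy A | Taxonomy B | Taxonomy C
Instance 1 | code Ax | code Bq | code Ci
Instance 2 | code Ax | - | code Cj
Instance 3 | - | - | code Cj
... | - | - | code Ck
... | - | code Br | code Ck
... | code Ay | code Br | -
... | - | - | code Cl
... | code Ay | code Bs | code Cm
... | - | - | code Cm
... | code Az | code Bt | -
... | code Az | code Bu | code Cn
... | - | code Bu | code Cn
... | code Az | code Bv | -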
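As an illustration of the matrix above, the following is a minimal sketch of how such an instance matrix could be built. The input format (taxonomy name mapped to concept codes and synonym lists) and the function name `build_instance_matrix` are assumptions made for this example, not the system's actual data model; `None` stands for the "-" cells.

```python
from typing import Dict, List, Optional

# Hypothetical input format: taxonomy name -> {concept code -> list of synonyms}.
Taxonomy = Dict[str, List[str]]


def build_instance_matrix(
    taxonomies: Dict[str, Taxonomy],
) -> Dict[str, List[Optional[str]]]:
    """Map every known instance to its concept code in each taxonomy.

    Columns follow the sorted taxonomy names; None marks "not present".
    Assumption: an instance occurs in at most one concept per taxonomy
    (a later occurrence would overwrite an earlier one).
    """
    names = sorted(taxonomies)
    matrix: Dict[str, List[Optional[str]]] = {}
    for col, name in enumerate(names):
        for code, synonyms in taxonomies[name].items():
            for synonym in synonyms:
                key = synonym.strip().lower()
                row = matrix.setdefault(key, [None] * len(names))
                row[col] = code
    return matrix


taxonomies = {
    "A": {"Ax": ["Instance 1", "Instance 2"]},
    "B": {"Bq": ["Instance 1"]},
    "C": {"Ci": ["Instance 1"], "Cj": ["Instance 2", "Instance 3"]},
}
matrix = build_instance_matrix(taxonomies)
for instance, row in matrix.items():
    print(instance, row)
```

Each row is the instance's "meaning vector": the combination of concept codes it received in the different taxonomies.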
Link target taxonomy concepts to synonyms (attractors)
Based on string-similarity between concept description and the synonyms
Attractors act as ‘representatives’ in the matrix of the new concept
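[Figure: attractor selection example. The non-enriched (target) taxonomy contains the concepts PGR (programmeur), ACT (accountant) and PRJ (projectmanager); the instance matrix contains the instances accountant, RA, C++ programmeur, programmeur and Project manager.]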
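Attractor selection can be sketched as a thresholded string comparison between each concept description and every known instance. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in similarity (it is not one of the SimMetrics metrics), and the threshold of 0.9 and the function names are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable, List


def ratio(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (difflib, not SimMetrics)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_attractors(
    concepts: Dict[str, str],
    instances: Iterable[str],
    sim: Callable[[str, str], float] = ratio,
    threshold: float = 0.9,
) -> Dict[str, List[str]]:
    """For each target concept, keep the instances whose similarity to the
    concept description reaches the (assumed) threshold."""
    candidates = list(instances)
    return {
        code: [inst for inst in candidates if sim(description, inst) >= threshold]
        for code, description in concepts.items()
    }


concepts = {"PGR": "programmeur", "ACT": "accountant", "PRJ": "projectmanager"}
instances = ["accountant", "RA", "C++ programmeur", "programmeur", "Project manager"]
attractors = select_attractors(concepts, instances)
print(attractors)
```

With this stand-in metric and threshold, "C++ programmeur" and "RA" are not selected as attractors; they would be left for the semantic linking step.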
Character-based metrics
◦ Levenshtein
◦ Q-Grams
◦ Jaro
◦ Jaro-Winkler
◦ Needleman-Wunsch
Token-based metrics
◦ Dice
◦ Jaccard
◦ Cosine
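To make the two metric families concrete, here is a small self-contained sketch of one character-based metric (Levenshtein) and two token-based metrics (Jaccard and Dice). These are textbook definitions written for illustration; the normalization and tokenization used by SimMetrics may differ, and whitespace tokenization is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-based: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def _tokens(s: str) -> set:
    # Assumption: lowercase whitespace tokenization.
    return set(s.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-based: shared tokens divided by all distinct tokens."""
    ta, tb = _tokens(a), _tokens(b)
    return 1.0 if not (ta or tb) else len(ta & tb) / len(ta | tb)


def dice(a: str, b: str) -> float:
    """Token-based: twice the shared tokens divided by the total token count."""
    ta, tb = _tokens(a), _tokens(b)
    return 1.0 if not (ta or tb) else 2 * len(ta & tb) / (len(ta) + len(tb))


print(levenshtein("programmeur", "programmeurs"))      # 1
print(jaccard("Java programmeur", "programmeur C++"))  # one shared token out of three
print(dice("Java programmeur", "programmeur C++"))     # 0.5
```

Character-based metrics tolerate inflection ("programmeurs"), while token-based metrics tolerate word reordering ("programmeur C++" against "C++ programmeur").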
Assign remaining instances to concept attractors of the target taxonomy
Based on semantic similarity: compare ‘meaning vectors’ (the matrix rows)
Meaning: the combination of different ‘taxonomy contexts’ of an instance
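[Figure: linking example. The target concepts PGR (programmeur), ACT (accountant) and PRJ (projectmanager) are shown with their linked attractor instances; the remaining instances are shown as rows of concept codes from the instance matrix, to be compared against the attractor rows.]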
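The linking step can be sketched as a nearest-attractor assignment over matrix rows. Everything specific here is an assumption for illustration: the choice to assign an instance to the single best-scoring attractor, the threshold of 0.5, the simple stand-in similarity `exact_overlap`, and the example codes and the instances "software engineer", "boekhouder" and "kok".

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = Sequence[Optional[str]]


def exact_overlap(x: Vector, y: Vector) -> float:
    """Stand-in similarity: share of positions where both rows hold the same code."""
    return sum(a is not None and a == b for a, b in zip(x, y)) / len(x)


def link_instances(
    attractors: Dict[str, List[str]],
    matrix: Dict[str, Vector],
    sim: Callable[[Vector, Vector], float] = exact_overlap,
    threshold: float = 0.5,
) -> Dict[str, List[str]]:
    """Assign each non-attractor instance to the concept of its most similar
    attractor, provided the similarity reaches the (assumed) threshold."""
    used = {a for insts in attractors.values() for a in insts}
    linked: Dict[str, List[str]] = {code: [] for code in attractors}
    for instance, vector in matrix.items():
        if instance in used:
            continue
        best_code, best_score = None, -1.0
        for code, insts in attractors.items():
            for attractor in insts:
                if attractor not in matrix:
                    continue
                score = sim(vector, matrix[attractor])
                if score > best_score:  # first concept wins ties
                    best_code, best_score = code, score
        if best_code is not None and best_score >= threshold:
            linked[best_code].append(instance)
    return linked


# Hypothetical matrix rows (codes from three enriched taxonomies).
matrix = {
    "programmeur": ["X", "B", None],
    "accountant": ["Y", "R", "Q"],
    "software engineer": ["X", "B", "A"],
    "boekhouder": ["Y", "R", None],
    "kok": ["Z", None, "C"],
}
attractors = {"PGR": ["programmeur"], "ACT": ["accountant"]}
linked = link_instances(attractors, matrix)
print(linked)
```

Note that "software engineer" is linked to PGR although it shares no characters with "programmeur": the link comes entirely from the taxonomy contexts the two instances share.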
Extraction field: ‘job title’ (resumes)
Language: Dutch
21 enriched synonym taxonomies
◦ 3 for development
◦ 9 for testing
Only taxonomies with good concept descriptions (12/21) could be used for testing
Discard ‘Unknown/Other’ concepts
Leave-one-out evaluation:
◦ For each test taxonomy t:
  t_empty is the empty, non-enriched version of t
  t_gold is the manually enriched version of t
  t_enriched is the automatically enriched version of t
  1. Create matrix M_without_t of all other taxonomies
  2. Enrich t_empty with instances from matrix M_without_t
  3. Compare t_enriched with t_gold
Discard from evaluation:
◦ synonyms in t_gold that are not in M_without_t
◦ synonyms in t_enriched that are not in t_gold
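The comparison of t_enriched with t_gold can be illustrated with micro-averaged precision and recall, where counts are pooled over all test taxonomies before dividing. This is a minimal sketch under the assumption that the discard rules above have already been applied to both sides; the data structures, function names and toy synonyms ("s1", "s2", ...) are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Enrichment = Dict[str, Set[str]]  # concept code -> synonyms


def counts(enriched: Enrichment, gold: Enrichment) -> Tuple[int, int, int]:
    """Return (correct, proposed, expected) for one test taxonomy."""
    correct = sum(len(enriched.get(code, set()) & syns) for code, syns in gold.items())
    proposed = sum(len(syns) for syns in enriched.values())
    expected = sum(len(syns) for syns in gold.values())
    return correct, proposed, expected


def micro_average(
    results: Iterable[Tuple[Enrichment, Enrichment]],
) -> Tuple[float, float]:
    """Pool the counts over all (t_enriched, t_gold) pairs, then divide."""
    correct = proposed = expected = 0
    for enriched, gold in results:
        c, p, e = counts(enriched, gold)
        correct, proposed, expected = correct + c, proposed + p, expected + e
    precision = correct / proposed if proposed else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall


# Two toy leave-one-out rounds with hypothetical synonyms.
round_1 = (
    {"PGR": {"s1", "s2"}, "ACT": {"s9"}},        # t_enriched
    {"PGR": {"s1", "s2", "s3"}, "ACT": {"s4"}},  # t_gold
)
round_2 = ({"X": {"a"}}, {"X": {"a", "b"}})
precision, recall = micro_average([round_1, round_2])
print(precision, recall)  # 0.75 0.5
```

Micro-averaging weights every synonym equally, so large taxonomies dominate the score; this is the "instance-pivoted" view used in the plots.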
[Figure: Evaluation of string-similarity metrics (instance-pivoted). Micro-averaged precision (y-axis) against micro-averaged recall (x-axis) for jaccard (token), dice (token), cosine (token), q-grams, levenshtein, jaro, jaro-winkler and needleman-wunsch.]
◦ Jaro/Jaro-Winkler best in the 92-97% precision region
◦ Q-Grams best in the <85% precision region
Induction of Linguistic Knowledge group (ILK, Tilburg University)
A modular system integrating a tagger, lemmatizer, and morphological segmenter based on TiMBL and MBT
Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (2007) An efficient memory-based morphosyntactic tagger and parser for Dutch. In: Selected Papers of the 17th CLIN Meeting, Leuven, Belgium.
Original string | Lemma(s) | Morphological tokens
manager ingenieursbureau (electrotechniek) | manager ingenieursbureau electrotechniek | manager Ingenieur bureau electro technisch
onderhoudstimmerlieden | onderhoudstimmerman | onderhoud timmerman
poli-assistenten bloedafname | poli assistent bloedafname | poli assisteer bloed neeme
postkamermedewerker(s) | postkamermedewerker | post kamer mede werk
rechtsbijstandsjurist | rechtsbijstandsjurist | recht bijstand jurist
[Figure: Evaluation of Q-Grams metric (instance-pivoted) with linguistic preprocessing. Micro-averaged precision (y-axis) against micro-averaged recall (x-axis) for q-grams, q-grams (lemma) and q-grams (morph).]
[Figure: Evaluation of Cosine metric (instance-pivoted) with linguistic preprocessing. Micro-averaged precision (y-axis) against micro-averaged recall (x-axis) for cosine, cosine (lemma) and cosine (morph).]
F-measure is not suitable for threshold optimization: high precision is essential, because a faulty attractor can attract many new errors in step 2
Optimize the threshold on ‘maximum coverage precision’: macro-averaged, concept-pivoted precision in which concepts without an attractor count as 0
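The ‘maximum coverage precision’ criterion can be sketched as follows: precision is computed per concept and averaged over all target concepts, with a concept that received no attractor contributing 0. The function name and the toy data are assumptions for illustration only.

```python
from typing import Dict, Iterable, Set


def max_coverage_precision(
    concepts: Iterable[str],
    attractors: Dict[str, Set[str]],
    gold: Dict[str, Set[str]],
) -> float:
    """Macro-averaged, concept-pivoted precision of the selected attractors.

    A concept without any attractor adds 0 to the sum but still counts in
    the denominator, so the score rewards both precision and coverage.
    """
    codes = list(concepts)
    if not codes:
        return 0.0
    total = 0.0
    for code in codes:
        proposed = attractors.get(code, set())
        if proposed:
            total += len(proposed & gold.get(code, set())) / len(proposed)
    return total / len(codes)


concepts = ["PGR", "ACT", "PRJ"]
gold = {
    "PGR": {"programmeur", "c++ programmeur"},
    "ACT": {"accountant"},
    "PRJ": {"project manager"},
}
# A strict threshold: perfectly precise, but PRJ gets no attractor.
strict = {"PGR": {"programmeur"}, "ACT": {"accountant"}}
# A looser threshold: PRJ is covered, but PGR picks up a wrong attractor.
loose = {
    "PGR": {"programmeur", "theaterprogrammeur"},
    "ACT": {"accountant"},
    "PRJ": {"project manager"},
}
print(max_coverage_precision(concepts, strict, gold))  # (1 + 1 + 0) / 3
print(max_coverage_precision(concepts, loose, gold))   # (0.5 + 1 + 1) / 3
```

A threshold search would evaluate this score for each candidate threshold on the development taxonomies and keep the best one.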
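[Figure: example meaning vectors over the taxonomies T1, T2, ..., Tn (for example A Q X - and A Q - X), with per-position contributions of +1 and +0.5.]

\[
\mathrm{Overlap} = \frac{\sum_{i=1}^{n} \mathrm{Sim}(x_i, y_i)}{n}
\]

where either

\[
\mathrm{Sim}(x_i, y_i) =
\begin{cases}
0.0 & \text{if } x_i \neq y_i \\
1.0 & \text{if } x_i = y_i \\
0.5 & \text{if } x_i = \omega \lor y_i = \omega
\end{cases}
\]

or

\[
\mathrm{Sim}(x_i, y_i) =
\begin{cases}
0.0 & \text{if } x_i \neq y_i \\
1.0 & \text{if } x_i = y_i
\end{cases}
\]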
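The overlap formula can be sketched in a few lines. Two points are assumptions of this sketch rather than statements from the slides: ω is taken to be the empty cell ("-", represented here as `None`), and the ω case is checked before the equality test, so a position where either value is missing never scores 1.0. The parameter `missing_score` switches between the two Sim variants.

```python
from typing import Optional, Sequence

MISSING = None  # assumption: stands for the empty cell ("-") in the matrix


def overlap(
    x: Sequence[Optional[str]],
    y: Sequence[Optional[str]],
    missing_score: float = 0.0,
) -> float:
    """Average per-position similarity of two meaning vectors.

    missing_score=0.0 gives the two-case Sim (only equal codes score);
    missing_score=0.5 gives the three-case Sim (a missing value scores 0.5).
    """
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    total = 0.0
    for xi, yi in zip(x, y):
        if xi is MISSING or yi is MISSING:
            total += missing_score  # assumed precedence of the missing case
        elif xi == yi:
            total += 1.0
        # differing codes add 0.0
    return total / len(x)


x = ["A", "Q", "X", None]
y = ["A", "Q", None, "X"]
print(overlap(x, y))                     # (1 + 1 + 0 + 0) / 4 = 0.5
print(overlap(x, y, missing_score=0.5))  # (1 + 1 + 0.5 + 0.5) / 4 = 0.75
```

With `missing_score=0.5`, an unknown position is treated as weak evidence instead of a mismatch, which raises scores for sparsely populated vectors.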
[Figure: Step 1: string-similarity metrics @ optimal threshold; Step 2: default overlap metric. Micro-averaged precision (y-axis) against micro-averaged recall (x-axis) for Q-Grams, Levenshtein, Jaro, Jaro-Winkler and Cosine.]
[Figure: Evaluation of semantic similarity metrics (instance-pivoted). Micro-averaged precision (y-axis) against micro-averaged recall (x-axis) for Baseline (Jaro-Winkler), L1, L2, Q-Grams+Overlap, Overlap (sim) and Overlap (sim+diff); the plot is annotated with the values 26.2 and 49.1.]
◦ Sparseness: matching vectors of length 1 -> score=1.0
[Figure: Evaluation of semantic similarity metrics (instance-pivoted). Micro-averaged precision (y-axis) against micro-averaged recall (x-axis) for Baseline (Jaro), Overlap (sim) and Overlap (sim+diff); the plot is annotated with the values 25.0 and 50.1.]
Step 1 metric | Step 2 metric | Max F0.5 | Precision | Recall | Recall @ P=90% | Recall @ P=95%
Jaro-Winkler | Overlap (sim+diff) | 75.4 | 83.5 | 56.0 | 51.5 | 34.2
Jaro-Winkler | Overlap (sim) | 75.1 | 85.4 | 55.8 | 51.2 | 30.7
Jaro | Overlap (sim+diff) | 75.7 | 84.3 | 55.7 | 51.2 | 38.2
Jaro | Overlap (sim) | 74.9 | 86.0 | 54.7 | 51.2 | 34.1
18 -
Sparseness can be a problem:
Vrouwenarts gynaecoloog
attractor (...)
26 -
Wijkverpleegkundige
15 -
Gezondheid / (Para)medisch / Zorg drogist bejaardenhelpenden begrafenisondernemer verpleegk. mbo v verpleegkundige mbov assistent ziekenverzorger wijkverpleegkundige gezinstherapeut klinisch chemisch analisten vrouwenarts gynaecoloog huisarts (...)
CVG_12 Medisch/verzorging verpleegkundige ugor verpleegkundige wijkverpleegkundige - apothekersassistent - apotheker - gynaecoloog - (...)
WIJKVERPLEEGKUNDIGEN WIJKVERPLEGER
Weighting the similarity scores by the granularity of each taxonomy in the matrix might help
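One way to picture granularity weighting is a weighted average in which a match in a fine-grained taxonomy counts for more than a match in a coarse one. The slides do not specify the weighting scheme, so the weights below (a hypothetical number of concepts per taxonomy) and the function name are assumptions made purely for illustration.

```python
from typing import Optional, Sequence


def weighted_overlap(
    x: Sequence[Optional[str]],
    y: Sequence[Optional[str]],
    weights: Sequence[float],
) -> float:
    """Overlap in which position i contributes weights[i] instead of 1.

    Only positions where both vectors hold the same code score; the result
    is normalized by the total weight, so it stays in [0, 1].
    """
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("vectors and weights must have equal length")
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    matched = sum(
        w for xi, yi, w in zip(x, y, weights) if xi is not None and xi == yi
    )
    return matched / total_weight


# Hypothetical granularity: number of concepts in each of three taxonomies.
weights = [10, 200, 50]
# Two instances that share a concept only in the coarse first taxonomy.
x = ["15", "G7", "M2"]
y = ["15", "W3", None]
print(weighted_overlap(x, y, [1, 1, 1]))  # unweighted: 1 of 3 positions match
print(weighted_overlap(x, y, weights))    # weighted: 10 / 260
```

Under these assumed weights, sharing a very broad concept (such as a general health-care category) barely raises the score, which is the effect the sparseness example calls for.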
[Figure: Evaluation of weighted overlap (step 1: Jaro-Winkler). Micro-averaged precision (y-axis) against micro-averaged recall (x-axis) for Overlap (sim) and Weighted Overlap (sim).]
A novel method for weakly supervised, fully automatic enrichment of taxonomies:
1. high-precision, string-similarity-based attractor selection
2. linking of remaining synonym candidates to attractors by comparing meaning vectors
Results: automatic population of a new taxonomy with synonyms. Precision > 90%, recall > 55%
Step 2 doubles recall of step 1 at precision > 90%
This will save at least 55% of our manual work (probably a lot more, but this remains to be tested)