Automatic Codebook Acquisition

Paper prepared for the workshop Methods and Techniques: Innovations and Applications in Political Science, Politicologenetmaal 2005 (Antwerp, 19-20 May)
Wouter van Atteveldt
[email protected]
Department of Communication Science & Department of Artificial Intelligence
Free University Amsterdam
Contents

1 Introduction
2 Word lists in Content Analysis
   2.1 Word List Creation
   2.2 Latent Semantic Analysis
   2.3 Synonym extraction using word distance
3 Methodology
   3.1 Term Extraction
   3.2 Latent Semantic Analysis
   3.3 Clustering
4 Domain and Corpus: Dutch political news
   4.1 Corpora
   4.2 Term list: A "silver standard"
5 Results
   5.1 Term extraction
   5.2 Synonym extraction
6 Summary and Discussion
   6.1 Future work
   6.2 Possibilities and Limitations
References
Appendix A: Stoplist
Appendix B: Technical Details
Appendix C: Synonym lists
1 Introduction
Content Analysis is often necessary for researching the effects of or patterns in political communication, for example in election studies and agenda-setting research (Kleinnijenhuis et al. 2003; Bryant and Zillman 2002; McCombs and Shaw 1972). Classification or coding is a crucial step in the execution of most Content Analyses (Krippendorff 2004; Holsti 1969). In this step, a textual unit (word, clause, sentence, paragraph, or document) is assigned a category from a fixed or open list of accepted categories; examples of such categories are frames, actors, issues, or emotions. Later analysis is then performed on these categories rather than on the texts themselves. For the repeatability and accuracy of this coding, it is crucial that this categorisation scheme (codebook or dictionary) is exhaustive and unambiguous. This is true for both manual and computer-driven content analysis, but especially for the latter the categories need to be very well defined, as the computer has to assign them to the textual units without true understanding of the text. For both types of analysis the creation of an exhaustive categorisation scheme or entity list is critical, since computers cannot add missing entities to the list, and if human coders add entities during coding this leads to more interpretation by the coders, and thus to lower repeatability.

Researchers are relatively good at deciding whether a given entity should belong to the entity list. Thinking of all possible entities that could be relevant for a domain, however, is more difficult. The same holds for thinking of the different ways an entity can materialize in a text, for example which words or synonyms are indicative of that entity. For example, Tony Blair could be referred to as 'Tony', 'Mr. Blair', 'the Labour leader', 'No. 10', 'Cherie's husband', 'Sedgefield's MP', et cetera.

This paper proposes a method to alleviate both these problems by having the computer automatically suggest relevant concepts/entities and synonyms for these entities, based on relative word counts and co-occurrence patterns. The method is tested by comparing its output to the actual entity lists used for four Dutch election campaign studies.

Section 2 provides a short overview of the relevant literature in the fields of Communication Science and corpus-based Natural Language Processing, ending with an overview of the method that we propose. Section 3 then describes the technical details and choices made in implementing this method. Section 4 introduces the target domain and the corpora that
were used for the evaluation, and in section 5 the results of this evaluation will be presented. The last section will briefly discuss the limitations and possibilities of this technique.
2 Word lists in Content Analysis
Many instances of computer content analysis use some form of word list. There are many word lists for determining relations or emotions, such as Osgood's original list of evaluative terms, the lists in the General Inquirer, and the emotional word lists of Pennebaker (Osgood et al. 1967; Stone 1997; Stone et al. 1962; Pennebaker et al. 2001). Programs that automatically detect the mention of actors or issues in texts, such as Profiler+, KEDS/TABARI, and the General Inquirer, also implicitly use word lists by having certain words act as anchors or triggers that signal the presence of one of the actors under investigation (Young 2001; Schrodt 2001). Moreover, non-automated content analyses also use a form of word list in their codebooks or coding instructions. Although these need not exhaustively list all synonyms of a word, thanks to the linguistic ability of the human coders, it is still of vital importance that these instructions be as complete and detailed as possible. The exact content of a codebook can be seen as the researcher's context as meant by Krippendorff (2004).

The word lists that have received the most attention in scientific research are those that measure specific evaluations, affective value, or emotions, such as the lists mentioned above. Although these lists are generally created manually, research has also been done into the automatic creation and evaluation of such word lists (Bestgen 2002; Kamps and Marx 2002; Turney 2002; van Atteveldt et al. 2004). These word lists can be seen as defining the evaluative categories by extension.

For many analyses, however, the researcher is not solely interested in which emotions or evaluations occur, but rather in what actors or issues are being evaluated. Thus, references to these actors or issues, here collectively called entities, also need to be determined. This is also a coding or classification problem. In traditional content analysis, these entities are defined by intension and the task of finding the referents is left to human coders. In automatic analysis, such definitions need to be extensional, listing all possible synonyms for these entities.
This study focusses on the latter type of word lists: synonym lists for automatically discovering references to entities in a text. The results, however, are also applicable to human annotation, as the entity or category descriptions need to be as unambiguous as possible. Moreover, by suggesting clusters of important terms, the results can also help in creating an exhaustive coding scheme, since humans can easily overlook certain entities that could be important, especially if the coding scheme is very extensive.
2.1 Word List Creation
As argued above, acquiring an exhaustive list of important terms and their descriptions or synonyms is of vital importance for meaningful content analysis. There are a number of ways to obtain such a list, which can be roughly categorized as follows:

Hand-crafted Probably the most common source for word lists is the Content Analysis researcher. Using his or her own linguistic ability, the researcher thinks of the words that will probably refer to the concepts under investigation. The main disadvantage of this approach is caused by the enormous variety of language: any list a person can think of beforehand will be incomplete with respect to an actual corpus (errors of omission). This is analogous to the finding in Natural Language Processing that hand-crafted rule sets will always miss many of the rarer cases (Manning and Schütze 1999; Holsti 1969). However, when such lists are used extensively and improved through this use, as with the lists in the General Inquirer and other standardized lists, this problem becomes less severe, as the list becomes an actual reflection of the bodies of text on which it is used. But this limits the use of these lists to general and relatively static topics, which makes them less suited for finding political actors or issues, which are usually dynamic and specific in nature.

Lexicographic sources Another approach is to use the prior efforts of lexicographers. In most languages, machine-readable dictionaries and thesauri are available as a source of synonyms and hypernyms; for example, Roget's Thesaurus for English or Brouwers for Dutch (Brouwers 1989; Kirkpatrick 1998). From these sources, term lists can be enriched by looking up all synonyms and hyponyms, and in theory they can be used as a source for suggesting new terms by exploiting the hierarchical nature of thesauri.
However, since the synonyms in such sources generally have no frequency information attached, using them has been found to often result in including words that might have the intended meaning but generally mean something else. In fact, these sources generally include many homonyms (words with the same spelling but a different meaning, as opposed to synonyms, which have the same meaning but a different spelling) of the words, leaving the disambiguation to the language faculty of the human user. This results in errors of commission when applied automatically, since the computer cannot easily perform this disambiguation (van Atteveldt et al. 2004). Additionally, because these lists are based on the hard labour of specialized professionals, they are limited in size (for example, the Brouwers thesaurus contains about 300,000 words divided into one thousand categories), infrequently updated, and general in domain. Although the latter seems an advantage, an equivalently sized word list that is specific to the domain under investigation will both provide better coverage and suffer fewer homonymy problems.

WordNet Although strictly speaking WordNet (Miller 1990; Miller 1995) can be considered a thesaurus, it has been used often enough to warrant specific mention. Especially for determining the evaluative or affective charge of modifiers, WordNet has proven a very useful resource (Kamps and Marx 2002; Moldovan and Rus 2001). Moreover, since WordNet contains very detailed relations, it is possible to devise metrics for the semantic distance between words (Banerjee and Pedersen 2003; Church and Hanks 1989), similar to the distance metrics using co-occurrence patterns described below. However, this approach has two drawbacks: although WordNet is available in other languages (Vossen 1999), these WordNets contain far fewer terms than the original English WordNet and are less suited as a model of language use. Moreover, the problem of applying general lexicographic sources to very specific domains applies to WordNet as well.
2.2 Latent Semantic Analysis
Latent Semantic Analysis (Landauer and Dumais 1997; Deerwester et al. 1990) is a method that leverages co-occurrence statistics to estimate the semantic distance between words from their co-occurrence with other words. Technically, the underlying dimensionality of the document-term matrix (the table containing the frequency of each word in each document) is reduced by applying Singular Value Decomposition, a generalized form of Factor Analysis. Thus, new dimensions or factors are formed based on the (co-)occurrence patterns of each word in the documents. Then, the original term-document matrix is recreated as a projection of the underlying (lower-dimensional) space onto the original space. Landauer and Dumais (1997) found that lexical acquisition using a cosine distance metric on a corpus reduced by LSA has characteristics similar to human lexical acquisition.
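A minimal sketch of this procedure in Python with numpy may help; the toy matrix, the term names, and the choice of two dimensions are purely illustrative, and the actual toolchain used in this study is described in Appendix B:

import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (raw counts).
terms = ["minister", "kabinet", "voetbal", "doelpunt"]
M = np.array([[2, 3, 0, 0, 1],
              [1, 2, 0, 1, 0],
              [0, 0, 3, 2, 0],
              [0, 0, 2, 3, 0]], dtype=float)

# Decompose and keep only the top m singular values (400 in this study).
U, s, Vt = np.linalg.svd(M, full_matrices=False)
m = 2
term_coords = U[:, :m] * s[:m]   # term coordinates in the reduced space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words with similar (co-)occurrence patterns end up close together:
print(terms[0], terms[1], cosine(term_coords[0], term_coords[1]))  # high
print(terms[0], terms[2], cosine(term_coords[0], term_coords[2]))  # low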
2.3 Synonym extraction using word distance
Two observations made above are central to the method proposed in this paper: exhaustive codebooks are important for Content Analysis; and humans are good at precision but worse at recall, while the opposite holds for automatic extraction from corpora. Thus, by combining the complementary strengths of humans and computers, it might be possible to arrive at a codebook that is both correct and exhaustive. The method can be outlined as follows: term extraction using frequency statistics; dimensionality reduction using Latent Semantic Analysis to leverage the contextual information of these terms; and clustering based on a distance metric defined on the reduced document-term matrix.
3 Methodology
The process described in this paper consists of three steps: term extraction, latent semantic analysis, and term clustering. These three steps, and the evaluation that was performed to assess the quality of the results, are described below.
3.1 Term Extraction
The extraction of the relevant terms was done by measuring which terms occur most often in the set of documents of interest (the target corpus) as compared to a more or less general set of documents of the same type, the reference corpus. To extract these terms, for each word the χ² value of its frequency in the target corpus as compared to the reference corpus was determined. Candidate terms were words that occurred at least 5 times, did not occur on a stop list for Dutch (see Appendix A), and occurred significantly more often in the target corpus than in the reference corpus (i.e., χ² > 6.75, p = 0.01).

Note that in this study, terms are taken to be single words. It is relatively straightforward to extend this to multiword terms (n-grams) by determining the χ² values of these n-grams. It can also be determined whether these n-grams occur significantly more often than would be predicted from the underlying word frequencies (Manning and Schütze 1999).
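As an illustration of the extraction step described above, the following is a minimal sketch in Python; it assumes pre-tokenized corpora, the function and variable names are mine, and the thresholds are the ones given above:

from collections import Counter

def chi2_terms(target_tokens, reference_tokens, stoplist=frozenset(),
               min_freq=5, threshold=6.75):
    """Words significantly overrepresented in the target corpus."""
    tf, rf = Counter(target_tokens), Counter(reference_tokens)
    n_t, n_r = sum(tf.values()), sum(rf.values())
    scored = []
    for word, a in tf.items():
        if a < min_freq or word in stoplist:
            continue
        b = rf.get(word, 0)
        # 2x2 contingency table: (word, other words) x (target, reference)
        c, d = n_t - a, n_r - b
        n = a + b + c + d
        chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
        if chi2 > threshold and a / n_t > b / n_r:  # overrepresented in target
            scored.append((chi2, word))
    return sorted(scored, reverse=True)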
3.2 Latent Semantic Analysis
As described in section 2.2, Latent Semantic Analysis (LSA) is a method that leverages co-occurrence statistics to estimate the semantic distance between words from their co-occurrence patterns with other words. This section describes the choices made in applying LSA to the corpus, in particular concerning term and document selection, units of analysis, and weighting.

Sampling Although in principle all words and documents can be used for this procedure, a selection was made for reasons of computability. To create a table containing the counts of the words in the different documents, called a term-document matrix, a selection can be applied to both the columns and the rows of the hypothetical full table. For the documents, a random subset of 100,000 documents (17%) was taken from the target corpus. For the terms, the 2,500 terms with the highest χ² value were used. Moreover, an additional 50,000 terms were used to form the 'co-occurrence context'. The reason for this is that although a term such as 'deciding' is not a political actor, the occurrence of this term can help differentiate between a minister and an action group. Thus, although we are not interested in the actual distance between the target terms and the contextual terms, these terms can help determine the distance between the target terms. These contextual terms were selected based on their total frequency multiplied by their information content (see also Appendix B), a compromise meant to exclude both infrequent and uninformative words.

Units of analysis In creating this term-document matrix, there are also two units of analysis: what constitutes a term, and what constitutes a document. In this study, terms are defined as single words, and documents are full articles. Alternative choices that are very interesting for future investigation are multi-word terms in addition to single words, and smaller textual contexts, effectively treating paragraphs or sentences as separate documents.

Weighting Although the raw word frequencies can be used in the document-term matrix, it has been found that results are greatly improved by using a combination of local weighting, global weighting, and normalization. In this study, local weighting was performed using the logarithm of the raw counts in the document, which means that the weight increase of adding another occurrence of a word to a document decreases as the word already occurs more often. The global weight of a term was defined as the conditional entropy of the documents given the term frequency, divided by the unconditional entropy of the documents. This global weight is an indication of the information content of the term, and will be zero if all documents contain the word, as the term then gives no information about the documents it occurs in. Finally, the term scores in each document were normalized such that they sum to one, giving equal weight to each article regardless of its length. These choices are fairly standard in applying LSA and are the choices that proved most effective in Nakov et al. (2001); see Appendix B for a more detailed description of the weights used, and the sketch at the end of this section for an illustration.

Dimensions Analogously to deciding the number of factors in a Factor Analysis, one of the decisions to make in Latent Semantic Analysis is the dimensionality of the underlying 'semantic' space. Previous research points to an optimum somewhere around 300-400 dimensions (Landauer and Dumais 1997; Lee, Pincombe, and Welsh, submitted). For this study 400 dimensions were used.
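The following is a minimal sketch of this log-entropy weighting for a dense count matrix with documents as rows and terms as columns; Appendix B gives the formal definitions:

import numpy as np

def log_entropy_weight(F):
    """Local log weight x global entropy weight, normalized per document."""
    F = np.asarray(F, dtype=float)
    local = np.log2(1.0 + F)                       # local weight per cell
    col_sums = F.sum(axis=0)                       # total frequency per term
    p = np.divide(F, col_sums, out=np.zeros_like(F), where=col_sums > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log2(p), 0.0)
    # 1 + (sum_d p log2 p) / log2 |D|: 0 for uninformative terms, up to 1.
    glob = 1.0 + plogp.sum(axis=0) / np.log2(F.shape[0])
    W = local * glob
    row_sums = W.sum(axis=1, keepdims=True)        # normalize each document
    return np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)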
3.3 Clustering
To generate a list of synonyms per concept or entity, possibly overlapping clusters were made by including the closest n words to each entity. Closeness here is defined as the Euclidean distance between the document frequency vectors representing the words (the columns of the term-document matrix). These choices were mainly made for practical reasons. In particular, using cosine distance instead of Euclidean distance may prove a better approximation to human language processing (as found by Landauer and Dumais (1997)) and might avoid some spurious clusterings based on total word frequency rather than actual usage patterns, as Euclidean distance is not normalised for length. Additionally, it might be preferable to use a form of clustering that minimizes total intra-cluster distance rather than only the distance to the original term (the centroid). This can help avoid problems where a word with less frequent homonyms is also assigned the synonyms referring to these homonyms rather than to the intended concept.
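A minimal sketch of this nearest-neighbour step, for a matrix of (reduced) term vectors, is given below; both the Euclidean distance used here and the cosine alternative discussed above are included, and all names are illustrative:

import numpy as np

def nearest_terms(term_vectors, terms, target, n=25, metric="euclidean"):
    """Return the n terms closest to `target`, with their distances."""
    X = np.asarray(term_vectors, dtype=float)       # rows = term vectors
    t = X[terms.index(target)]
    if metric == "euclidean":
        dist = np.linalg.norm(X - t, axis=1)
    else:                                           # cosine distance
        norms = np.linalg.norm(X, axis=1) * np.linalg.norm(t)
        dist = 1.0 - (X @ t) / np.where(norms > 0, norms, 1.0)
    order = np.argsort(dist)
    return [(terms[i], float(dist[i])) for i in order if terms[i] != target][:n]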
4 Domain and Corpus: Dutch political news

4.1 Corpora
The domain of this study is Dutch politics outside of election time. The corpora were gathered for an investigation of the news coverage of the different ministries conducted on behalf of the Rijksvoorlichtingsdienst (Service of Public Information), the service responsible for the public relations of the government. Both corpora contain newspaper articles published in the five Dutch national newspapers and a number of regional daily newspapers in the period 2003-2004. As explained in section 3.1, terms were extracted based on their relative frequency in a target corpus compared to a reference corpus. The target corpus used in this study consists of the articles about the main actors: all articles that mention one of the ministers or ministries under investigation (600,000 in total). The reference corpus (approximately 1,200,000 articles) was selected based on specific issues that the Service was interested in, so it contains no direct selection on actors.
4.2 Term list: A "silver standard"
Part of this research was an automated measure of the attention paid to the included actors and issues. For this purpose, a hierarchical entity list of terms and synonyms was created. Although this list suffers from all the problems of hand-crafted lists described above, it was produced in a number of iterations over the duration of the project and is partly based on similar lists that have been developed since the 1994 Dutch parliamentary elections (Kleinnijenhuis et al. 2003, and references therein). Thus, although I expect this list to miss many of the important terms used in the target corpus and to represent a specific view on the domain, it can still serve as a basis for comparison and should not contain too many errors of commission. The way this list has been used will be described in more detail in section 5 below.
To relate performance to characteristics of the entities, the entities were categorized in three ways: political versus societal actors; abstract concepts, concrete institutions, and persons; and important versus unimportant actors. The latter distinction is a subjective ranking that was jointly determined by two coders, with as guideline that all cabinet members, party leaders, and well-known societal actors are important. Since this distinction is purely exploratory, the lack of a formal evaluation of this ranking is not of great importance. Table 1 lists these categories and the number of entities in each of them.

                            Concept   Institution   Person   Total
Non-political  Unimportant       20           300        2     322
               Important         29           144        7     180
Political      Unimportant        1             0      230     231
               Important          1            31       51      83
Total                            51           475      290     816

Table 1: Quantitative description of the 'silver standard' term list
5 Results

5.1 Term extraction
Before synonyms can be extracted for the concepts or entities in the list, the terms themselves need to be extracted. Two central statistics for the evaluation of an extraction process are recall and precision, which are standard measures in Information Extraction (Pazienza 2003). Recall is the percentage of correct entities that were found, making it a measure of the errors of omission. Precision is the percentage of found terms that actually correspond to an entity in the list, measuring the errors of commission.

Recall To determine recall, the extracted terms were compared to the 'silver standard' term list defined in section 4.2. Although we do not expect this list to be complete, we do assume that all terms in the list are correct. Thus, the recall on the terms in the list is a good indicator of the actual recall. For technical reasons, the term list was here limited to 65,535 terms. Table 2 below shows the recall for the categories defined in Table 1.

                         Concept   Institution   Person   Total
Non-political  Unimp.        20%           36%        -     35%
               Imp.          41%           50%        -     49%
Political      Unimp.          -             -      84%     84%
               Imp.            -           90%      96%     93%
Total                        33%           44%      86%     58%

Table 2: Recall of the method (cells with n<25 omitted, total n=816)

Although the average recall of 58% is not fantastic, it is a reasonable score. Moreover, the recall for political actors is around 90%, going up to 96% for important persons. The general tendency seems to be that the more concrete, important, and political a concept is, the better the method performs. This is not surprising: concrete concepts are more likely to correspond well to words; important concepts are less affected by data scarcity, which is always problematic in corpus-based NLP methods; and the selection criteria for the documents in the target corpus were political terms, making it more likely for these terms to occur often in this corpus.

Apart from whether an entity had a corresponding term in the χ² list, it is interesting to know at what position in the list it occurred, since the higher up the list the relevant terms are, the shorter the list that needs to be considered. Moreover, since it would be interesting if the matching of entities and terms could be done automatically, it was tested how often the match could be determined by a simple heuristic. This heuristic picks, in order of preference, a direct match of the whole entity, a direct match of one of the words in the entity description, or the closest match to one of the words; a sketch of this heuristic is given below. Table 3 then contains these two additional scores.
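A minimal sketch of this matching heuristic follows; the text does not specify the string-closeness measure used in the third step, so Python's difflib is used here as a stand-in:

import difflib

def match_entity(entity, extracted_terms):
    """Link an entity description to one of the extracted terms."""
    entity = entity.lower()
    if entity in extracted_terms:                    # 1. whole-entity match
        return entity
    words = entity.split()
    for word in words:                               # 2. single-word match
        if word in extracted_terms:
            return word
    best, best_score = None, 0.0                     # 3. closest string match
    for word in words:
        for cand in difflib.get_close_matches(word, extracted_terms, n=3):
            score = difflib.SequenceMatcher(None, word, cand).ratio()
            if score > best_score:
                best, best_score = cand, score
    return best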
                      Concept         Institution      Person          Total
                   Av.Pos   Acc.   Av.Pos   Acc.   Av.Pos   Acc.   Av.Pos   Acc.
Non-pol.  Unimp.    3,310    75%    5,497    86%        -      -    5,332    86%
          Imp.      3,962   100%    6,700    65%        -      -    6,030    71%
Political Unimp.        -      -        -      -    9,537    94%    9,517    94%
          Imp.          -      -      648    75%    1,197    78%      978    77%
Total               3,646    88%    5,545    78%    7,800    91%    6,228    85%

Table 3: Average rank of the correct terms in the χ² list and accuracy of the automatic match method (cells with n<25 omitted, total n=816)
As can be seen in the table, the average position of terms in the list is much higher than the arbitrary cutoff point of 2,500 that was chosen earlier. This means that many concepts will not appear on this shorter list. The main exception are the important political actors, which have an average position below one thousand, in line with the recall results. Curiously, the unimportant political actors were generally lower on the list than the non-political actors, and the more concrete terms were actually lower on the list than the abstract terms.

The automatic matching heuristic performed well, averaging around 90% for persons and abstract concepts and around 80% for institutions. It is interesting that important entities were more difficult to match than unimportant entities, for both persons and institutions. This seems mainly due to three artefacts in the data: relatively obscure acronyms for the ministries; the use of prefixes in the names of ministers and secretaries but not of MPs; and the listing of 'Royal' or 'Society for' in front of some important non-political institutions.

Precision As defined above, precision measures the percentage of extracted terms that actually correspond to entities. Since we cannot assume that the silver standard is complete as well as correct, a term not corresponding to an entity in the list does not imply that the term is not relevant: the entity could be missing from the term list. Thus, the precision measured on the silver standard is a lower bound on the precision of the method. To get an upper bound, a 10% sample of the first 10,000 terms was reviewed by two domain experts, who judged whether these terms are relevant for constructing an entity list of the domain. Since many of these words are duplicates with slight variations in spelling or conjugation, this is only an upper bound. The true precision of the method will be somewhere between these two figures.

A common graph in Information Extraction is the precision/recall curve. It shows the drop in precision as the threshold is lowered to obtain higher recall. In this case, as the arbitrary cutoff point is lowered, recall increases at the price of lower precision. Figure 1 shows this curve for the two bounds on precision, with each point indicating a thousand-position increase of the cutoff point; a sketch of this computation is given at the end of this subsection.

[Figure: precision (y-axis, 0-70%) against recall (x-axis, 0-40%); two curves, labelled Silver Standard and Expert.]

Figure 1: Precision/Recall curves

As we can see, the lower bound on precision is quite low: around 13%. Moreover, to get acceptable recall, this precision has to be lowered to around 5%. The upper bound is much better: around 60% of the first 1,000 terms were judged relevant by the human experts, dropping to slightly below 30% at recall scores of around 40%. The F-score, the harmonic mean of precision and recall (F = 2PR/(P + R)), is around 0.3 regardless of the cutoff point (not shown in the graph).

Which means... The above results for precision and recall, which indicate an upper bound on the F-score of around 0.3, mean that the method is not yet suited as a fully automatic way to extract an entity list from the target corpus. However, the method can certainly be useful for aiding a researcher in constructing this list: by looking at the first couple of thousand terms, the researcher can quickly see whether important concepts were omitted. As people are generally better at filtering out irrelevant terms than at thinking of all relevant concepts, while the computer reaches a recall of over 90% for certain categories, this can prove a very good combination.
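A minimal sketch of how the silver-standard points of such a curve can be computed follows; the expert-based upper bound requires manual judgments and is not shown, and all names are illustrative:

def precision_recall_points(ranked_terms, gold_entities, step=1000):
    """Precision and recall against a gold list at increasing cutoffs."""
    gold = set(gold_entities)
    points = []
    for cutoff in range(step, len(ranked_terms) + 1, step):
        found = set(ranked_terms[:cutoff]) & gold
        points.append((cutoff, len(found) / cutoff, len(found) / len(gold)))
    return points   # list of (cutoff, precision, recall) tuples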
5.2 Synonym extraction
The second and third steps of the process, dimensionality reduction and clustering, result in a list of candidate-synonyms per entity. These synonyms are the words that were closest to the word that was matched to the entity in the first step. To get a clean evaluation of this step, a random sample was taken of those entities for which a term was found among the first 2,500 terms. Two domain experts determined the correctness of the first 25 candidate-synonyms for each of these entities. Three categories were used: direct synonyms ('Prime Minister' for the prime minister; such multiword terms are single words in Dutch), indicators ('Party leader' for the leader of Labour), and irrelevant words ('Powell' for the Dutch Ministry of Foreign Affairs).

The raw results of this evaluation can be seen in Figure 2. In this figure, each row represents one entity, and the different synonyms are shown according to their distance from the entity as determined by the clustering. The rows are sorted by actor type and importance. Although this figure is difficult to interpret, some tentative conclusions can be drawn. The correct synonyms are not all clustered together, and the different synonyms for one term are spread fairly continuously in distance. Thus, it does not seem sensible to define a cutoff point for the synonyms based on the measured distance.

To get a better overview of the performance of the method per entity category, two aggregate scores are presented in Table 4 below: the average number of synonyms and indicators in the first 25 candidates, and the average position of these relevant terms. The latter score is an indicator of the precision of the method, while the former indicates whether enough synonyms were found to make the method useful. The recall of the method is very difficult to determine since we have no good standard for comparison. It would be very interesting to compare the found synonyms to manually annotated texts (which are available for the election studies), but this is left as future work.

On average, 1.4 direct synonyms and 2.6 indicator words were found among the first 25 candidate-synonyms for a term. As the average position in a list of 25 items is 13, the average positions of 7 and 11 for these words indicate that the distance is a useful measure of closeness. The method scores poorly on persons; this is presumably because not many synonyms for a person exist apart from his or her name and function. For institutions and abstract concepts, more synonyms are found.
[Figure: per-entity dot plot in two panels, Political Actors and Other Actors, each split into important (imp) and unimportant (unimp) entities; x-axis: distance to concept; point types: Error, Indicator, Synonym.]

Figure 2: Results of the synonym generation; rows represent entities, points candidate-synonyms. Example: the top row is Geert Wilders, for whom the first candidate is a direct synonym (geert), followed at some distance by a cluster containing two indicators (liberale and liberaal) and a lot of noise. See also Appendix C.
                     Concept                 Institution             Person                  Total
               Av.Pos      Amnt        Av.Pos      Amnt        Av.Pos      Amnt        Av.Pos      Amnt
               Syn   Ind   Syn   Ind   Syn   Ind   Syn   Ind   Syn   Ind   Syn   Ind   Syn   Ind   Syn   Ind
NP     Unimp.    -     -     -     -   6.5  11.9   1.0   6.7     -     -     -     -   5.9  11.9   1.1   5.9
       Imp.    9.1  14.3   5.3   4.4   7.4   8.9   1.6   2.8     -     -     -     -   8.0  10.3   2.2   2.9
Pol    Unimp.    -     -     -     -     -     -     -     -   7.0   8.8   0.8   1.0   7.0   8.8   0.8   1.0
       Imp.      -     -     -     -   7.5  11.5   0.9   1.7   4.6   9.7   1.1   2.0   5.8  10.4   1.0   1.8
Total          8.7  13.9   4.4   3.9   7.3  10.4   1.3   3.1   5.1   9.5   0.9   1.6   7.2  10.7   1.4   2.6

Table 4: Average position (Av.Pos) of the relevant terms among the candidate-synonyms and the number (Amnt) of relevant terms in the first 25 candidate-synonyms, split into direct synonyms (Syn) and indicators (Ind); cells with n<5 omitted, total n=104
Manual inspection of these synonym lists (see http://www.cs.vu.nl/~wva/papers/pol2005 for the raw lists) revealed some interesting results:

• Homonymy. Even though the domain is limited, homonymy still forms a problem, although a much smaller one than would otherwise be expected. For example, 'As' is both an MP and the Dutch translation of 'Axis [of evil]', resulting in many Iraq-related synonyms; and 'Camp' and 'Kamp' are an MP and a minister as well as words for military encampments, the first being part of the Dutch 'Camp Smitty' in Iraq and the latter being the Dutch word for camp. Named Entity Recognition and POS tagging might well avoid many of these errors, as the main problem seems to be proper names being confused with nouns.

• Sports. Even in a corpus selected using political terms, many articles are about sports, especially football. To aggravate things, these sports words are so densely clustered that any entity that could be a sports term has many very close candidate-synonyms from these sports articles. Examples include 'advocaat' (lawyer / football coach), 'az' (Ministry for General Affairs / football team), and 'Dekker' (Minister of Housing / cyclist). A better selection procedure (possibly using negative terms as well as positive ones) might avoid these problems.

• Abstract concepts. Although the matching and recall of step one perform much better on very concrete words, the synonyms generated for abstract concepts include very useful indicators. For example, 'Judicial Power' yields trial, court, trials, courts, jurisdiction, proceedings, evidence, law, inquiry, adjudication: all in all 10 direct synonyms and 12 indicators in the first 25 words (this is the lowest important non-political actor in Figure 2).
6 Summary and Discussion
This paper proposes a method to aid the researcher in two steps that are performed in most content analyses: the creation of the list of categories or entities that the researcher wants to count, and the definition of these terms in the codebook or synonym list.

For the first step, our method can suggest terms with a recall of 80%-90%, with the higher figure applying especially to concrete political actors. The precision of the method is lower, with an upper bound of around 50%. This means that the method is not suited as a fully automatic way to generate entities, since the result would contain too much noise. On the other hand, the high recall means that the method can be very useful in helping the researcher prevent errors of omission.

For the second step, our method can suggest lists of candidate-synonyms. Among the first 25 candidates, there are on average 1.4 direct synonyms and 2.6 words that indicate the presence of the entity. In contrast to the first step, performance is best on general or abstract terms, although this might be because there simply are not many synonyms for a person.
6.1 Future work
Some additions to this method are fairly straightforward and can be expected to improve results at low cost. Lemmatizing will reduce the number of word forms and thereby reduce both computational complexity and data scarcity problems. POS tagging, and selecting only nouns and proper names as candidate terms and synonyms, can also help reduce computational complexity; in addition, it can be expected to solve certain homonymy problems and thereby increase precision. Finally, filtering out documents containing sports terms would tackle the specific problem mentioned in section 5.2.
Another good improvement might be dealing with multiword terms. The most straightforward way to do this is presumably collocation detection based on individual and joint frequencies, combined with preprocessing that replaces the collocations by single tokens (such as HouseOfCommons); a possible sketch is given below.
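The text only suggests frequency-based detection in general, so the particular scoring used in this sketch (pointwise mutual information with a frequency floor) is an assumption:

import math
from collections import Counter

def merge_collocations(tokens, min_freq=10, min_pmi=3.0):
    """Detect strong bigrams and merge them into single tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    # PMI with a frequency floor (an assumed scoring choice, not the paper's)
    strong = {bg for bg, f in bigrams.items() if f >= min_freq and
              math.log2(f * n / (unigrams[bg[0]] * unigrams[bg[1]])) >= min_pmi}
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in strong:
            merged.append(tokens[i] + "_" + tokens[i + 1])  # e.g. tweede_kamer
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged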
6.2 Possibilities and Limitations
Although this method can aid any content analysis for which enough documents are available, its main strength may well lie in the very quick analysis of relatively broad terms. Concepts such as 'Judicial Power', 'European Politics', 'Democracy', and 'Freedom' seem to yield very useful synonyms.

This method can also work to enhance word lists such as those in the General Inquirer (Stone et al. 1962) or the word lists of Pennebaker et al. (2001). The main problem with such word lists is that it is difficult to attain high recall; allowing the computer to generate synonyms and then removing the irrelevant ones might greatly increase recall without damaging precision.
References

Banerjee, S. and T. Pedersen (2003). Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pp. 805-810.

Bestgen, Y. (2002). Détermination de la valence affective de termes dans de grands corpus de textes. In Actes du Colloque International sur la Fouille de Texte CIFT'02, Nancy. INRIA.

Brouwers, L. (1989). Het juiste woord: Standaard betekeniswoordenboek der Nederlandse taal (7th edition; ed. F. Claes). Antwerpen: Standaard Uitgeverij.

Bryant, J. and D. Zillman (Eds.) (2002). Media Effects: Advances in Theory and Research. Mahwah, NJ: Lawrence Erlbaum.

Church, K. W. and P. Hanks (1989). Word association norms, mutual information, and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pp. 76-83.

Deerwester, S. C., S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), 391-407.

Holsti, O. (1969). Content Analysis for the Social Sciences and Humanities. Reading, MA: Addison-Wesley.

Kamps, J. and M. Marx (2002). Words with attitude. In Proceedings of the First International WordNet Conference.

Kirkpatrick, B. (1998). Roget's Thesaurus of English Words and Phrases. Harmondsworth, England: Penguin.

Kleinnijenhuis, J., D. Oegema, J. de Ridder, A. van Hoof, and R. Vliegenthart (2003). De puinhopen in het nieuws, Volume 22 of Communicatie Dossier. Alphen aan den Rijn (Netherlands): Kluwer.

Krippendorff, K. (2004). Content Analysis: An Introduction to Its Methodology (second edition). Sage Publications.

Landauer, T. K. and S. T. Dumais (1997). A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review 104, 211-240.

Landauer, T. K., P. W. Foltz, and D. Laham (1998). Introduction to latent semantic analysis. Discourse Processes 25, 259-284.

Lee, M., B. M. Pincombe, and M. Welsh. An empirical evaluation of models of text document similarity. Submitted manuscript.

Manning, C. and H. Schütze (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

McCombs, M. E. and D. L. Shaw (1972). The agenda-setting function of mass media. Public Opinion Quarterly 36, 176-187.

Miller, G. (1990). WordNet: An on-line lexical database. International Journal of Lexicography (Special Issue) 3, 235-312.

Miller, G. (1995). WordNet: A lexical database for English. Communications of the ACM 38(11), 39-41.

Moldovan, D. I. and V. Rus (2001). Logic form transformation of WordNet and its applicability to question answering. In Meeting of the Association for Computational Linguistics, pp. 394-401.

Nakov, P., A. Popova, and P. Mateev (2001). Weight functions impact on LSA performance. In Proceedings of the EuroConference Recent Advances in Natural Language Processing (RANLP'01), pp. 187-193.

Osgood, C. E., G. J. Suci, and P. H. Tannenbaum (1967). The Measurement of Meaning. Urbana, IL: University of Illinois Press.

Pazienza, M. T. (Ed.) (2003). Information Extraction in the Web Era: Natural Language Communication for Knowledge Acquisition and Intelligent Information Agents. Springer.

Pennebaker, J. W., M. E. Francis, and R. J. Booth (2001). Linguistic Inquiry and Word Count. Mahwah, NJ: Lawrence Erlbaum Associates.

Schrodt, P. (2001). Automated coding of international event data using sparse parsing techniques. In Annual Meeting of the International Studies Association, Chicago.

Stone, P. (1997). Thematic text analysis: New agendas for analyzing text content. In C. Roberts (Ed.), Text Analysis for the Social Sciences. Mahwah, NJ: Lawrence Erlbaum Associates.

Stone, P., R. Bales, J. Namenwirth, and D. Ogilvie (1962). The General Inquirer: A computer system for content analysis and retrieval based on the sentence as a unit of information. Behavioral Science 7.

Turney, P. D. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02), pp. 417-424.

van Atteveldt, W., D. Oegema, E. van Zijl, I. Vermeulen, and J. Kleinnijenhuis (2004). Extraction of semantic information: New models and old thesauri. In Proceedings of the RC33 Conference on Social Science Methodology, Amsterdam.

Vossen, P. (Ed.) (1999). EuroWordNet: A Multilingual Database with Lexical Semantic Networks for European Languages. Dordrecht: Kluwer.

Young, M. D. (2001). Building worldviews with Profiler+. In M. D. West (Ed.), Applications of Computer Content Analysis, Volume 17 of Progress in Communication Sciences. Ablex Publishing.
Appendix A: Stoplist

The following words were excluded from the analysis:

zich betreffende van jullie binnen vanuit klaar boven veeleer konden bovendien vervolgens kunnen bovenstaand volgens later buiten vooraf maar daarheen vooralsnog meer daarna voordat mezelf daarom voordien mijn daarvanlangs voorop mijner dat vrij misschien die waar mochten dit wanneer moesten doorgaand waren moeten echter
een eer weer naar eerder wegens net eerst weldra noch elke welke nogal enig wiens of enkel wij om erdoor zal omhoog eveneens zelfs omstreeks gauw zij omver geen zijne ondertussen gekund zodra ons gelijk zou onze gemogen zowat op gewoon zullen opzij haar
we had aangaande overigens hare achter precies hebben afgelopen rond heeft aldaar sedert hen alhoewel sindsdien hierbeneden alle sommige hij alleen steeds hoewel altijd tenzij hunne ander thans ikzelf anders toch inmiddels behalve toenmaals is beide tot jij ben tussen jou bent uitgezonderd jouwe
je juist bij vandaan kan binnenin vanwege kon bovenal verder krachtens bovengenoemd vol kunt bovenvermeld voor liever daar vooral mag daarin voorbij met daarnet voordezen mij daarop voorheen mijnent dan vooruit mijzelf de vroeg mocht dikwijls waarom moest door want moet dus was mogen
deze na eerdat weg nadat eerlang wel niet elk welk nog en wie nu enigszins wier ofschoon er wijzelf omdat even ze omlaag evenwel zichzelf omtrent gedurende zijn onder gehad zo ongeveer geleden zonder onszelf gemoeten zouden ook geweest zulke opnieuw gewoonweg zult over
aan overeind hadden aangezien pas heb achterna reeds hebt al rondom hem aldus sinds het alias slechts hierboven allebei spoedig hoe alsnog tamelijk hun altoos terwijl ik andere tijdens in anderszins toen inzake behoudens toenmalig jezelf beiden totdat jijzelf beneden uit jouw bepaald vaak
Appendix B: Technical Details

Latent Semantic Analysis As described in section 2.2, LSA (Landauer and Dumais 1997; Deerwester et al. 1990) is a method for analysing the semantics of words from their usage patterns. Technically, LSA is the application of the dimensionality reduction method Singular Value Decomposition (SVD) to a matrix containing the frequencies of the terms in the different documents. SVD decomposes a t × d term-document matrix M into three matrices such that M = U · S · V^T, where U is a t × n matrix mapping the terms in M to the underlying 'factors', S is an n × n diagonal matrix containing the singular values of the original matrix, and V^T is an n × d matrix mapping the documents to these dimensions. Any matrix can be losslessly decomposed into these three matrices. The dimensionality reduction is then performed by keeping only the top m ≪ n singular values, reducing S to an m × m matrix, and likewise for the other matrices. The resulting matrix M′ is the least-squares rank-m approximation of the original matrix. See Landauer et al. (1998) for a very good and detailed introduction to this method; a small numerical illustration follows below.
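The following check, using numpy on a random matrix (contents arbitrary, for illustration only; the actual study used SVDLIBC), verifies both the lossless full decomposition and the fact that the error of the rank-m reconstruction equals the energy in the discarded singular values:

import numpy as np

rng = np.random.default_rng(0)
M = rng.poisson(1.0, size=(50, 200)).astype(float)   # 50 terms x 200 docs

U, s, Vt = np.linalg.svd(M, full_matrices=False)
assert np.allclose(M, (U * s) @ Vt)                  # lossless decomposition

m = 10                                               # keep top m dimensions
M_m = (U[:, :m] * s[:m]) @ Vt[:m]                    # best rank-m approximation
# Frobenius error equals the norm of the discarded singular values:
assert np.isclose(np.linalg.norm(M - M_m), np.linalg.norm(s[m:]))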
Weighting Section 3.2 describes the weighting choices made in the current study. These choices are described more formally below. Each cell dt(d, t) in the document-term matrix contains the normalized weighted frequency of term t in document d. The normalization is such that each document (each row in the matrix) sums to one. The weighting is a combination of the global weight G(t) for that term and the local weight L(d, t) for that term in that document:

    dt(d, t) = \frac{L(d, t) \cdot G(t)}{\sum_{t \in T} L(d, t) \cdot G(t)}

The local weight is the base-2 logarithm of one plus the frequency f(d, t) of the term in the document, which reduces the impact of words with relatively high frequencies:

    L(d, t) = \log_2 (1 + f(d, t))

The global weight of a term t is defined as one minus the relative conditional entropy of the documents given that term. It thus reflects the amount of information that knowing the frequency of that term in a document gives about that document. A word such as 'the', which occurs in almost all documents with comparable relative frequency, gives no information about the document and will thus be assigned a very low global weight. The conditional entropy is the standard measure based on p(d|t) \log_2 p(d|t), where the probability of a document given a word is defined as the frequency of that word in that document divided by the total frequency of that word:

    G(t) = 1 - \frac{H(D|t)}{H(D)} = 1 + \frac{\sum_{d \in D} p(d|t) \log_2 p(d|t)}{\log_2 |D|},
    \qquad p(d|t) = \frac{f(d, t)}{\sum_{d \in D} f(d, t)}

Because the conditional entropy is taken relative to the total entropy of the documents, which equals \log_2 of the number of documents, the resulting measure lies between 0 (no information content) and 1 (maximal information content).
Selection of the contextual terms In order to let co-occurrence patterns with other words, as well as the usage patterns of the target terms themselves, determine the semantic distances, a number of contextual terms was included in the Latent Semantic Analysis alongside the target terms. These contextual terms were picked from the total list of terms by taking the terms with the highest information content times total frequency, in other words those with the highest f(t) · G(t), with G(t) as defined above.
Implementation The singular value decomposition of the resulting matrix was computed using the SVDLIBC toolkit written by Doug Rohde (http://tedlab.mit.edu/~dr/SVDLIBC/), which is based on the SVDPACKC library written by Berry and others. The creation of the document-term matrix and its recreation from the SVD decomposition were done with Python scripts using the Numarray toolkit (http://sourceforge.net/projects/numpy). The clustering was done by building a distance matrix using the Pycluster toolkit (http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm) and sorting the list of words per entity on this distance. The two toolkits are freely available, and the Python scripts that combine them are available upon request from the author. All code was executed on *nix machines: the SVD decomposition on a multiprocessor Solaris SPARC machine with 8 GB of internal memory, the clustering and preprocessing on a more run-of-the-mill Linux machine with 1 GB of internal memory. Total execution time was on the order of hours, with much of the self-written code left unoptimized.
Appendix C: Synonym lists

The tables below list the first ten synonyms per entity. The order is the reverse of that in Figure 2.
Non-political actors entity
word synonyms (important non-political actors) albert heijn/ahold heijn ah super edah boer albert supermarktconcern casino nutreco meurs offensief topambtenarendepartement departementenonderzoeken verwijt topambtenaar integriteit bewindslieden concludeerde ambtenaren ambtenaren ambtenaar apothekers apothekers medicinale cannabis knmg buijs kinkhoest mankell woedend perle gda wetswijziging artsen zonder grenze azg dagestan losgeld erkel ontvoerde ismail powells kissinger tbilisi mahendra papandreou belangengroeperingen groepering jordaanse militanten qaida egyptische riyad offensief ahmad hezbollah tikrit israe boeren, agrariërs boeren veerman koeien biologische natuurbeheer kippen gelderse lnv mest pluimveehouders eieren burgemeester / corps burgemeester gekozen burgemeesterscohen benoeming fractievoorzitterbestuurlijke raadsvergadering vergadering raadsleden kandidaten energiewinningsbedri energiebedrijvenez opta laurens karien bouwbedrijven schaduwboekhouding nma clement bewindspersoon heinsbroek europees parlementep europarlementariers europarlement straatsburg buitenen neelie parlementarierseurlings parlementsleden commissievoorzitter parlementarier voorzitterschaponderhandelingen luxemburg top grondwet eurocommissaris regeringsleiders europese unie eu unie lidstaten brussel gedeputeerde staten gedeputeerde statenlid provincies statenfractie maij ordening statenverkiezingen rietkerk gelderse lnv natuurbeheer motie fracties raadsleden leefbaar vergadering aangenomen gekozen namens gemeenteraad gemeenteraad fractievoorzitterraadsvergadering hoge raad raad state grlinks cu raadsvergadering motie raadsleden uitspraak fractievoorzitterdiscussie vergadering humanitaire organisa humanitaire vredesmacht ontwapenen troepenmacht mandaat gezant kofi darfoer darfur diplomaten milities vreemdelingenbeleid vertrekcentra vertrekcentrumdossiers landsadvocaat rita ambassades pardonregelingjaco immigratiedienst ind uitzetting israel israel palestijnse israelische palestijnen sharon arafat jeruzalem gazastrook vrede abbas hamas koninklijke landmachlandmacht vliegbasis baret helikopters luitenant fregatten luchtmacht kolonel majoor defensiepersoneel bevelhebber koninklijke marechau marechaussee mariniers sergeant majoor missies irakees gelegerd afmp afgevuurd kolonel defensiepersoneel koninklijke marine marine landmacht luchtmacht vliegbasis fregatten krijgsmacht helikopters twenthe orion defensiepersoneel veteranen koninkljike luchtmacluchtmacht landmacht vliegbasis marine krijgsmacht helikopters commando commandant luitenant kolonel eenheden landen landen lidstaten unie verdrag italie spanje eu brittannie navo rusland luxemburg laurus laurus konmar edah super albert heijn ah supermarktconcern casino nutreco jonnie leger leger soldaten militaire troepen gedood militair gazastrook arabische aanval strijders jeruzalem media media zender pers journalist symfonie nos commissariaatmco rso eo medy militairen militairen troepen militair militaire soldaten leger missie irakezen afghanistan as rumsfeld navo navo scheffer secretaris jaap bondgenootschap robertson annan missie afghanistan kofi militaire geruimd pluimveebedrijven besmette vervoersverboduitbraak eieren ruimen vogelpestvirus mkz besmetting pluimveehouders pluimveehouders demissionair dusver visserij waddenzee kamercommissie lnv provincie provincies sybilla rietkerk statenfractie statenverkiezingen publieke omroep publieke omroepen orkest omroep radio media hilversum mco symfonie bnn medy raad van state state rechtspraak cassatie juridische tjeenk 
juridisch oordeel aanwijzing civiele unaniem cultuurnota rechter rechter uitspraak geding straf rechtszaak rechters opgelegd politierechter advocaten eis zitting rechterlijke macht rechterlijke strafzaken rechtbanken rechtspraak strafproces bewijsmateriaalvooronderzoekwetboek rechtszaken berechting berechten veiligheidsraadkorea powell militaire resolutie naties verenigde staten vs washington amerikanen vn rusland vn vn veiligheidsraadnaties resolutie powell annan militaire wederopbouw amerikaans washington iran amerikaans wapeninspecteurs naties wederopbouw massavernietigingswapens annan sancties inspecteurs blix vn-veiligheidsraad veiligheidsraad resolutie raadsvergadering fractievoorzitterordening motie vergadering raadsleden discussie oudkerk wethouder wethouder ruimtelijke leefbaar wto wto cancun kyoto fischler embargo wapenembargovetorecht giscard estaing meningsverschillen solana (unimportant non-political actors) bolkestein, frits, e bolkestein eurocommissaris europarlementarier kroes barroso straatsburg buttiglione richtlijn europarlement neelie lidstaat commissievoorzitter europarlement ep europarlementariers buitenen parlementariersprodi socialist buttiglione buttiglione neelie straatsburg europarlementarier straatsburg commissievoorzitter europarlement ep portefeuille kandidatuur kroes, neelie kroes barroso buttiglione neelie (korps) mariniers mariniers kolonel majoor sergeant missies gelegerd manschappen apache afmp camp defensiestaf advocaat advocaat client mr raadsman bewijs bondscoach getuigen advocaten uitspraak davids gerechtshof inspectiedienstruimingen hobbydieren mkz vervoersverbodvogelpestvirus kalkoenen hobbydierhouders hobbyboeren pluimveesector aid aid ambassade ambassade ambassadeur diplomaten saudi iraanse diplomaat indonesische teheran osama ambassades thaise clienten client raadsman verklaringen zitting vrijspraak ontkent strafzaak rechtszaak vrijgesproken aanklager ontkende commando's commando manschappen mariniers gelegerd eenheden commandant luitenant kolonel helikopters troepenmacht stabilisatiemacht cuba cuba guantanamo bay vijandelijke bases ashcroft qaida powells hooggeplaatstedissidenten clarke woonwagenkamp xtc huiszoeking arrestaties vinkenslag huiszoekingen documenten strafbaar illegalen illegale inval onderzoekt nam nam hield legde leek bleef bracht wist toonde zette achterstand verloor nma nma bouwbedrijven energiebedrijven schaduwboekhouding bouwfraude opta ez laurens enquetecommissie karien onrechtmatig slachtoffers slachtoffers doden schadevergoeding nabestaanden aanslag getuigen gepleegd gedood autoriteiten madrid misdrijven veroordeelden uitgezeten drugssmokkel thailand thailand thaise uitzitten staatsbezoek machiel doodstraf indonesische kuijt
Political actors entity
word
az az buza buitenlandse bzk binnenlandse christenunie christenunie eerste kamer eerste ez ez groenlinks groenlinks kabinet balkenende kabinet i lnv lnv lpf lpf min financien financien min van defensie defensie min van justitie justitie oppositiepartijen oppositiepartij parlement, tweede ka parlement pvda pvda regering, overheid regering regeringspartijen regeringspartij vrom vrom vvd vvd aartsen, tk fv aartsen balkenende, jan-pete balkenende berg, max van den, b eerg bos, wouter, tk fv bos bot, ben, cda ministbot brinkhorst, laurens- brinkhorst de geus, aart-jan, c geus de graaf, thom, d66graaf de hoop scheffer, jascheffer dekker, sybille, vvd dekker dittrich, boris, tk dittrich donner, piet-hein, c donner halsema, tk fv halsema herben, mat, tk fv herben hirsi ali, tk hirsi kamp, henk, vvd minikamp marijnissen, tk fv marijnissen nicolaï, edzo, staat nicolai nijs, staatssecretar nijs peijs, karla, cda mi peijs remkes, vvd ministerremkes weisglas, tk weisglas zalm, gerrit, vvd mi zalm as, tk as atsma, tk atsma baalen, tk baalen bakker, tk bakker bomhoff, eduard, lpfbomhoff bommel, tk bommel buijs, tk buijs koning koning wilders, vvd, tk wilders
synonyms (important political actors) rbc heerenveen nac nec graafschap roda rkc psv jc eredivisie bot powell vn ambassadeur rusland iran colin washington hoofdstad ambassade remkes aivd veiligheidsdienst inlichtingen ministeries staatsrecht ambtenaren terrorisme terroristische terroristen oppositiepartijen sgp leefbaar halsema peiling lijsttrekker rouvoet maurice femke zetel wedstrijd derde klasse bleef punten speelde rust maakte vierde seizoen opta karien laurens noe comb bewindspersoon mtr quay res delfia fractievoorzittermotie lijsttrekker femke christenunie leefbaar zetels fracties halsema sgp vice ministerraad leider regeerakkoordfinancien gpd ii premier kok bronnen voedselkwaliteitganzen mest ruimingen hobbydieren pluimveesectormkz natuurbeheer hobbyboeren visserij fortuyn pim herben sgp eerdmans cu nawijn zetels grlinks christenunie gerrit stabiliteitspact wijn afm eichel vice pact trichet eurocommissaris correspondent militairen militaire militair knaap leger rumsfeld troepen afghanistan krijgsmacht soldaten officier verdachte donner straf gevangenisstrafveroordeeld celstraf eiste advocaat verdacht coalitiepartijen coalitiepartnerscoalitiepartner coalitiegenotenobrero frente volkspartij moties verkiezingscampagne regeringspartij europees verkiezingen kandidaat barroso europarlementarier bolkestein kroes kandidaten eurocommissaris straatsburg groenlinks grlinks fractievoorzittercu christenunie sgp fracties coalitie zetels verkiezingen onderhandelingen democratischenationale oppositie macht troepen hoofdstad verklaarde militaire vn obrero frente fol oppositieleider kamerfracties openlijk volksvertegenwoordiging coalitiepartner coalitiepartnersboris inspectie milieu ruimtelijke ordening volkshuisvesting geel provincies state nota illegale fractievoorzittergrlinks liberalen cu sgp fracties coalitie aartsen christenunie zetels fractieleider liberale jozias liberaal geert liberalen dijkstal uitspraken partijgenoot dittrich premier leider beatrix koningin voorzitterschappeter rijksvoorlichtingsdienst kok formatie dittrich martijn marcel dennis quick be jeroen mark bergh marco hoofdklasse wouter leider formatie lijsttrekker fractieleider verhagen kok halsema dittrich voorman ambassadeur colin ambassade diplomaat thailand mensenrechtenbetrekkingen regeringsleiders powell ontmoeting energiebedrijven gennip nma laurens ez bewindsman opta bouwbedrijven karien schaduwboekhouding bewindsman gestuurd apothekers demissionair robin regeerakkoordprinsjesdag oppositiepartijen regeringspartijen agt bestuurlijke antillen burgemeestersthom koninkrijksrelaties gekozen godett curacao mirna anthony jaap robertson bondgenootschap navo ovse annan kofi diplomaat diplomatieke ambassadeurs erik ronde thomas boogerd renner renners michael tijdrit tour volkshuisvesting verhagen boris regeringspartijen voorman maxime jozias regeerakkoordpartijleider regeringspartij kamerdebat hein gedetineerden terrorisme tbs casino vervolgd misdrijven bevoegdhedenadvocaten informateur oppositiepartijen linkse rosenmoller vos kamerverkiezingen duyvendak kamerdebat vlies femke rouvoet mat eerdmans nawijn hammerstein partijbestuur hilbrand marten belder freeke joost ayaan ali mohammed aartsen wilders geert politica bedreigingen theo uitspraken militairen afghanistan henk woonwagenkamp missie militaire militair troepen isaf marine lazrak eenmansfractievelzen kamervoorzitterstandpunten verkiezingscampagne boris maxime partijleider melkert atzo conventie solana lidstaat zuydewijn vetorecht giscard estaing bourbon karien collegegeld 
groenendaal zesde gewonnen belg hoof sven wellens richard demissionair verkeersminister puntenrijbewijs karla waterstaat haegen schultz hofstra betuwelijn bewindsvrouw netelenbos binnenlandse aivd veiligheidsdienst inlichtingen ambtenaren ministeries voordracht voorgedragen kuiper cohen kamervoorzittereenmansfractiepresidium binnenhof kamerfractie voorkeurstemmen thom politica spoeddebat boris regeerakkoordfractieleider voorman financien gerrit vice peper stabiliteitspact dittrich aartsen (unimportant political actors) samawah gelegerd muthanna camp mariniers manschappen irakees omgekomen sergeant basra ormel buijs rijpstra buma haersma fessem cat aanvoering sorgdrager parlementair rijpstra oplaat luchtenveld griffith kamerfracties buijs ormel orions korvetten haersma joost bart quick bram mulder voskamp pol jongh martijn ham heinsbroek sorgdrager rowi luns quay bewindspersonen stv rijnsb gda foreholte sneijder bouma heitinga ooijer meyde bronckhorst bosvelt vaart zenden nistelrooy mtiliga niemeyer mosselveld lopes nieuwstadt teixeira zuurman ormel lindenbergh ax kroonprins claus monarchie laurentien oranjes vorstin johan koningshuis staatsbezoek zorreguieta geert liberale liberaal ayaan aartsen bedreigingen eenmansfractiebaalen uitspraken kamervoorzitter