
CHAPTER 4 EVALUATION

4.1 System Specification

For testing purposes, the algorithm was implemented in PHP (5.3.0) with a MySQL database (5.5.8) and an Apache web server, in a localhost environment. The experiment application is a text box that accepts one Indonesian word, and a button to process the given word. The test machine is equipped with an Intel Core i5-3317U (1.7 GHz, 4 CPUs) and 4 GB of RAM.

4.2 Testing

The test data was collected manually from Kompas.com, one of the biggest news companies in Indonesia. The 25 articles collected were published online by Kompas.com between 1 November 2012 and 15 January 2013, and were distributed evenly across 10 news categories. Before lemmatization, the articles were parsed to satisfy these conditions:

1. The minimum length of a tested word is 4 characters.
2. Numbers and special characters are removed, leaving only letters and hyphens (‘-’).
3. The data is supplied as one word per lemmatization process.
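The three parsing conditions above can be sketched as follows. This is a minimal illustration in Python, not the thesis's PHP parser; the function name `parse_words`, the lowercasing step, and the trimming of leading and trailing hyphens are assumptions made for the example.

```python
import re

MIN_LENGTH = 4  # condition 1: minimum length of a tested word


def parse_words(text):
    """Turn raw article text into a list of words ready for lemmatization."""
    words = []
    # Lowercasing is an assumption; the sample tables show lowercase words.
    for token in text.lower().split():
        # Condition 2: drop everything except letters and hyphens.
        # Trimming stray hyphens at the edges is an assumption of this sketch.
        cleaned = re.sub(r"[^a-z-]", "", token).strip("-")
        # Condition 1: skip words shorter than the minimum length.
        if len(cleaned) >= MIN_LENGTH:
            # Condition 3: each list item is one word for one lemmatization run.
            words.append(cleaned)
    return words


print(parse_words("Anak-anak itu membeli 25 buku, lalu pulang."))
```

Each element of the returned list corresponds to one row that would be stored for a single lemmatization run.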


The parsed data contains 64,372 words with an average of 6.68 characters per word, of which 10,579 are unique. The data is stored in a MySQL table to ease the testing process. The tests are performed on the following categories:

1. All / Combined category

Table 4.1 Sample test case from All / Combined Category

Full | Unique
terjadi | rosmaniar
terutama | diduga
meluruh | membuktikan
rusia | mohammad
yang | batas
dihebohkan | diskriminasi
adalah | berharap
malaysia | menegaskan
peningkatan | juli
malaysia | disimpan
mikro | biji
akan | gangguan
bangsa | kuncinya
desa | ujang
persen | smyczek
nomor | langsung
desa | pirang
merebut | ditemukan

This is the combined version of all categories. ‘Full’ means that the test data is not filtered for duplicate words, and ‘Unique’ means that the test data contains no duplicate words. 18 entries were taken at random as the sample test cases displayed above.

2. Business category

Table 4.2 Sample test case from Business Category

Full | Unique
terjadi | nantinya
kondisi | jasa
izin | minggu
terakhir | pembangunan
untuk | konsumer
daya | menanggung
keperluan | kecil
inklusif | eksekutif
untuk | subsidi
membidik | dibanding
usaha | umkm
bidang | kelas
memperkaya | bermotor
termasuk | mereka
bermotor | dinner
tiga | muntahiyah
cara | prospek
memberikan | layanan

The Business category discusses the development of Indonesia’s economy, fiscal and monetary affairs, stock market development, business analyses, important figures in the Indonesian business world, business planning, and opportunities.

3. Regional category

Table 4.3 Sample test case for Regional Category

Full | Unique
timur | terkait
surat | hakim
juga | syech
surat | nomor
jalan | rangkaian
pemberhentian | masih
disampaikan | diajukan
kekaguman | sore
desember | waiwerang
tetapi | berkah
rekomendasi | islami
untuk | dicampur
jalan | effendi
garut | anak-anak
jawabannya | melanggar
ruas | berupa
bupati | setelah
tanah | mengangkut

The Regional category mostly discusses key events that happened in the highlighted regions, such as Sumatra, Jawa, Kalimantan, and Eastern Indonesia. The news content is mixed; it may cover crime, politics, or social events.

4. Education category

Table 4.4 Sample test case for Education Category

Full | Unique
dalam | pada
yang | terbuka
transisi | melaksanakan
sadar | undang-undang
rusak | meminta
menjadikan | mewujud
bayar | waktu
anak | berpesta
baru | seluruh
rupiah | meniadakan
oleh | dibatalkan
pihak | kurikulum
namun | ekstra
sorotan | memang
jelas | memliki
dalam | masih
mila | triliun
sekolah | baik

The Education category contains news about education development in Indonesia, including important figures in Indonesia’s education world, scholarships, study guides, the education agenda, and related information.

5. Science category

Table 4.5 Sample test case for Science Category

Full | Unique
institut | spesies
memantau | laut
diketahui | pusat
penduduk | kelautan
tidak | serius
selatan | ilmu
besar | besar
hijau | perilaku
hujan | jagielonian
rambut | menghargai
melebihi | korban
rekannya | untuk
struktur | berkumpul
bmkg | branicki
data | ekor
membuat | lancashire
gelombang | pernah
hingga | tiga

The Science category mostly discusses energy conservation, global warming, general science, astronomy, archeology, biology, laboratory tests, scientific phenomena, and other news that can be categorized as science.

6. Sports category

Table 4.6 Sample test case for Sports Category

Full | Unique
digelar | ivan
melakukan | peringkat
sozonov | merebut
menjabat | bertemu
solidarity | roth
hasil | meningkatkan
pekan | abdul
jadi | bulu
dengan | kejuaraan
punggung | rafael
akan | tangkis
adalah | kalinya
mengalahkan | melaju
salah | menjabat
kurniawan | barunya
kualifikasi | untuk
slam | secara
sementara | senin

The Sports category discusses events in Indonesia involving major, popular sports such as racing, tennis, badminton, celebrity sports, and other sports.

7. International category

Table 4.7 Sample test case for International category

Full | Unique
tawaran | jendela
warga | mengatasi
mengirim | inggris
tentang | abou
pengandilan | pakistan
alasan | berniat
tahun | penyelamat
militer | khovanskoye
mengimpor | teroris
palsu | peralatan
tenda-tenda | nyawa
kata | kondisi
udara | mengkaji
oleh | diabaly
rights | silvio
warga | beku
mereka | senapan
angin | entv

The International category mostly discusses important events and issues that occur outside Indonesia. The category also covers important international figures, unique stories, and tragedies.

8. ‘Oasis’ category

Table 4.8 Sample test case for Oasis category

Full | Unique
yang | ternama
kota | mesti
keresidenan | tergenang
tanah | menghadirkan
dengan | tawanan
relatif | khususnya
menjadi | memorakporandakan
tinutuan | press
tambah | ngeri
hostel | bertubi-tubi
mampu | risiko
kepala | terumbu
film | tepung
andalan | esthy
kami | menyengat
yoseph | nilai
tungka | larangan
meningkat | namanya

The Oasis category provides complementary and entertainment news from Indonesia. This category covers poetry, short stories, novels, serial stories, and the arts.

9. Megapolitan category

Table 4.9 Sample test case for Megapolitan category

Full | Unique
biasanya | pencarian
bukan | calon
keberadaan | desa-desa
tidak | kegiatan
fatmawati | tambang
juga | di-mark
matinya | nilai
diperbolehkan | bernomor
anak | supriyati
kepada | penyelewengan
masalah | keluarga
menegaskan | salah
balik | sisi
tanggul | pinggir
banjir | pantas
sekarang | lakukan
untuk | berpendapat
jokowi | permohonan

The Megapolitan category mostly discusses lifestyle, crime stories, urban life, and important events and issues in the major cities of Indonesia. News about trending celebrities is mostly posted here.

10. Travel category

Table 4.10 Sample test case for Travel category

Full | Unique
hidup | melahirkan
agar | fotografer
selamanya | hhhh
berbau | lendang
menjadi | konservasi
lady | membelenggu
dilestarikan | hal-hal
tadi | denting
bebas | afrika
serius | rata
semua | menimbulkan
ratri | seolah
pelaku | karakter
brazil | orang-orang
setiap | menulis
penitipan | pelak
orangorang | pikul
keringat | karana

The Travel category highlights news about traveling. Some of its subcategories are travel stories, food stories, travel tips, hotel stories, and stories in which Indonesians take part in international travel events.

11. National category

Table 4.11 Sample test case for National category

Full | Unique
pembunuhan | hrwg
setelah | citra
larangan | prasarana
senilai | terdiri
aceh | digambarkan
diminta | amir
korupsinya | hudep
sebagai | berwarna
menuturkan | maka
sudah | mengalir
minustah | meliputi
setelah | urus
juga | tentang
tahu | mempercepat
agung | kubu
bathoegana | tanya-tanya
hujan | widjojanto
partai | tengah-tengah

The National category provides news about events in Indonesia in particular. The news content is mixed; articles may discuss politics, the economy, business, crime stories, and issues about celebrities.

4.2.1 Validity

In analyzing the test data, several constraints determine which lemmatization results are considered a success, which are considered an error, and which cases are outside the current algorithm’s scope. A lemmatization is considered successful if the correct lemma is produced from the input word. Cases in which an incorrect lemma is produced fall into the error category. Out-of-scope cases are considered invalid; they are counted neither as failures nor as successes. The out-of-scope cases are:

1. Proper nouns and abbreviations, which include the names of people, areas, or companies (Microsoft, Bandung, PT.KAI, etc.). These are out of scope because they do not exist in the dictionary.
2. Foreign words, meaning words that are not Indonesian. As in point 1, they are not listed in the Indonesian dictionary.
3. Infixes, which are affixes inserted inside a word, for example the infix ‘-er-’ in ‘gigi’, which produces ‘gerigi’. Words that contain an infix are already listed as lemmas; therefore infix removal is not supported by this algorithm.
4. Non-standard words and affixes, meaning words that are not defined in the Indonesian dictionary, or slang words and affixes. A few examples are ‘nyambung’, ‘gue’, and ‘pikirin’ with its ‘-in’ suffix.

Lemmatization errors can be classified into three categories:

1. Overlemmatized: This term is similar to overstemming. Affix removal is performed too extensively, so that the produced lemma is not as expected. For example, in ECS’s case of overstemming, ‘penyidikan’ becomes ‘sidi’, where the correct lemma is ‘sidik’.
2. Underlemmatized: This term is similar to understemming. Affix removal is performed too little, so that the produced lemma is not as expected. In ECS’s case, ‘mengalami’ becomes ‘alami’, where the correct lemma is ‘alam’.
3. Incorrect rule: The affix is removed incorrectly because of an ineffective or incorrect rule. For example, ‘mengatakan’ may become ‘katak’ by removing the ‘-an’ suffix and the ‘meng-’ prefix.

4.2.2 Process

The test algorithm fetches all parsed data and processes the words one by one. The results are saved in a separate table. When the input word is immediately returned as a lemma, the algorithm returns an “input_is_lemma” exception message; this does not affect the result in any way. When the algorithm fails to obtain the lemma from the input word, it returns the original word with the exception message “lemma_not_found”. However, not every result produced with an exception message is classified as an error; the message can also indicate proper nouns and foreign words.

After the results are stored in the database, a manual inspection is done to analyze lemmatization errors. The algorithm itself does not know when it is overlemmatizing or underlemmatizing the input word; whenever it finds a lemma, that lemma is returned as a success. These cases need to be classified manually.

Improvements were made so that the lemmatization algorithm performs better. The most notable improvement over the previous algorithm is lemma phrase checking. Some phrases are considered lemmas, for example ‘tanggung jawab’. When given a confix, such a lemma phrase is joined into one word; for example, ‘tanggung jawab’ with the ‘per- -an’ confix results in ‘pertanggungjawaban’. This case is not handled by the previous stemming algorithm, because the lemma consists of more than one word. Other minor improvements include rule precedence, rule order, and recoding path optimization.
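The two exception messages and the lemma phrase check described above can be illustrated with a small sketch. It is written in Python rather than the PHP used for the experiment, and it is not the thesis's rule set: the tiny dictionary, the names `LEMMAS`, `LEMMA_PHRASES`, and `lemmatize`, and the restriction to the single ‘per- -an’ confix are all assumptions made for the example.

```python
# Toy single-word dictionary (assumed for this example only).
LEMMAS = {"gencar", "usaha"}
# Phrases that count as one lemma.
LEMMA_PHRASES = {"tanggung jawab"}

# Index each phrase by its joined form: "tanggungjawab" -> "tanggung jawab".
JOINED_PHRASES = {phrase.replace(" ", ""): phrase for phrase in LEMMA_PHRASES}


def lemmatize(word):
    """Return (output, message) for one input word."""
    if word in LEMMAS:
        # The input is already a lemma and is returned immediately.
        return word, "input_is_lemma"
    # Only the 'per- -an' confix from the example above is handled here.
    if word.startswith("per") and word.endswith("an"):
        core = word[3:-2]
        if core in JOINED_PHRASES:
            return JOINED_PHRASES[core], None
    # No lemma found: return the original word with the message.
    return word, "lemma_not_found"


for word in ("gencar", "pertanggungjawaban", "perancis"):
    print(word, lemmatize(word))
```

In this sketch ‘perancis’ comes back unchanged with “lemma_not_found”, which shows why such a message alone does not mark a result as an error: the word is a proper noun and falls outside the algorithm's scope.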

4.2.3 Results

A sample result from the testing process is shown below:

Table 4.12 Sample Result from Business Category

input | output | time (s) | Issue
perancis | perancis | 0.00469 | lemma_not_found
gencar | gencar | 0.00184 |
investasi | investasi | 0.00219 |
semakin | semakin | 0.0032 |
bersemangat | semangat | 0.00465 |
untuk | untuk | 0.00339 |
berinvestasi | investasi | 0.01076 |
indonesia | Indonesia | 0.00346 |
terutama | utama | 0.00447 |
dalam | dalam | 0.00308 |
penjualan | jual | 0.00558 |
pelaku | laku | 0.00244 |
usaha | usaha | 0.00139 |
perancis | perancis | 0.00779 |
melihat | lihat | 0.00438 |
perekonomian | ekonomi | 0.00701 |
cukup | cukup | 0.00341 |
baik | baik | 0.00337 |
menjanjikan | janji | 0.0057 | lemma_not_found

These results were captured specifically from the business category, and duplicate entries are ignored. The test results are summarized below by the categories and conditions specified earlier:

Table 4.13 Test Result Summary

Category | Full T | Full V | Full S | Full E | Full P | Unique T | Unique V | Unique S | Unique E | Unique P
Business | 6344 | 5627 | 5550 | 77 | 0.98632 | 1868 | 1580 | 1559 | 21 | 0.98671
Regional | 6470 | 4802 | 5846 | 81 | 0.98313 | 1213 | 1011 | 995 | 16 | 0.98417
Education | 4165 | 5927 | 3598 | 32 | 0.99460 | 868 | 637 | 623 | 14 | 0.97802
Science | 6246 | 5504 | 5398 | 73 | 0.98674 | 874 | 643 | 630 | 13 | 0.97978
Sports | 6231 | 3242 | 5522 | 42 | 0.98705 | 838 | 608 | 604 | 4 | 0.99342
International | 10953 | 3630 | 9917 | 75 | 0.97934 | 2037 | 1593 | 1575 | 18 | 0.98870
Megapolitan | 3998 | 5471 | 3214 | 28 | 0.99488 | 610 | 302 | 297 | 5 | 0.98344
National | 5499 | 5564 | 4764 | 38 | 0.99317 | 559 | 326 | 324 | 2 | 0.99387
Oasis | 6087 | 9992 | 5462 | 42 | 0.99580 | 820 | 528 | 524 | 4 | 0.99242
Travel | 8379 | 7502 | 7457 | 45 | 0.99400 | 892 | 611 | 607 | 4 | 0.99345
All | 64372 | 57261 | 56728 | 533 | 0.99069 | 10579 | 7839 | 7738 | 101 | 0.98712

Where:
T = Total data count
V = Valid test data count
S = Successful lemmatizations
E = Errors / failures
P = Precision, obtained by computing

$$P = \frac{S}{V}$$

The average processing time for each word is 0.00411 seconds. This average time is obtained by calculating:

$$\bar{t} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{time}(i)$$

where n is the total data count in the test result, and time(i) is the processing time taken to lemmatize a word, taken from the ‘time’ column (the table format follows Table 4.12) on the i-th row of the result table.
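As a worked check of the two quantities defined above, the following Python sketch computes precision as S divided by V and the mean of a list of processing times. The S and V counts are taken from the Business, Travel, and All rows of Table 4.13, and the three time values from the first rows of Table 4.12; the function names `precision` and `average_time` are assumptions for this illustration.

```python
def precision(successes, valid):
    """P = S / V, rounded to five decimals as in the summary table."""
    return round(successes / valid, 5)


def average_time(times):
    """Mean processing time: the sum of time(i) over all rows, divided by n."""
    return sum(times) / len(times)


# (S, V) pairs copied from Table 4.13.
rows = {
    "Business (full)": (5550, 5627),
    "Travel (full)": (7457, 7502),
    "All (full)": (56728, 57261),
    "All (unique)": (7738, 7839),
}
for name, (s, v) in rows.items():
    print(name, precision(s, v))

# First three 'time' values from Table 4.12, used only as sample input.
print(average_time([0.00469, 0.00184, 0.00219]))
```

The reported average of 0.00411 seconds comes from all rows of the result table, so the three-value mean here only demonstrates the calculation.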

4.2.4 Errors

Of the 533 failure cases, 421 are overlemmatization cases, 54 are underlemmatization cases, and the remaining 58 are cases of incorrect rule application or rule order. The error analysis is presented below:

Table 4.14 Error Analysis

Issue | Sample Case
Overlemmatized | berupa => upa; kebijakan => bija
Underlemmatized | meminimalisasi (not lemmatized)
Incorrect Lemma | berilmu => beril

‘berupa’ is overlemmatized because, according to affix removal rule 1 (Table 3.1), the default output is produced by always removing the ‘ber-’ prefix first, while ‘be-’ is reserved as a recoding path. Before recoding is performed, the word ‘upa’ (which means present/gift) is found in the dictionary lookup, and thus an overlemmatized result is produced. It is possible to swap the default output with its recoding path, i.e. remove ‘be-’ first and reserve ‘ber-’ as the recoding path, which produces the expected output ‘rupa’. However, swapping the recoding path introduces new errors: ‘berombak’ would be overlemmatized to ‘rombak’, while the expected output is ‘ombak’. The same goes for ‘berubah’, which would be mistakenly lemmatized to ‘rubah’.

‘kebijakan’ is overlemmatized to ‘bija’ because the derivational suffix removal process detects ‘-kan’, not knowing that the ‘k’ is actually part of the lemma, and the word ‘bija’ (which means seed) exists in the dictionary.

‘meminimalisasi’ is underlemmatized because the lemmatization algorithm does not process the ‘-isasi’ suffix; therefore the original input is returned and a “lemma_not_found” error is produced.

‘berilmu’ is incorrectly lemmatized to ‘beril’ because of rule precedence / step execution order. The ‘-mu’ fragment is identified as a possessive pronoun, and the word ‘beril’ (a type of mineral; Be3Al2Si6O18) exists in the dictionary, which produces the error.
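The trade-off between the default output and the recoding path for ‘ber-’ and ‘be-’ can be reproduced with a small sketch. This is a simplified Python illustration, not the rule from Table 3.1: the six-word dictionary and the names `strip_ber`, `DEFAULT_ORDER`, and `SWAPPED_ORDER` are assumptions, and the function simply accepts the first candidate found in the dictionary.

```python
# Toy dictionary containing only the words discussed above (assumed).
DICTIONARY = {"upa", "rupa", "ombak", "rombak", "ubah", "rubah"}

DEFAULT_ORDER = ("ber", "be")   # remove 'ber-' first, keep 'be-' as recoding path
SWAPPED_ORDER = ("be", "ber")   # remove 'be-' first, keep 'ber-' as recoding path


def strip_ber(word, prefix_order):
    """Try each prefix in order; return the first remainder found in the dictionary."""
    for prefix in prefix_order:
        if word.startswith(prefix):
            candidate = word[len(prefix):]
            if candidate in DICTIONARY:
                return candidate
    # Nothing matched: return the input unchanged.
    return word


for word in ("berupa", "berombak", "berubah"):
    print(word, strip_ber(word, DEFAULT_ORDER), strip_ber(word, SWAPPED_ORDER))
```

With the default order only ‘berupa’ goes wrong; with the swapped order ‘berupa’ is correct but both ‘berombak’ and ‘berubah’ are overlemmatized, which matches the reasoning for keeping the default order.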

4.3 Demo Implementation

The algorithm was also implemented in a simple web application, built in PHP with a MySQL database. The purpose of this demo is mainly illustrative: it lets the user enter the desired word directly, without preparing batch files. An input is supplied to the app, and the output is shown as the result.


Figure 4.1 Main Display of Application

This demo application accepts only one Indonesian word; if the input contains characters other than letters or hyphens (‘-’), the word is not processed. When a valid word is supplied to the demo app, it is processed immediately. When the lemmatization process is successful, the output in Figure 4.2 is shown; when the lemmatization process returns an error, such as lemma not found, the output in Figure 4.3 is shown instead.
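The input check described above can be sketched with a regular expression. This is a minimal Python illustration under the assumption that only unaccented Latin letters and hyphens are accepted; the name `is_valid_input` is hypothetical and the demo's actual PHP validation may differ.

```python
import re

# One or more letters or hyphens, and nothing else.
VALID_WORD = re.compile(r"[A-Za-z-]+")


def is_valid_input(text):
    """Return True when the whole input consists of letters and hyphens only."""
    return VALID_WORD.fullmatch(text) is not None


for sample in ("anak-anak", "kata2", "dua kata"):
    print(sample, is_valid_input(sample))
```

Using `fullmatch` rejects inputs with spaces, so a two-word input is refused in the same way as one containing digits.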


Figure 4.2 Example of Successful Output

The output screen also shows the processing time taken to lemmatize the input word, together with a convenience link for lemmatizing another word.

Figure 4.3 Example of Failed Output

When the lemmatization process fails, the processing time is also displayed, to show how much time it takes to process a faulty input. The processing time is the difference between the start time, which is marked right after the input is supplied to the lemmatizer, and the stop time, which is marked when the lemmatizer produces the result.
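The start/stop measurement described above can be sketched as a small wrapper. This Python version is only an illustration of the idea: the name `timed_lemmatize` is hypothetical, and the identity function stands in for the real lemmatizer, which is not reproduced here.

```python
import time


def timed_lemmatize(lemmatize, word):
    """Run the lemmatizer on one word and return (result, elapsed seconds)."""
    start = time.perf_counter()   # start: right after the input is supplied
    result = lemmatize(word)
    stop = time.perf_counter()    # stop: when the result is produced
    return result, stop - start


# Stand-in lemmatizer that returns its input unchanged (assumption).
result, elapsed = timed_lemmatize(lambda w: w, "usaha")
print(result, f"{elapsed:.5f} s")
```

The same wrapper applies to successful and failed runs alike, since the timing does not depend on what the lemmatizer returns.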
