The Comparison of Distance-based Similarity Measures for Plagiarism Detection in Indonesian Text Tari Mardiana, Teguh Bharata Adji, and Indriana Hidayah Department of Electrical Engineering and Information Technology Universitas Gadjah Mada, Yogyakarta-Indonesia
[email protected],
[email protected],
[email protected]
ABSTRACT. Easy access to information through the Internet has led to rapid growth of plagiarism through copy-paste-modify practices. Many methods, algorithms, and even software tools have been developed to prevent and detect plagiarism, applicable broadly and not limited to a particular subject. Research on plagiarism detection in the Indonesian language develops day by day, although not as significantly as for English. This paper evaluates several models of distance-based similarity measure that can be used to assess similarity in Indonesian text: Dice's similarity coefficient, Cosine similarity, and the Jaccard coefficient. They are implemented together with the Rabin-Karp algorithm, which is commonly used to detect plagiarism in Indonesian-language documents. The plagiarism analysis technique is fingerprint analysis: a fingerprint of each document is created according to a predetermined n-gram value, and the similarity value is then computed from the number of matching fingerprints between texts. A small text dataset about Information Systems was tested, divided into four kinds of text document with modifications. The first document is the original text; the second is 50% of the original text combined with 50% of another text; the third is 50% of the original text modified using synonyms and paraphrase; the fourth changes the position of some parts of the original text. The experimental results show that cosine similarity performs better in accuracy than Dice's coefficient and the Jaccard coefficient. This model is expected to serve as an alternative statistical measure for algorithms that use n-grams in their process, especially for detecting plagiarism in Indonesian text.
Keywords: Fingerprint, Indonesian, Plagiarism, Similarity, Text
1
Introduction
Information retrieval through search engines grows rapidly along with the growth of Internet usage in daily life. Freely applying such information sometimes leads to fraud, such as plagiarism. Plagiarism is the act of duplicating part or all of another person's work, especially written text, without properly citing the source. In the academic world, plagiarism commonly appears when students complete assignments, reports, papers, or even theses that should be written in their own words rather than produced by copy-paste-modify. Many students still do not understand the difference between citation and plagiarism, so they do not see the boundary between the two. Plagiarism is difficult to avoid both in academic and daily life, but it can be countered by early detection based on the level of similarity in the text, visible in grammatical structure and writing style. Plagiarism detection with the right similarity measure not only yields reliable information but also reduces cost and processing time. Similarity measure techniques can be divided into three kinds [1]:
· Distance-based similarity measure: a technique that measures the similarity level between two objects from the geometric distance of the variables within those two objects. Examples: Minkowski distance, Manhattan/City block distance, Euclidean distance, Chebyshev distance, Jaccard distance, Dice's coefficient, Cosine similarity, Hamming distance, and Soundex distance.
· Feature-based similarity measure: a technique that measures the similarity level by representing objects in terms of the features to be compared; it is used to classify or pattern-match images and texts.
· Probabilistic-based similarity measure: a technique that measures the similarity level by representing the two compared sets of objects in probability form. Examples: Kullback-Leibler distance and posterior probability.
This paper discusses several statistical measures of similarity level that can be applied in plagiarism analysis. The technique discussed is fingerprint analysis using the Rabin-Karp algorithm.
The objective of this experiment is to serve as a reference for choosing the right statistic to pair with the algorithm and the processes within it, including the use of n-grams in the hash function. This research aims to provide suitable parameter values for subsequent research; it does not discuss processing-time complexity.
2
Plagiarism Detection
In this section, we explain the plagiarism analysis technique and algorithm used in this research, along with related work on similarity measures.
2.1
Rabin-Karp algorithm
Several techniques can be used to analyze plagiarism: substring matching, keyword similarity, and fingerprint analysis [2]. Fingerprint analysis is a very frequently used approach in which documents are divided into sequences called chunks. Patterns in the documents are replaced with hash-function values that represent the strings within them. Each string in the text is transformed by a hash function to create a document fingerprint according to a predetermined n-gram value; the similarity value is then computed from the number of matching fingerprints between texts. Approaches used in this technique include the Rabin-Karp algorithm and Winnowing [3]. The Rabin-Karp algorithm is a multiple-pattern search algorithm that is very efficient for searching many patterns at once [4]. It is often used for plagiarism detection, especially in Indonesian-language text documents [5][6][7]. Its disadvantages are complex computational requirements and a processing time that tends to be long, influenced by the length of the text document to be processed, starting from the pre-processing and stemming stages within it. However, the accuracy produced by this algorithm is high enough to detect plagiarized documents. Rabin-Karp uses a hash function that, in most cases, avoids the complexity of long processing times. The principle of the algorithm is to perform the full examination only when the text window being processed matches the pattern's hash, rather than examining every position of the text during pattern matching [7]. A hash function is a way to create a fingerprint from various data inputs; it replaces or transposes the data to create the fingerprint, called the hash value. A hash value can be described as a short string of letters and numbers that looks random (binary data written in hexadecimal notation). The hash value of a string i with base d is written in general as:
H(i) = i[0]·d^(n−1) + i[1]·d^(n−2) + … + i[n−1]·d^0
(1)
For example, the pattern "luas" with base d = 10 has the hash value (using the ASCII codes of the characters, with 'l' = 108):
Hash(luas) = (108×10³) + (117×10²) + (97×10¹) + (115×10⁰) = 108000 + 11700 + 970 + 115 = 120785
In the Rabin-Karp algorithm, hashing is used for the shifting substring search. If the search string (of length m) equals a substring of the text (of length n), their hash values must be equal too. Efficient calculation of the hash value at each shift determines the performance of this algorithm [8]. Pseudocode of the Rabin-Karp algorithm can be written as follows:
function RabinKarp(string s[1..n], string sub[1..m])
    hsub := hash(sub[1..m])
    hs := hash(s[1..m])
    for i from 1 to n-m+1
        if hs = hsub
            if s[i..i+m-1] = sub
                return i
        if i <= n-m
            hs := hash(s[i+1..i+m])
    return not found
Fig. 1. Pseudocode of the Rabin-Karp algorithm
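A minimal Python sketch of the search in Fig. 1, assuming the base-10 polynomial hash of Eq. (1) over ASCII character codes (the function names and the unmodded integer hash are illustrative choices, not from the paper):

```python
def char_hash(s: str, base: int = 10) -> int:
    """Hash a string as in Eq. (1): H(i) = i[0]*d^(n-1) + ... + i[n-1]*d^0."""
    h = 0
    for ch in s:
        h = h * base + ord(ch)
    return h

def rabin_karp(text: str, pattern: str, base: int = 10) -> int:
    """Return the first index of pattern in text, or -1 if not found."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return -1
    h_pat = char_hash(pattern, base)
    h_txt = char_hash(text[:m], base)
    top = base ** (m - 1)  # weight of the window's leading character
    for i in range(n - m + 1):
        # Compare strings only when the hashes match (cheap filter).
        if h_txt == h_pat and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Rolling update: drop text[i], append text[i+m].
            h_txt = (h_txt - ord(text[i]) * top) * base + ord(text[i + m])
    return -1

print(char_hash("luas"))                      # 120785, as in the example
print(rabin_karp("saya suka kamu", "suka"))   # 5
```

The rolling update is what makes each shift O(1): subtracting the leading character's contribution and appending the next one avoids rehashing the whole window. A production version would reduce the hash modulo a large prime to keep the integers small.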
2.2
Distance-based Similarity Measure
Similarity is a comparison level, expressed as a percentage, between tested documents. The percentage is influenced by the similarity level of the tested documents: the bigger the similarity percentage, the more similar the texts are [5]. A distance-based similarity measure is a technique built to measure the similarity level of two objects from the geometric distance of the variables within the two objects [6]. Many theories assume that nearby objects are more similar than distant ones. The similarity measure (s) can be derived from the distance measure (d) using the equation:
s = 1- d
(2)
The most common dissimilarity computation relies on distance measurement in a metric space. In an information retrieval system, examples of this approach include Minkowski distance, Manhattan/City block distance, Euclidean distance, Chebyshev distance, Jaccard distance, Dice's coefficient, Cosine similarity, Hamming distance, Levenshtein distance, and Soundex distance [6]. Similarity measurement models are used not only to detect similar words but also in chemical informatics [9].
2.2.1 Dice's Coefficient
This is the most commonly used measurement model for calculating a similarity value with an n-gram approach. The Dice coefficient defines the similarity level as twice the number of common entities (nt) divided by the total of the two tested entities (nx + ny) [1]. The result ranges from 0 to 1, where a coefficient of 1 indicates identical vectors and 0 orthogonal vectors [11].
s = 2·nt / (nx + ny)
(3)
Dice's similarity has the same form as the F1-score and is monotonic with Jaccard similarity [9]. Other names for this model are the Czekanowski or Sørensen coefficient. The similarity measurements in [3][6][12] use Dice's coefficient, which simply divides the number of matching hashes by the total number of hashes in the documents.
2.2.2 Cosine Similarity
Also known as the Ochiai coefficient, this is a vector-based similarity measurement model that is popular in text mining and information retrieval. The approach compares strings that have been transformed into a vector space, so that the cosine from the Euclidean principle can be used to measure the similarity level [1]:
cos(θ) = (A · B) / (|A| |B|)
(4)
2.2.3 Jaccard Coefficient
The Jaccard coefficient, also well known as the Tanimoto coefficient, is a very common model in chemical informatics, whether for comparing similarity, dissimilarity, or distance between data sets. Its concept is to use the ratio of the intersection of the sets to their union as the measure of similarity [13]:
J(A, B) = |A ∩ B| / (|A| + |B| − |A ∩ B|)
(5)
where |A| and |B| represent the number of attributes in each object and |A ∩ B| is the intersection set. The benefit of this model is that it can show the close relationship between two data sets efficiently, without redundancy among the data used. It shows high-precision results on systems and databases at a smaller scale than search pages with different services [14].
2.3
Similarity using Fingerprints
The most common similarity measurement is calculated from the document fingerprints. As an example, consider the two texts A and B below:
Text A → Suka itu belum tentu cinta
Text B → Saya suka kamu
Hash values are formed from the texts using the hash function. Let A and B be the fingerprint counts of each text (here A = 21 and B = 11 character bigrams) and C the number of matching hash values between them (here C = 4).
Fig. 2. Forming hash values from text
The simple similarity between Text A and B is calculated with the distance-based similarity models of equations (3), (4), and (5) as follows:
- Dice coefficient: similarity = 2C / (A + B) = 8 / (21 + 11) = 0.25
- Cosine coefficient: similarity = C / √(A × B) = 4 / √(21 × 11) ≈ 0.263
- Jaccard coefficient: similarity = C / (A + B − C) = 4 / (21 + 11 − 4) ≈ 0.143
The advantage of using hash values to search for similarity between two texts is that string matching can be done regardless of position.
3
Experimental Method
To determine the best model for detecting plagiarism based on word-similarity measurement, several process flows and test scenarios are defined. Three models of distance-based similarity measure are implemented as the statistics in the Rabin-Karp algorithm. A small text about Information Systems is tested, divided into four kinds of text document with modifications: the first document is the original text; the second is 50% of the original text combined with 50% of another text; the third is 50% of the original text modified using synonyms; and the fourth changes the position of some parts of the original text. Examples of the texts used are presented below.
a. Original Text
Sistem Informasi (SI) adalah kombinasi dari teknologi informasi dan aktivitas orang yang menggunakan teknologi itu untuk mendukung operasi dan manajemen. Dalam arti yang sangat luas, istilah sistem informasi yang sering digunakan merujuk kepada interaksi antara orang, proses algoritmik, data, dan teknologi. Dalam pengertian ini, istilah ini digunakan untuk merujuk tidak hanya pada penggunaan organisasi teknologi informasi dan komunikasi (TIK), tetapi juga untuk cara di mana orang berinteraksi dengan teknologi ini dalam mendukung proses bisnis. Ada yang membuat perbedaan yang jelas antara sistem informasi, dan komputer sistem TIK, dan proses bisnis. Sistem informasi yang berbeda dari teknologi informasi dalam sistem informasi biasanya terlihat seperti memiliki komponen TIK. Hal ini terutama berkaitan dengan tujuan pemanfaatan teknologi informasi. Sistem informasi juga berbeda dari proses bisnis. Sistem informasi membantu untuk mengontrol kinerja proses bisnis. Sistem informasi adalah suatu sistem di dalam suatu organisasi yang mempertemukan kebutuhan pengolahan transaksi harian, mendukung operasi, bersifat manajerial dan kegiatan strategi dari suatu organisasi dan menyediakan pihak luar tertentu dengan laporan-laporan yang diperlukan.
Fig. 3. Original Text
b. Modification 1: 50% of the original text combined with 50% of another text. Some sentences are added from another text in the same field.
Sistem Informasi (SI) adalah kombinasi dari teknologi informasi dan aktivitas orang yang menggunakan teknologi itu untuk mendukung operasi dan manajemen. Dalam arti yang sangat luas, istilah sistem informasi yang sering digunakan merujuk kepada interaksi antara orang, proses algoritmik, data, dan teknologi. Dalam pengertian ini, istilah ini digunakan untuk merujuk tidak hanya pada penggunaan organisasi teknologi informasi dan komunikasi (TIK), tetapi juga untuk cara di mana orang berinteraksi dengan teknologi ini dalam mendukung proses bisnis. Alter berpendapat untuk sistem informasi sebagai tipe khusus dari sistem kerja. Sistem kerja adalah suatu sistem di mana manusia dan/atau mesin melakukan pekerjaan dengan menggunakan sumber daya untuk memproduksi produk tertentu dan/atau jasa bagi pelanggan. Sistem informasi adalah suatu sistem kerja yang kegiatannya ditujukan untuk pengolahan (menangkap, transmisi, menyimpan, mengambil, memanipulasi dan menampilkan) informasi. Dengan demikian, sistem informasi antar-berhubungan dengan sistem data di satu sisi dan sistem aktivitas di sisi lain. Sistem informasi adalah suatu bentuk komunikasi sistem di mana data yang mewakili dan diproses sebagai bentuk dari memori sosial. Sistem informasi juga dapat dianggap sebagai bahasa semi formal yang mendukung manusia dalam pengambilan keputusan dan tindakan.
Fig. 4. Modified with adding sentences from another text
c. Modification 2: 50% of the original text modified using synonyms and paraphrase. Words in bold are synonym substitutions, such as kombinasi replaced by perpaduan. Sentences in bold italic are paraphrased in the authors' own words.
Sistem Informasi (SI) merupakan perpaduan dari teknologi informasi dan aktivitas manusia yang menggunakan teknologi tersebut untuk mendukung operasi dan manajemen. Istilah sistem informasi dalam pengertian yang lebih luas sering digunakan untuk merujuk kepada interaksi antara orang, proses algoritmik, data, dan teknologi. Pengertian lain sistem informasi adalah suatu sistem dalam sebuah organisasi yang mempertemukan kebutuhan pengolahan transaksi harian, mendukung operasi, bersifat manajerial dan kegiatan strategi dari suatu organisasi serta menyediakan pihak luar informasi tertentu melalui laporan-laporan yang diperlukan.
Fig. 5. Modified with synonyms and paraphrase
d. Modification 3: Change the position of some sentences (50% of the original text) without changing the meaning of the text.
Sistem informasi digunakan untuk merujuk tidak hanya pada penggunaan organisasi teknologi informasi dan komunikasi (TIK), tetapi juga untuk cara di mana orang berinteraksi dengan teknologi ini dalam mendukung proses bisnis. Dalam pengertian ini, istilah ini adalah suatu sistem di dalam suatu organisasi yang mempertemukan kebutuhan pengolahan transaksi harian, mendukung operasi, bersifat manajerial dan kegiatan strategi dari suatu organisasi dan menyediakan pihak luar tertentu dengan laporan-laporan yang diperlukan. Dalam arti yang sangat luas, istilah sistem informasi yang sering digunakan merujuk kepada interaksi antara orang, proses algoritmik, data, dan teknologi. Sistem informasi membantu untuk mengontrol kinerja proses bisnis.
Fig. 6. Modified with change some sentences position
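The fingerprinting step of this setup can be illustrated with a small sketch. This is not the authors' exact pipeline: Python's built-in `hash` stands in for the Rabin-Karp hash values, and the simple lowercase/alphanumeric cleanup is an assumed pre-processing choice.

```python
from collections import Counter

def fingerprints(text: str, n: int = 2) -> Counter:
    """Multiset of hashed character n-grams of the cleaned, lowercased text.

    Python's str hash is salted per process, but it is consistent within
    one run, which is all that comparing two texts requires.
    """
    clean = "".join(ch for ch in text.lower() if ch.isalnum())
    return Counter(hash(clean[i:i + n]) for i in range(len(clean) - n + 1))

def dice_similarity(t1: str, t2: str, n: int = 2) -> float:
    """Dice's coefficient (Eq. 3) over the two fingerprint multisets."""
    f1, f2 = fingerprints(t1, n), fingerprints(t2, n)
    matches = sum((f1 & f2).values())  # shared fingerprints (multiset min)
    return 2 * matches / (sum(f1.values()) + sum(f2.values()))

a = "Sistem informasi membantu untuk mengontrol kinerja proses bisnis."
b = "Sistem informasi membantu mengontrol kinerja bisnis."
print(round(dice_similarity(a, b, n=2), 2))  # high, but below 1.0
```

Because the fingerprints form an unordered multiset, reordering sentences barely changes the score, which is the position independence exploited by Modification 3.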
4
Experimental Results and Analysis
Testing is done by comparing the text files with each other while changing the n-gram value used in the algorithm. The recommended base is 25 or higher, which produces a constant similarity value. The original text is compared with each modified text, and the similarity is measured using the Dice, Jaccard, and cosine coefficients.
Table 1. Comparison of similarity measure results on Indonesian texts (n-gram = 2)

Text       Similarity Measure   Original   Modif 1   Modif 2   Modif 3
Original   Dice's               100        87.36     85.02     94.74
           Jaccard              100        77.56     73.93     90.00
           Cosine               100        87.37     85.40     94.84
Modif 1    Dice's               87.36      100       77.48     85.06
           Jaccard              77.56      100       63.24     74.00
           Cosine               87.37      100       77.96     85.23
Modif 2    Dice's               85.02      77.48     100       86.17
           Jaccard              73.93      63.24     100       75.71
           Cosine               85.40      77.96     100       86.27
Modif 3    Dice's               94.74      85.06     86.17     100
           Jaccard              90.00      74.00     75.71     100
           Cosine               94.84      85.23     86.27     100
Table 1 shows the test scenario using n-gram = 2 to form hash values from the gram set. Among the similarity measures used, the cosine similarity model shows better performance in accuracy, with only a small margin over Dice's coefficient. This model is expected to serve as an alternative statistical measure for algorithms that use n-grams in their process. Comparing the original text with the text whose sentence positions were changed gives the best result, above 90% accuracy. Essentially, fingerprint-based similarity search matches corresponding hash values, so it does not depend on the position of the words; this property is called position independence [12].
Table 2. Comparison of similarity measure results on Indonesian texts (n-gram = 3)

Text       Similarity Measure   Original   Modif 1   Modif 2   Modif 3
Original   Dice's               100        68.31     69.91     86.82
           Jaccard              100        51.87     53.76     76.70
           Cosine               100        68.35     71.14     87.47
Modif 1    Dice's               68.31      100       56.49     67.22
           Jaccard              51.87      100       39.37     50.63
           Cosine               68.35      100       57.89     68.05
Modif 2    Dice's               69.91      56.49     100       76.02
           Jaccard              53.76      39.37     100       61.31
           Cosine               71.14      57.89     100       76.18
Modif 3    Dice's               86.82      67.22     76.02     100
           Jaccard              76.70      50.63     61.31     100
           Cosine               87.47      68.05     76.18     100
Table 2 shows the results of testing with n-gram = 3. In this test, the similarity accuracy decreases compared with Table 1 (n = 2). Comparing the original text with modified texts 1 and 2 yields accuracies of around 70% or below. The Jaccard coefficient shows a drastic decrease in accuracy as the n-gram value grows; Jaccard has a weakness when measuring certain words [14]. The similarity value is influenced by the size of the n value: the smaller the n-gram value used, the higher the measured similarity of the text files. This research confirms the findings of previous similar research [5][6][8] that the similarity accuracy is affected by the n-gram value used.
5
Conclusion
Based on this research, from the design stage through the analysis of test results, models of distance-based similarity measure can be implemented successfully with Rabin-Karp. Cosine similarity shows better accuracy than the Dice coefficient and the Jaccard coefficient across the modified-text scenarios. It achieves above 90% accuracy when the original text is compared with the text whose sentence positions were changed, because the similarity search is based on matching hash values. The simple cosine model is expected to serve as an alternative statistical measure implemented together with character n-grams as a parameter, especially for detecting plagiarism in Indonesian text. The smaller the n-gram value used, the higher the measured similarity of the text files. The accuracy value should only be used as a reference for detecting plagiarism; more aspects need to be investigated, such as writing style, citation use, and others.
6
References
1. B. Zaka, "Theory and Applications of Similarity Detection Techniques," Institute for Information Systems and Computer Media (IICM), Graz University of Technology, Austria, 2009.
2. B. Stein and S. M. zu Eissen, "Near Similarity Search and Plagiarism Analysis," presented at the 29th Annual Conference of the German Classification Society (GfKl), Berlin, 2006.
3. A. Wibowo, K. Sudarmadi, and R. Barmawi, "Comparison between fingerprint and winnowing algorithm to detect plagiarism fraud on Bahasa Indonesia documents," in Int. Conf. on Information and Communication Technology (ICoICT 2013), pp. 128-133, Mar. 2013.
4. R. M. Karp and M. O. Rabin, "Efficient randomized pattern-matching algorithms," IBM J. Res. Dev., vol. 31, no. 2, pp. 249-260, Mar. 1987.
5. E. Nugroho, "Perancangan Sistem Deteksi Plagiarisme Dokumen Teks dengan Menggunakan Algoritma Rabin Karp," Jurusan Ilmu Komputer, Universitas Muhammadiyah Malang, 2011. (Unpublished)
6. Salmuasih and A. Sunyoto, "Implementasi Algoritma Rabin Karp untuk Pendeteksian Plagiat Dokumen Teks Menggunakan Konsep Similarity," Semin. Nas. Apl. Teknol. Inf. (SNATI 2013), pp. 23-28, Jun. 2013.
7. A. Kurniawati, L. Wulandary, and I. W. S. Wicaksana, "The Comparation of Similarity Detection Method on Indonesian Language Document," 2nd Int. Conf. Soft Comput. Intell. Syst. Inf. Technol., pp. 285-289, 2010.
8. A. M. Surahman, "Perancangan Sistem Penentuan Similarity Kode Program Pada Bahasa C Dan Pascal Dengan Menggunakan Algoritma Rabin-Karp," J. Sist. dan Teknol. Inf., vol. 1, no. 1, 2013.
9. J. Barnard, "Chemical Structure Representation and Search Systems," presented at Chemical Informatics Software & Consultancy Services, Sheffield, UK, 18 Nov. 2003.
10. B. O'Connor, "F-scores, Dice, and Jaccard set similarity," 11 Apr. 2012. [Online]. Available: http://brenocon.com/blog/2012/04/f-scores-dice-and-jaccard-set-similarity/. [Accessed: 03 Mar. 2014].
11. "Approximate String Matching." [Online]. Available: http://cs.uef.fi/~zhao/Link/Similarity_strings.html. [Accessed: 04 Mar. 2014].
12. D. Purwitasari, P. Y. Kusmawan, and U. L. Yuhana, "Deteksi Keberadaan Kalimat Sama sebagai Indikasi Penjiplakan dengan Algoritma Hashing berbasis N-Gram," Surabaya: Institut Teknologi Surabaya, 2010.
13. "Discussion of Similarity Metrics: Jaccard / Tanimoto Coefficient," Similarity Metrics. [Online]. Available: http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/sphilip/tani.html. [Accessed: 04 Mar. 2014].
14. S. Niwattanakul, J. Sinthongchai, E. Naenudorn, and S. Wanapu, "Using of Jaccard Coefficient for Keywords Similarity," in Proc. Int. MultiConference of Engineers and Computer Scientists 2013 (IMECS 2013), Hong Kong, 2013, vol. 1.