Information Retrieval (Temu Kembali Informasi)
Search Engine Nur Hayatin Teknik Informatika-Universitas Muhamadiyah Malang 2016
Outline • Definisi • Arsitektur • Crawler • Processing Text • Indexing • Retrieval & Perankingan
Search Application • Search is needed everywhere! All major operating systems have embedded searching. • Tujuan : mendapatkan dokumen yang relevan dari koleksi dokumen sesuai dengan kebutuhan user.
Search Application Macam aplikasi pencarian : • Sistem pencarian file • Perpustakaan digital • Web search • Etc.
Search Application • Sistem pencarian file (Directory & File System Search) : provides search and browsing facilities for files stored on a local hard disk and possibly on disks connected over a local network. • Textual data rewrite entire files when a user saves changes.
Search Application • Perpustakaan digital (Digital Library) : sistem pencarian pustaka berbasis komputerisasi. Pustaka meliputi : newspaper articles, medical journals, maps, atau buku. • Pencarian biasanya dilakukan berdasarkan : authors/penulis, judul, dan tanggal.
Search Engine • Search engine atau mesin pencari istilah yang biasa digunakan untuk menyebut web search. • adalah sebuah mesin yang mampu mengenali halaman web yang mengandung keyword. • Contoh mesin pencari yang populer : google.com yahoo.com bing.com etc.
Search Engine
Search Engine (Mesin Pencari)
Search Engine • Perbandingan hasil pencarian dari dua mesin pencari (Bing vs Google)
Arsitektur Search Engine
Crawling Web Graph
Retrieving & Ranking Feedback
Yang perlu dianalisa • Kecepatannya : berhubungan dengan parallel processing, brp byk mesin yg dilibatkan. Paralel processing digunakan dalam mesin pencari untuk mempercepat waktu proses. • Bagaimana web crawler bekerja : periodically downloading, etc. • Perankingan : dapat dilihat dari : konten halaman web (sebagian atau keseluruhan), hyperlink, karakter user, dlsb. Perankingan menjadi tantangan utama dalam pengembangan search engine. Dari sinilah muncul ilmu SEO (Search engine Optimization).
Arsitektur Search Engine
Web Crawler
Web Crawler • Web crawler software yang digunakan mesin pencari untuk mendownload atau menyalin halaman web. • Web crawler = Web Spider, Web Robot, or simply Bot. • Web crawler bekerja secara periodik (Bisa mingguan, harian, bahkan per-jam).
Web Crawler Tantangan : • Jumlah web sangat banyak dan akan terus bertambah (growing) • Halaman web selalu berubah • Crawler harus dapat mengenali dan digunakan untuk tipe data yang bermacam-macam.
Cycle of a Crawling 1. Crawler mengunduh halaman web, mengurai kalimat/teks dan membaca link yang ada di halaman web tersebut. 2. Halaman web yang belum dikunjungi akan ditambahkan ke sistem antrian untuk diunduh kemudian. 3. Crawler akan kembali mengunduh halaman web baru, proses ini akan diulang terus sampai semua link dikunjungi atau sampai disk full.
Arsitektur Web Crawler
• Scheduler: maintains a queue of URLs to visit • Downloader: downloads the pages • Storage: makes the indexing of the pages, and provides the scheduler with metadata on the pages retrieved
Simple Crawler Thread
Re-visit Policy • Setiap halaman web hasil crawling mempunyai usia (age). • Age is a measure that indicates how outdated the local copy.
Head HTTP protocol has a special request type called HEAD that makes it easy to check for page changes. • returns information about page, not page itself
Contoh Crawler • Googlebot • Bingbot • Yahoo! Slurp
Open-source : • DataparkSearch • GRUB • Heritrix
Text Processor
Arsitektur Search Engine
Crawling Web Graph
Retrieving & Ranking Feedback
Processing Text • New words come from an unstructured documents and a variety of sources. Ex. : spelling errors, product, company names, code, other languages, email addresses, etc. • Tugas text processor mengubah (converting) dokumen menjadi term.
Term = word (kata)
Kamu harus tau! Term : digunakan sebagai kata-ganti kata (word). Why? Because a query term may in fact not be a word at all, may be a date, a number, a musical note, or a phrase. Contoh : Paman datang ke Jakarta sejak 1950
Processing Text • Case folding • Tokenizing • Normalization • Stopping • Stemming
Case folding • Case folding Mengubah huruf capital menjadi huruf kecil. Contoh : Teks asli : Paman datang ke Jakarta sejak 1950 Case Folding : paman datang ke jakarta sejak 1950
Tokenizing • Pembentukan term dari sederetan karakter.
• Early IR systems: • Term adalah karakter alfanumerik dengan panjang karakter 3 atau lebih. • Dibatasi dengan spasi atau karakter tertentu
Tokenizing Contoh :
Before : paman datang ke jakarta sejak 1950 After : paman datang jakarta sejak 1950 Before : Bigcorp's 2007 bi-annual report showed profits rose 10%. After : bigcorp 2007 annual report showed profits rose
Tokenizing Kekurangan : Too simple for search applications or even largescale experiments. Why? Too much information lost, small decisions in tokenizing can have major impact on effectiveness of some queries
Kasus “Tokenizing”
Kasus “Tokenizing”
Kasus “Tokenizing”
Karakter Word 1. Small words can be important in some queries, usually in combinations. Ex. : xp, ma, pm, ben e king, el paso, gm, j lo, world war II. 2. Both hyphenated and non-hyphenated forms of many words are common • Hyphen is not needed : e-bay, wal-mart, active-x, cd-rom, t-shirts. • Hyphens should be considered either as part of the word or a word separator : winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking.
Karakter Word 3. Capitalized words can have different meaning from lower case words. Ex : Bush, Apple 4. Apostrophes can be a part of a word, a part of a possessive, or just a mistake. • Ex : rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's
Karakter Word 5. Numbers can be important, including decimals. • nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358 6. Dot can occur in numbers, abbreviations, URLs, ends of sentences, and other situations. • I.B.M., Ph.D., cs.umass.edu, F.E.A.R. 7. Use parser to identify appropriate parts of document to tokenize. Ex : Stanford parser
Normalization • Proses transformasi teks menjadi bentuk standart. Contoh : bhs bahasa rmh rumah tgl tanggal aq aku, saya
Normalization
Stopping • Stopping : proses penghapusan kata yang termasuk dalam stopword list. • Stop Word : kata-kata yang tidak memiliki informasi penting. Contoh : (bahasa inggris) the, of, too, that, which. etc (bahasa Indonesia) yang, lah, wah, pun. dll • Karakter stopword : high occurrence frequencies
Stopping • Stopword list can be created from high-frequency words or based on a standard list. • Efek stopping : reduce index space, improve response time, improve effectiveness.
Top 50 Words from AP89
Stemming • Stemming Proses menyeleksi kata berimbuhan dan mengubah menjadi bentuk kata dasar (root). Contoh : membaca, dibaca, terbaca, bacakan = baca • Algoritma stemming : Porter stemmer (untuk b inggris) Nazief-Adriani stemmer (untuk b indo)
Stemming • Many morphological variations of words • inflectional (plurals, tenses) • derivational (making verbs nouns etc.) • In most cases, these have the same or very similar meanings • Stemmers attempt to reduce morphological variations of words to a common stem • usually involves removing suffixes • Can be done at indexing time or as part of query processing (like stopwords)
Stemming Generally a small but significant effectiveness improvement • can be crucial for some languages • e.g., 5-10% improvement for English, up to 50% in Arabic
Stemming Two basic types • Dictionary-based: uses lists of related words • Algorithmic: uses program to determine related words Algorithmic stemmers • suffix-s: remove ‘s’ endings assuming plural • e.g., cats → cat, lakes → lake, wiis → wii • Many false negatives: supplies → supplie • Some false positives: ups → up