Introduction to Information Retrieval
Penelusuran Informasi (Information Retrieval) Sumber: CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Taufik Fuadi Abidin
Link Analysis
Penelusuran Informasi (Information Retrieval)
Pembahasan Minggu Ini Hypertext dan Link (tautan): melihat lebih jauh, tidak hanya aspek tektual dari dokumen HTML Makna dari keterkaitan Aplikasi terkait
2
Penelusuran Informasi (Information Retrieval)
Hyperlink (tautan) Link dapat dimanfaatkan sebagai sumber daya yang potensial dalam menentukan authenticity & authority Jenis Link: Good, Bad & Unknown ? Good
?
?
Bad
? 3
Penelusuran Informasi (Information Retrieval)
Simple Iterative Logic Link yang Good tidak akan tunjuk ke node yang Bad Jika Good node link ke Bad node maka Good node akan menjadi Bad node Jika Good node link ke node tertentu, maka node tersebut merupakan node yang Good ? Good
?
?
Bad
? 4
Penelusuran Informasi (Information Retrieval)
Simple Iterative Logic
Good
?
Bad
?
Bad
?
Good
5
Penelusuran Informasi (Information Retrieval)
Contoh Link Analysis Social networks Page Ranks http://www.cs.cornell.edu/home/kleinber/networks-book/
6
Sec. 21.1
Penelusuran Informasi (Information Retrieval)
Web Merupakan Directed Graph
Page A
Anchor
hyperlink
Page B
Asumsi 1: hyperlink antara halaman HTML menyatakan authority dari halaman tersebut (quality signal) Asumsi 2: Text yang ada pada anchor dari sebuah hyperlink menggambarkan (a glance) halaman target/tujuan (textual context)
Introduction to Information Retrieval
Asumsi 1: Site Dengan Reputasi Baik
8
Introduction to Information Retrieval
Asumsi 2: Menggambarkan Target
9
Sec. 21.1.1
Penelusuran Informasi (Information Retrieval)
Indexing Anchor Text Ketika mengindeks documen D, dengan bobot tertentu, anchor text yang tunjuk ke dokumen D juga di-include-kan Armonk, NY-based computer giant IBM announced today
Big Blue today announced record profits for the quarter
www.ibm.com Joe’s computer hardware links Sun HP IBM
Anchor text weight tergantung authority dari anchor page’s website Jika diasumsikan bahwa konten dari cnn.com atau yahoo.com adalah authoritative (bagus) maka percaya semua anchor text dari web itu
Penelusuran Informasi (Information Retrieval)
Citation Analysis (Part of Link Analysis) Citation frequency Bibliographic coupling frequency Articles that co-cite the same articles are related
Citation indexing Who is this author cited by? (Garfield 1972)
Penelusuran Informasi (Information Retrieval)
Isu: Web isn’t scholarly citation Miliaran partisipan, setiap partisipan punya tujuan sendiri-sendiri Spam ada dimana-mana Ketika SE mulai menggunakan link untuk ranking, (sekitar tahun 1998), spam terkait link tumbuh berkembang
12
Penelusuran Informasi (Information Retrieval)
Link Analysis Page Rank (Google) A link analysis algorithm that assigns a weight to each page to measure the relative importance of the page within the set. It interprets a link from page A to page B as a vote, by page A, for page B (see The Anatomy of a Large-Scale Hypertextual Web Search Engine. Brin, S.; Page, L., 1998)
ExpertRank (Ask.com) Subject-Specific Popularity: analyzed links in context to rank a web page's importance within its specific subject. Ex. a web page about ‘basketball’ would rank higher if other web pages about ‘basketball' link to it Presented at the General Lecture, Mathematics Department, Syiah Kuala University, May 2007
13
Penelusuran Informasi (Information Retrieval)
Page Rank Algorithm Diketahui node A, B, C, and D. Initial PR adalah 0.25 Jika B, C, D tunjuk ke A maka PR(A) = PR(B) + PR(C) + PR(D) B
A
D
C
Namun, jika B jika tunjuk (link) ke C, dan D link ke semau (A, B, dan C), maka vote untuk setiap node dinormalisasi dengan membagi jumlah outbound links ke halaman tersebut. B
Vote B adalah 0.125 untuk node A dan C Vote D = 0.25/3 = 0.081 untuk node A, B, dan C
A
PR(A) = PR(B)/OL(B) + PR(C)/OL(C) + PR(D)/OL(D) = 0.456
Presented at the General Lecture, Mathematics Department, Syiah Kuala University, May 2007
D
C
14
Penelusuran Informasi (Information Retrieval)
Sec. 21.3
Hyperlink-Induced Topic Search (HITS) In response to a query, instead of an ordered list of pages each meeting the query, find two sets of interrelated pages: Hub pages are good lists of links on a subject. e.g., “Bob’s list of cancer-related links.”
Authority pages occur recurrently on good hubs for the subject.
Penelusuran Informasi (Information Retrieval)
Sec. 21.3
Hubs and Authorities Thus, a good hub page for a topic points to many authoritative pages for that topic. A good authority page for a topic is pointed to by many good hubs for that topic.
Penelusuran Informasi (Information Retrieval)
Sec. 21.3
High-level scheme Extract from the web a base set of pages that could be good hubs or authorities. From these, identify a small set of top hub and authority pages; →iterative algorithm.
Sec. 21.3
Penelusuran Informasi (Information Retrieval)
Representasi n×n adjacency matrix A: each of the n pages in the base set has a row and column in the matrix. Entry Aij = 1 if page i links to page j, else = 0.
1
2 3
1
1 0
2 1
3 0
2
1
1
1
3
1
0
0