Internet Information Search
College of Computer and Communication, Hunan University
劉鈺峰

Internet Information Search VI: tf-idf and vector spaces

Review
1. Chinese word segmentation
2. Dictionary compression
3. Postings list compression
4. tf-idf

Scoring documents
- How do we construct an index?
- What strategies can we use with limited main memory?

Scoring
- We wish to return, in order, the documents most likely to be useful to the searcher
- How can we rank-order the docs in the corpus with respect to a query?
- Assign a score, say in [0,1], for each doc on each query
- Begin with a perfect world: no spammers
  - Nobody stuffing keywords into a doc to make it match queries
  - More on "adversarial IR" under web search
Linear zone combinations
- First generation of scoring methods: use a linear combination of Booleans
- E.g., Score = 0.6*⟨match in zone 1⟩ + 0.3*⟨match in zone 2⟩ + 0.05*⟨match in zone 3⟩ + 0.05*⟨match in zone 4⟩
- Each expression ⟨match in zone k⟩ takes on a value in {0,1}; the overall score is then in [0,1]
- For this example the scores can only take on a finite set of values: what are they?

Exercise
- On the query bill OR rights, suppose that we retrieve the following docs from the various zone indexes:
  [postings diagram: bill and rights postings in the Author, Title, and Body zone indexes]
- Compute the score for each doc based on the weightings 0.6, 0.3, 0.1

General idea
- We are given a weight vector whose components sum up to 1; there is a weight for each zone/field
- Given a Boolean query, we assign a score to each doc by adding up the weighted contributions of the zones/fields (a minimal sketch follows below)
- Typically users want to see the K highest-scoring docs
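To make weighted zone scoring concrete, here is a minimal Python sketch; the zone weights and the toy document are illustrative assumptions, not data from the slides:

```python
# Weighted zone scoring: score(q, d) = sum over zones of
# weight(zone) * boolean_match(q, d, zone), with weights summing to 1.

WEIGHTS = {"author": 0.6, "title": 0.3, "body": 0.1}  # assumed weights

def zone_score(query_terms, doc_zones, weights=WEIGHTS):
    """doc_zones maps a zone name to the text of that zone."""
    score = 0.0
    for zone, weight in weights.items():
        zone_terms = doc_zones.get(zone, "").lower().split()
        # Boolean OR semantics: the zone matches if any query term occurs in it.
        if any(t in zone_terms for t in query_terms):
            score += weight
    return score

doc = {"author": "bill", "title": "civil rights", "body": "a bill of rights"}
print(zone_score(["bill", "rights"], doc))  # ~1.0 (0.6 + 0.3 + 0.1)
```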
Index support for zone combinations
- In the simplest version we have a separate inverted index for each zone
- Variant: have a single index with a separate dictionary entry for each term and zone, e.g. bill.author, bill.title, bill.body, each with its own postings list
  [postings diagram]
- Of course, compress zone names like author/title/body

Zone combinations index
- The above scheme is still wasteful: each term is potentially replicated for each zone
- In a slightly better scheme, we encode the zone in the postings
- At query time, accumulate contributions to the total score of a document from the various postings, e.g.
  bill → 1.author, 1.body → 2.author, 2.body → 3.title
  rights → 3.title, 3.body → 5.title, 5.body
- As before, the zone names get compressed

Score accumulation
- As we walk the postings for the query bill OR rights, we accumulate scores for each doc in a linear merge as before (see the sketch below)
- Note: we get both bill and rights in the Title field of doc 3, but score it no higher
- Should we give more weight to more hits?
  [accumulator diagram: docs 1, 2, 3, 5 end up with scores 0.7, 0.7, 0.4, 0.4]
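A small Python sketch of this accumulation, assuming zone-encoded postings like those above; the zone weights (0.6/0.3/0.1, as in the earlier exercise) are an assumption:

```python
# Zone-encoded postings: term -> list of (doc_id, zone), sorted by doc_id.
postings = {
    "bill":   [(1, "author"), (1, "body"), (2, "author"), (2, "body"), (3, "title")],
    "rights": [(3, "title"), (3, "body"), (5, "title"), (5, "body")],
}
weights = {"author": 0.6, "title": 0.3, "body": 0.1}  # assumed zone weights

def accumulate_scores(query_terms):
    # For clarity this loops per term rather than doing a true simultaneous
    # linear merge; the accumulators end up the same.
    scores = {}          # doc_id -> accumulated score
    contributed = set()  # each (doc, zone) credits its weight at most once,
                         # so two query terms in the same zone score no higher
    for term in query_terms:
        for doc_id, zone in postings.get(term, []):
            if (doc_id, zone) not in contributed:
                contributed.add((doc_id, zone))
                scores[doc_id] = scores.get(doc_id, 0.0) + weights[zone]
    return scores

print(accumulate_scores(["bill", "rights"]))
# ~{1: 0.7, 2: 0.7, 3: 0.4, 5: 0.4}: doc 3 matches both terms in Title,
# but that zone still contributes 0.3 only once.
```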
Term-document count matrices
- Consider the number of occurrences of a term in a document: the bag of words model
- Each document is a vector: a column of the term-document count matrix

Bag of words view of a doc
- Thus the doc "John is quicker than Mary." is indistinguishable from the doc "Mary is quicker than John." (a quick demonstration follows below)
- Which of the indexes discussed so far distinguish these two docs?

Counts vs. frequencies
- WARNING: in a lot of IR literature, "frequency" is used to mean "count"
- Thus term frequency in IR literature is used to mean the number of occurrences in a doc, not divided by document length (which would actually make it a frequency)
- We will conform to this misnomer: in saying term frequency we mean the number of occurrences of a term in a document

Term frequency tf
- Long docs are favored because they're more likely to contain query terms
- Can fix this to some extent by normalizing for document length
- But is raw tf the right measure?
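Returning to the bag-of-words point above: a quick Python check that raw term counts cannot tell the two example docs apart (toy code, not from the slides):

```python
from collections import Counter

d1 = "John is quicker than Mary"
d2 = "Mary is quicker than John"

# Bag of words: each doc is reduced to its term-count vector.
bow1 = Counter(d1.lower().split())
bow2 = Counter(d2.lower().split())

print(bow1 == bow2)  # True: word order is lost, the count vectors are identical
```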
Document frequency
- But document frequency (df) may be better:
  df = number of docs in the corpus containing the term

  Word        cf       df
  ferrari     10422    17
  insurance   10440    3997

- Document/collection frequency weighting is only possible in a known (static) collection
- So how do we make use of df?

tf x idf term weights
- The tf x idf measure combines:
  - term frequency (tf), or wf: some measure of term density in a doc
  - inverse document frequency (idf): a measure of the informativeness of a term, i.e. its rarity across the whole corpus
- idf could just be built from the raw count of the number of documents the term occurs in (idf_i = 1/df_i), but by far the most commonly used version is:
  idf_i = log(n / df_i), where n is the total number of docs in the corpus
- See Kishore Papineni, NAACL 2, 2002 for theoretical justification

Summary: tf x idf (or tf.idf)
- Assign a tf.idf weight to each term i in each document d:
  w_{i,d} = tf_{i,d} x log(n / df_i)
- The weight increases with the number of occurrences of the term within a doc
- It increases with the rarity of the term across the whole corpus
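These formulas in a short Python sketch, using the df values from the ferrari/insurance table above; the corpus size n is an assumed figure for illustration:

```python
import math

N = 1_000_000  # assumed number of docs in the corpus; the slides leave n open

def idf(df, n=N):
    # idf_i = log(n / df_i); base 10 here, any base only rescales the weights.
    return math.log10(n / df)

def tf_idf(tf, df, n=N):
    # w_{i,d} = tf_{i,d} x log(n / df_i)
    return tf * idf(df, n)

print(idf(17))        # ferrari, df = 17: ~4.77, rare and informative
print(idf(3997))      # insurance, df = 3997: ~2.40, common, less informative
print(tf_idf(3, 17))  # three occurrences of a rare term weigh heavily
```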
More on tf

Real-valued term-document matrices
- Take a function (scaling) of the count of a word in a document: still the bag of words model
- Each doc is a vector in R^v
- Here: log-scaled tf.idf
- Note: weights can now be > 1!
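A common choice of log scaling, shown as a sketch (the slide does not pin down the exact function; wf = 1 + log10(tf) is one standard variant):

```python
import math

def wf(tf):
    # Log-scaled term frequency: repeated occurrences still raise the
    # weight, but with sharply diminishing returns.
    return 1 + math.log10(tf) if tf > 0 else 0.0

for tf in (0, 1, 2, 10, 1000):
    print(tf, wf(tf))
# tf grows 1000-fold from 1 to 1000, yet wf only grows from 1.0 to 4.0
```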
Documents as vectors
- Each doc j can now be viewed as a vector of wf x idf values, one component for each term
- So we have a vector space:
  - terms are axes
  - docs live in this space
  - even with stemming, we may have 20,000+ dimensions
- (The corpus of documents gives us a matrix, which we could also view as a vector space in which words live: transposable data)

Why turn docs into vectors?
- First application: query-by-example
- Given a doc d, find others "like" it
- Now that d is a vector, find vectors (docs) "near" it

Intuition
- Postulate: documents that are "close together" in the vector space talk about the same things
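A minimal query-by-example sketch under this postulate; cosine similarity is used here as the "nearness" measure (the standard choice in the vector space model, though the slides have not introduced it yet), and the toy vectors are illustrative:

```python
import math

def cosine(u, v):
    # Similarity of two term-weight vectors: cosine of the angle between them.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy wf x idf vectors over a 4-term vocabulary (illustrative numbers).
docs = {
    "d1": [0.0, 2.3, 1.1, 0.0],
    "d2": [0.1, 2.0, 1.3, 0.0],  # points roughly the same way as d1
    "d3": [3.0, 0.0, 0.0, 1.7],  # nearly orthogonal to d1
}

query_doc = docs["d1"]
ranked = sorted(docs, key=lambda d: cosine(docs[d], query_doc), reverse=True)
print(ranked)  # ['d1', 'd2', 'd3']: the doc most "like" d1 comes first
```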