Internet Information Search
College of Computer and Communication, Hunan University
劉鈺峰

Internet Information Search VI: tf-idf and vector spaces

Review
1. Chinese word segmentation
2. Dictionary compression
3. Postings list compression
4. tf-idf

Scoring documents
- How do we construct an index?
- What strategies can we use with limited main memory?

Scoring
- We wish to return, in order, the documents most likely to be useful to the searcher
- How can we rank-order the docs in the corpus with respect to a query?
- Assign a score, say in [0,1], for each doc on each query
- Begin with a perfect world: no spammers
  - Nobody stuffing keywords into a doc to make it match queries
  - More on "adversarial IR" under web search
Linear zone combinations
- First generation of scoring methods: use a linear combination of Booleans
- E.g., Score = 0.6*⟨match in zone 1⟩ + 0.3*⟨match in zone 2⟩ + 0.05*⟨match in zone 3⟩ + 0.05*⟨match in zone 4⟩
- Each expression ⟨match in zone k⟩ takes on a value in {0,1}; the overall score is then in [0,1]
- For this example the scores can only take on a finite set of values: what are they?

Exercise
- On the query bill OR rights, suppose that we retrieve the following docs from the various zone indexes:
  [postings diagram: bill and rights postings in the Author, Title, and Body zone indexes]
- Compute the score for each doc based on the weightings 0.6, 0.3, 0.1

General idea
- We are given a weight vector whose components sum up to 1; there is a weight for each zone/field
- Given a Boolean query, we assign a score to each doc by adding up the weighted contributions of the zones/fields (a minimal sketch follows below)
- Typically users want to see the K highest-scoring docs
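To make weighted zone scoring concrete, here is a minimal Python sketch; the zone weights and the toy document are illustrative assumptions, not data from the slides:

```python
# Weighted zone scoring: score(q, d) = sum over zones of
# weight(zone) * boolean_match(q, d, zone), with weights summing to 1.

WEIGHTS = {"author": 0.6, "title": 0.3, "body": 0.1}  # assumed weights

def zone_score(query_terms, doc_zones, weights=WEIGHTS):
    """doc_zones maps a zone name to the text of that zone."""
    score = 0.0
    for zone, weight in weights.items():
        zone_terms = doc_zones.get(zone, "").lower().split()
        # Boolean OR semantics: the zone matches if any query term occurs in it.
        if any(t in zone_terms for t in query_terms):
            score += weight
    return score

doc = {"author": "bill", "title": "civil rights", "body": "a bill of rights"}
print(zone_score(["bill", "rights"], doc))  # ~1.0 (0.6 + 0.3 + 0.1)
```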
Index support for zone combinations
- In the simplest version we have a separate inverted index for each zone
- Variant: have a single index with a separate dictionary entry for each term and zone, e.g. bill.author, bill.title, bill.body, each with its own postings list
  [postings diagram]
- Of course, compress zone names like author/title/body

Zone combinations index
- The above scheme is still wasteful: each term is potentially replicated for each zone
- In a slightly better scheme, we encode the zone in the postings
- At query time, accumulate contributions to the total score of a document from the various postings, e.g.
  bill → 1.author, 1.body → 2.author, 2.body → 3.title
  rights → 3.title, 3.body → 5.title, 5.body
- As before, the zone names get compressed

Score accumulation
- As we walk the postings for the query bill OR rights, we accumulate scores for each doc in a linear merge as before (see the sketch below)
- Note: we get both bill and rights in the Title field of doc 3, but score it no higher
- Should we give more weight to more hits?
  [accumulator diagram: docs 1, 2, 3, 5 end up with scores 0.7, 0.7, 0.4, 0.4]
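A small Python sketch of this accumulation, assuming zone-encoded postings like those above; the zone weights (0.6/0.3/0.1, as in the earlier exercise) are an assumption:

```python
# Zone-encoded postings: term -> list of (doc_id, zone), sorted by doc_id.
postings = {
    "bill":   [(1, "author"), (1, "body"), (2, "author"), (2, "body"), (3, "title")],
    "rights": [(3, "title"), (3, "body"), (5, "title"), (5, "body")],
}
weights = {"author": 0.6, "title": 0.3, "body": 0.1}  # assumed zone weights

def accumulate_scores(query_terms):
    # For clarity this loops per term rather than doing a true simultaneous
    # linear merge; the accumulators end up the same.
    scores = {}          # doc_id -> accumulated score
    contributed = set()  # each (doc, zone) credits its weight at most once,
                         # so two query terms in the same zone score no higher
    for term in query_terms:
        for doc_id, zone in postings.get(term, []):
            if (doc_id, zone) not in contributed:
                contributed.add((doc_id, zone))
                scores[doc_id] = scores.get(doc_id, 0.0) + weights[zone]
    return scores

print(accumulate_scores(["bill", "rights"]))
# ~{1: 0.7, 2: 0.7, 3: 0.4, 5: 0.4}: doc 3 matches both terms in Title,
# but that zone still contributes 0.3 only once.
```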
Term-document count matrices
- Consider the number of occurrences of a term in a document: the bag of words model
- Each document is a vector: a column of the term-document count matrix

Bag of words view of a doc
- Thus the doc "John is quicker than Mary." is indistinguishable from the doc "Mary is quicker than John." (a quick demonstration follows below)
- Which of the indexes discussed so far distinguish these two docs?

Counts vs. frequencies
- WARNING: in a lot of IR literature, "frequency" is used to mean "count"
- Thus term frequency in IR literature is used to mean the number of occurrences in a doc, not divided by document length (which would actually make it a frequency)
- We will conform to this misnomer: in saying term frequency we mean the number of occurrences of a term in a document

Term frequency tf
- Long docs are favored because they're more likely to contain query terms
- Can fix this to some extent by normalizing for document length
- But is raw tf the right measure?
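Returning to the bag-of-words point above: a quick Python check that raw term counts cannot tell the two example docs apart (toy code, not from the slides):

```python
from collections import Counter

d1 = "John is quicker than Mary"
d2 = "Mary is quicker than John"

# Bag of words: each doc is reduced to its term-count vector.
bow1 = Counter(d1.lower().split())
bow2 = Counter(d2.lower().split())

print(bow1 == bow2)  # True: word order is lost, the count vectors are identical
```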
Document frequency
- But document frequency (df) may be better:
  df = number of docs in the corpus containing the term

  Word        cf       df
  ferrari     10422    17
  insurance   10440    3997

- Document/collection frequency weighting is only possible in a known (static) collection
- So how do we make use of df?

tf x idf term weights
- The tf x idf measure combines:
  - term frequency (tf), or wf: some measure of term density in a doc
  - inverse document frequency (idf): a measure of the informativeness of a term, i.e. its rarity across the whole corpus
- idf could just be built from the raw count of the number of documents the term occurs in (idf_i = 1/df_i), but by far the most commonly used version is:
  idf_i = log(n / df_i), where n is the total number of docs in the corpus
- See Kishore Papineni, NAACL 2, 2002 for theoretical justification

Summary: tf x idf (or tf.idf)
- Assign a tf.idf weight to each term i in each document d:
  w_{i,d} = tf_{i,d} x log(n / df_i)
- The weight increases with the number of occurrences of the term within a doc
- It increases with the rarity of the term across the whole corpus
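These formulas in a short Python sketch, using the df values from the ferrari/insurance table above; the corpus size n is an assumed figure for illustration:

```python
import math

N = 1_000_000  # assumed number of docs in the corpus; the slides leave n open

def idf(df, n=N):
    # idf_i = log(n / df_i); base 10 here, any base only rescales the weights.
    return math.log10(n / df)

def tf_idf(tf, df, n=N):
    # w_{i,d} = tf_{i,d} x log(n / df_i)
    return tf * idf(df, n)

print(idf(17))        # ferrari, df = 17: ~4.77, rare and informative
print(idf(3997))      # insurance, df = 3997: ~2.40, common, less informative
print(tf_idf(3, 17))  # three occurrences of a rare term weigh heavily
```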
More on tf

Real-valued term-document matrices
- Take a function (scaling) of the count of a word in a document: still the bag of words model
- Each doc is a vector in R^v
- Here: log-scaled tf.idf
- Note: weights can now be > 1!
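A common choice of log scaling, shown as a sketch (the slide does not pin down the exact function; wf = 1 + log10(tf) is one standard variant):

```python
import math

def wf(tf):
    # Log-scaled term frequency: repeated occurrences still raise the
    # weight, but with sharply diminishing returns.
    return 1 + math.log10(tf) if tf > 0 else 0.0

for tf in (0, 1, 2, 10, 1000):
    print(tf, wf(tf))
# tf grows 1000-fold from 1 to 1000, yet wf only grows from 1.0 to 4.0
```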
Documents as vectors
- Each doc j can now be viewed as a vector of wf x idf values, one component for each term
- So we have a vector space:
  - terms are axes
  - docs live in this space
  - even with stemming, we may have 20,000+ dimensions
- (The corpus of documents gives us a matrix, which we could also view as a vector space in which words live: transposable data)

Why turn docs into vectors?
- First application: query-by-example
- Given a doc d, find others "like" it
- Now that d is a vector, find vectors (docs) "near" it

Intuition
- Postulate: documents that are "close together" in the vector space talk about the same things
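A minimal query-by-example sketch under this postulate; cosine similarity is used here as the "nearness" measure (the standard choice in the vector space model, though the slides have not introduced it yet), and the toy vectors are illustrative:

```python
import math

def cosine(u, v):
    # Similarity of two term-weight vectors: cosine of the angle between them.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy wf x idf vectors over a 4-term vocabulary (illustrative numbers).
docs = {
    "d1": [0.0, 2.3, 1.1, 0.0],
    "d2": [0.1, 2.0, 1.3, 0.0],  # points roughly the same way as d1
    "d3": [3.0, 0.0, 0.0, 1.7],  # nearly orthogonal to d1
}

query_doc = docs["d1"]
ranked = sorted(docs, key=lambda d: cosine(docs[d], query_doc), reverse=True)
print(ranked)  # ['d1', 'd2', 'd3']: the doc most "like" d1 comes first
```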