Introduction to Information Retrieval: Reading Notes, Chapter 1

2020-06-26 14:43:01 Views: 337 Source: Internet

Tags: Information documents index introduction Retrieval element data retrieval docume

Preface

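After I got in touch with the advisor for my recommended graduate admission, he decided that I should start studying information retrieval this summer. He sent me an English-language IR classic, "Introduction to Information Retrieval", and asked me to write an English report after finishing each chapter. The book is said to be the go-to introduction to IR, so my first reaction on seeing it was ~~scalp-tingling dread~~ (excitement). Starting today I will also post my reading summaries (a chronicle of gnawing through the book) on this blog. I hope I can keep it up!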

If you want the PDF of the original book, you can get it from the network drive:
Link: https://pan.baidu.com/s/1PkJ-I-HNfyNHgyb5aZGNaA
Extraction code: w8h8

Chapter 1: Boolean retrieval

In the first chapter, the authors use an information retrieval example to introduce Boolean retrieval. They then define the binary term-document incidence matrix and the inverted index and show how each is applied in Boolean retrieval. The chapter also introduces other main retrieval methods and some difficulties of information retrieval.

Definitions

Here are definitions of terms in this chapter:

Information retrieval: Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Term-document incidence matrix: A matrix that shows whether a term is contained in a document (my own definition). Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise. Here is an example: [image: example term-document incidence matrix]
Inverted index: It has two parts: a dictionary kept in memory and postings stored on disk. The dictionary contains all the terms that appear in the documents. Each term has a pointer to its postings list on disk, which records the IDs of the documents in which the term appears. For example, in the index below, we can see that the term “Calpurnia” appears in documents 2, 31, 54, and 101.
[image: example inverted index]
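To make the dictionary-plus-postings idea concrete, here is a minimal sketch that builds an in-memory inverted index. The function name `build_inverted_index`, the whitespace tokenizer, and the three toy documents are my own assumptions for illustration; they are not from the book, and a real system would keep the postings on disk.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of the IDs of documents containing it.

    `docs` is a dict {doc_id: text}. Tokenization is a naive lowercase
    whitespace split, which is an assumption made for this sketch.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    # Keep every postings list sorted by document ID, since the merge
    # algorithms for AND / OR / AND NOT rely on that order.
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical toy collection (not the book's data).
docs = {
    1: "brutus killed caesar",
    2: "caesar and calpurnia",
    3: "brutus and calpurnia",
}
index = build_inverted_index(docs)
print(index["calpurnia"])  # [2, 3]
print(index["brutus"])     # [1, 3]
```

The dictionary here is the set of keys, and each value plays the role of a postings list.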

Detailed summary

Here are my summary and thoughts:
In general, the first chapter is about how to use Boolean retrieval to find, within a large collection, the documents that satisfy a given combination of word conditions. By definition, information retrieval works on unstructured data rather than structured data. “Unstructured data” refers to data that does not have a clear, semantically overt, easy-for-a-computer structure. Data stored in a relational database, by contrast, is structured, so we can query it directly with SQL statements. In practice, unstructured data is usually not completely unstructured: text has latent linguistic structure that we can try to uncover and use.
For a large collection of web documents, we must first index all the documents. When a user searches with specific keywords, we only need to look up the index and perform certain operations on it to get the results. Building the index in advance greatly reduces retrieval time and avoids repeated work. The two indexing methods introduced in the first chapter are the term-document incidence matrix and the inverted index.
For the term-document incidence matrix, we take the row vectors of the terms in the user's query and combine them with bitwise AND, OR, and NOT operations to get a result vector. In the result vector, a “1” means that the corresponding document matches. Taking the book's figure as an example, for the query “Brutus AND Caesar AND NOT Calpurnia” we complement the Calpurnia vector and then AND the three vectors together.
[image: the book's worked example for this query]
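The query “Brutus AND Caesar AND NOT Calpurnia” can be sketched with Python integers used as bit vectors. The play order and the three vectors below are the ones I recall from the book's Shakespeare example, so treat them as an assumption; the helper `matching_plays` is a hypothetical name of my own.

```python
# One bit per play, leftmost bit = first play in the list.
# Play order and vectors are assumed to match the book's example.
PLAYS = [
    "Antony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

# Python integers are unbounded, so mask the complement to 6 bits.
mask = (1 << len(PLAYS)) - 1
result = brutus & caesar & (~calpurnia & mask)


def matching_plays(bits):
    """Return the plays whose bit is set in `bits`."""
    n = len(PLAYS)
    return [PLAYS[i] for i in range(n) if (bits >> (n - 1 - i)) & 1]


print(format(result, "06b"))   # 100100
print(matching_plays(result))  # ['Antony and Cleopatra', 'Hamlet']
```

The mask matters: without it, `~calpurnia` is a negative number in Python, which still works inside the AND here but gives confusing results if it is printed on its own.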
The authors introduce the inverted index because the term-document incidence matrix contains mostly 0s, that is, it is a sparse matrix. In such cases, a linked list is often used instead of the matrix. The inverted index still has terms, but each term no longer has a vector of equal length. Instead, a linked list or variable-length array of document IDs records the documents in which the term appears. Thus the logical operations between bit vectors become operations between sets, but the essence remains the same. To carry out these operations quickly on lists, the book introduces an efficient merge algorithm with time complexity O(x+y), where x and y are the lengths of the two postings lists. It requires both lists to be sorted by document ID. The idea is as follows:
AND operation: Maintain a pointer to the current element in each of the two lists. If the two current elements are equal, add the element to the result and advance both pointers. If they differ, advance the pointer at the smaller element. The algorithm ends as soon as either list has been fully traversed.
OR operation: Maintain a pointer to the current element in each of the two lists. If the two current elements are equal, add the element to the result once and advance both pointers. If they differ, add the smaller element to the result and advance its pointer. When one list runs out, add the remaining elements of the other list. The algorithm ends when both lists have been traversed.
AND NOT: Take “A AND NOT B” as an example. Maintain a pointer to the current element in each of the two lists. If the two current elements are equal, advance both pointers without adding anything. If they differ and the current element of B is smaller, advance the pointer in B. If they differ and the current element of A is smaller, add the current element of A to the result and advance the pointer in A. If B runs out first, add the remaining elements of A. The algorithm ends when A has been traversed.
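The three merge procedures above can be sketched as follows, using sorted Python lists in place of linked lists. The function names are my own, and the Brutus postings list is assumed for illustration; only the Calpurnia list (2, 31, 54, 101) comes from the notes above.

```python
def intersect(a, b):
    """A AND B: IDs present in both sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """A OR B: IDs present in either sorted list, without duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


def and_not(a, b):
    """A AND NOT B: IDs in sorted list a that are absent from sorted list b."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            j += 1
    out.extend(a[i:])  # b is exhausted, so the rest of a survives
    return out


brutus = [1, 2, 4, 11, 31, 45, 173, 174]  # assumed sample postings
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
print(union(brutus, calpurnia))      # [1, 2, 4, 11, 31, 45, 54, 101, 173, 174]
print(and_not(brutus, calpurnia))    # [1, 4, 11, 45, 173, 174]
```

Each loop advances at least one pointer per step, which is where the O(x+y) bound comes from.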

Optimization

Some algorithmic optimizations for Boolean retrieval:

Process the terms of a conjunctive query in order of increasing postings-list length, starting with the shortest list. Every intermediate result is then no longer than the shortest list, which reduces the number of operations.
For two terms whose postings lists differ greatly in length, other strategies can speed things up. For example, for each item in the short list, use binary search to look it up in the long list. This only works if the postings are sorted, and in that case it can be much faster than a linear merge.
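Both optimizations can be combined in a short sketch. It uses `bisect_left` from Python's standard `bisect` module for the binary search; the function names `intersect_by_search` and `intersect_many` and the sample lists are my own assumptions, not the book's code.

```python
from bisect import bisect_left


def intersect_by_search(short, long):
    """Intersect two sorted lists by binary-searching `long` for each ID in `short`."""
    out = []
    lo = 0  # both lists are sorted, so never search before the last position
    for doc_id in short:
        lo = bisect_left(long, doc_id, lo)
        if lo == len(long):
            break  # every remaining ID in `short` is larger than all of `long`
        if long[lo] == doc_id:
            out.append(doc_id)
    return out


def intersect_many(postings_lists):
    """AND together several sorted postings lists, shortest first."""
    ordered = sorted(postings_lists, key=len)
    if not ordered:
        return []
    result = list(ordered[0])
    for plist in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect_by_search(result, plist)
    return result


lists = [
    [1, 2, 4, 11, 31, 45, 173, 174],  # assumed sample postings
    [2, 31, 54, 101],
    [2, 5, 31, 40],
]
print(intersect_many(lists))  # [2, 31]
```

With a short list of length x and a long list of length y, this costs roughly x binary searches, about O(x log y), which beats the O(x+y) merge only when x is much smaller than y.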

Limitations

Limitations of the retrieval methods in Chapter 1:

It cannot handle spelling mistakes, so query words must be spelled exactly.
It ignores synonyms. A query word may have synonyms, and the searcher may also want the system to return documents that contain them.
It cannot rank the returned documents. Boolean retrieval only decides whether a document matches; it has no way to compute a degree of match, so the results cannot be sorted by relevance.
It cannot answer queries with positional requirements. Suppose I want documents in which A and B occur in the same sentence. Because the postings do not record word positions, this query cannot be answered.

My questions:

The book says that the dictionary in memory holds pointers to the postings on disk. How is this implemented? C++ has pointers, but they are limited to memory. How can I build a pointer that goes from memory to the hard disk?

Since the index is built in advance and the collection of documents on the Internet changes constantly, is the index updated all the time as new or updated documents arrive?

Source: https://blog.csdn.net/qq_42728797/article/details/106950477
