Text Retrieval and Search Engines Week 1

5 min readJun 9, 2018

This is the note I took from Text Retrieval and Search Engines of Coursera. You can access the course by applying to Coursera.

Week 1

Key Concepts:

Part of Speech tagging, syntactic analysis, semantic analysis, and ambiguity
“Bag of words” representation
Push, pull, querying, browsing
Probability ranking principle
Relevance
Vector space model
Dot product
Bit vector representation

Recommended Readings:

N. J. Belkin and W. B. Croft. 1992. Information filtering and information retrieval: Two sides of the same coin? Commun. ACM 35, 12 (Dec. 1992), 29–38.
C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapters 1–6.

Motivation To Learn Text Mining?

contains people’s opinion
encodes human knowledge
offers opportunity for discovering knowledge

Examples of Text Information System Applications

Search
Filtering/Recommendation
Categorization
Mining/Extraction
Others

Pre-Course Quiz

Basic probability, linear algebra, computer science

Goals and Objectives

Explain some basic concepts in natural language processing, text information access.
Explain why text retrieval is often defined as a ranking problem.
Explain the basic idea of the vector space retrieval model and how to instantiate it with the simplest bit-vector representation.

Lesson1.1 Natural Language Content Analysis

NLP Challenges

Word-level Ambiguity (design: V/N, root: multiple meaning)
Syntactic Ambiguity (natural language processing:Modification, A man saw a boy with a telescope: PP Attachment)
Anaphora resolution: John persuaded Bill to buy a TV for himself.(Who?)
Presupposition: “He has quit smoking” => imply that he smoked before

NLP for Text Retrieval

Must be general robust & efficient => Shallow NLP
“Bag of words” representation tends to be sufficient for most search tasks
deep NLP is needed for complex search

Additional Reading: Chris Manning and Hinrich Schütze foundations of statistical natural language processing

Lesson1.2 Text Access

How can a text information system help users get access to the relevant text data?

Pull vs Push

Pull Mode(search engines): Users type a query and the search engine reply

Querying: user enters a query, system return relevant documents, works well when the user knows what keywords to use
Browsing: user navigates into relevant information by following a path enabled by the structures on the documents, works well when the user wants to explore information, doesn’t know what keywords to use, or cant”t conveniently enter a query

Additional Reading: Belkin, N.J. and Croft, B.W. (1992) Information Filtering and Information Retrieval: Two Sides of the Same Coin? Communications of the ACM, 35, 29–38.

Push Mode(recommender systems): News monitor news streams and push article to you(When you search information from search engine)

Lesson1.3 Text Retrieval Problem

What is Text Retrieval?

Text Retrieval vs Database Retrieval

╔═════════════╦════════════════════════╦════════════════════════╗
║             ║      Text Retrieval    ║    Database Retrieval  ║
╠═════════════╬════════════════════════╬════════════════════════╣
║ Information ║ Unstructured,Ambiguous ║ Structured,Well-defined║
║ Query       ║ Ambiguous,Incomplete   ║ Well-defined,Complete  ║
║ Answers     ║ Relevant Documents     ║ Matched Records        ║
╚═════════════╩════════════════════════╩════════════════════════╝

Text Retrieval needs empirical evaluation(users defined)

Formal Formulation of TR

Documents(ex:tweets)

Collection(ex:web)

How To Computer R’(q)

Document Selection vs Ranking

Problems of Document Selection

“Over-constrained” query -> no relevant documents to return
“Under-constrained” query -> over delivery
Hard to find the right position between these two extremes above
All relevant documents may not be equally relevant(Prioritization needed)

Theoretical Justification for RANKING

Probability Ranking Principle: Returning a ranked list of documents in descending order of probability that a document is relevant to the query is the optimal strategy under the following two assumptions:

1. The utility of a document(to a user) is independent of the utility of any other document2. A user would browse the results sequentially
(remove redundancy is possibly required)

Additional Reading:

S.E. RobertsonThe probability ranking principle in information retrieval,Journal of Documentation, 33, 4, 294–304, Dec 77
C. J. Van Rijsbergen, Information Retrieval, 2nd Edition(MUST READ)!!!!!

Lesson 1.4: Overview of Text Retrieval Methods

How to Design a Ranking Function

Many Different Retrieval Models

Common Ideas in Retrieval Models

Bag of words representation
Term Frequency(TF) and Document Frequency(DF) of words
Document Length

Additional Reading:

For detailed discussion and comparison of state of the art models: Hui Fang, Tao Tao, ChengXiang Zhai, Diagnostic Evaluation of Information Retrieval Models.
Broad review of different retrieval models: ChengXiang Zhai, Statistical Language Models for Information Retrieval

Lesson 1.5: Vector Space Model — Basic Idea

Similarity-based models: f(q,d) = similarity(q,d)

**We can see that d2 is the most similar document to Query q**

Lesson 1.6: Vector Space Retrieval Model — Simplest Instantiation

Dimension(Bag of words): Each words defined in our document

Vector Placement: Bit Vector(present:1, absent:0)

Similar Instantiation: Dot Product

Guiding Questions

What does a computer have to do in order to understand a natural language sentence?
What is ambiguity?
Why is natural language processing (NLP) difficult for computers?
What is bag-of-words representation? Why do modern search engines use this simple representation of text?
What are the two modes of text information access? Which mode does a web search engine such as Google support?
When is browsing more useful than querying to help a user find relevant information?
Why is a text retrieval task defined as a ranking task?
What is a retrieval model?
What are the two assumptions made by the Probability Ranking Principle?
What is the Vector Space Retrieval Model? How does it work?
How do we define the dimensions of the Vector Space Model? What does “bag of words” representation mean?
What does the retrieval function intuitively capture when we instantiate a vector space model with bag of words representation and bit representation for documents and queries?