Text Retrieval and Search Engines Week 1

圓圓小熊(Maruマル)
5 min readJun 9, 2018

--

This is the note I took from Text Retrieval and Search Engines of Coursera. You can access the course by applying to Coursera.

Week 1

Key Concepts:

  • Part of Speech tagging, syntactic analysis, semantic analysis, and ambiguity
  • “Bag of words” representation
  • Push, pull, querying, browsing
  • Probability ranking principle
  • Relevance
  • Vector space model
  • Dot product
  • Bit vector representation

Recommended Readings:

  • N. J. Belkin and W. B. Croft. 1992. Information filtering and information retrieval: Two sides of the same coin? Commun. ACM 35, 12 (Dec. 1992), 29–38.
  • C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapters 1–6.

Motivation To Learn Text Mining?

  1. contains people’s opinion
  2. encodes human knowledge
  3. offers opportunity for discovering knowledge

Examples of Text Information System Applications

  1. Search
  2. Filtering/Recommendation
  3. Categorization
  4. Mining/Extraction
  5. Others

Pre-Course Quiz

Basic probability, linear algebra, computer science

Goals and Objectives

  • Explain some basic concepts in natural language processing, text information access.
  • Explain why text retrieval is often defined as a ranking problem.
  • Explain the basic idea of the vector space retrieval model and how to instantiate it with the simplest bit-vector representation.

Lesson1.1 Natural Language Content Analysis

NLP Challenges

  1. Word-level Ambiguity (design: V/N, root: multiple meaning)
  2. Syntactic Ambiguity (natural language processing:Modification, A man saw a boy with a telescope: PP Attachment)
  3. Anaphora resolution: John persuaded Bill to buy a TV for himself.(Who?)
  4. Presupposition: “He has quit smoking” => imply that he smoked before

NLP for Text Retrieval

  1. Must be general robust & efficient => Shallow NLP
  2. “Bag of words” representation tends to be sufficient for most search tasks
  3. deep NLP is needed for complex search

Additional Reading: Chris Manning and Hinrich Schütze foundations of statistical natural language processing

Lesson1.2 Text Access

How can a text information system help users get access to the relevant text data?

Pull vs Push

Pull Mode(search engines): Users type a query and the search engine reply

  • Querying: user enters a query, system return relevant documents, works well when the user knows what keywords to use
  • Browsing: user navigates into relevant information by following a path enabled by the structures on the documents, works well when the user wants to explore information, doesn’t know what keywords to use, or cant”t conveniently enter a query

Additional Reading: Belkin, N.J. and Croft, B.W. (1992) Information Filtering and Information Retrieval: Two Sides of the Same Coin? Communications of the ACM, 35, 29–38.

Push Mode(recommender systems): News monitor news streams and push article to you(When you search information from search engine)

Lesson1.3 Text Retrieval Problem

What is Text Retrieval?

Text Retrieval vs Database Retrieval

╔═════════════╦════════════════════════╦════════════════════════╗
║ ║ Text RetrievalDatabase Retrieval
╠═════════════╬════════════════════════╬════════════════════════╣
║ Information ║ Unstructured,Ambiguous ║ Structured,Well-defined║
║ Query ║ Ambiguous,Incomplete ║ Well-defined,Complete ║
║ Answers ║ Relevant Documents ║ Matched Records ║
╚═════════════╩════════════════════════╩════════════════════════╝

Text Retrieval needs empirical evaluation(users defined)

Formal Formulation of TR

Documents(ex:tweets)

Collection(ex:web)

How To Computer R’(q)

Strategy Ranking is prefered

Document Selection vs Ranking

Problems of Document Selection

  1. “Over-constrained” query -> no relevant documents to return
  2. “Under-constrained” query -> over delivery
  3. Hard to find the right position between these two extremes above
  4. All relevant documents may not be equally relevant(Prioritization needed)

Theoretical Justification for RANKING

  • Probability Ranking Principle: Returning a ranked list of documents in descending order of probability that a document is relevant to the query is the optimal strategy under the following two assumptions:
1. The utility of a document(to a user) is independent of the utility of any other document2. A user would browse the results sequentially
(remove redundancy is possibly required)

Additional Reading:

  1. S.E. RobertsonThe probability ranking principle in information retrieval,Journal of Documentation, 33, 4, 294–304, Dec 77
  2. C. J. Van Rijsbergen, Information Retrieval, 2nd Edition(MUST READ)!!!!!

Lesson 1.4: Overview of Text Retrieval Methods

How to Design a Ranking Function

Many Different Retrieval Models

Common Ideas in Retrieval Models

  1. Bag of words representation
  2. Term Frequency(TF) and Document Frequency(DF) of words
  3. Document Length

Additional Reading:

  1. For detailed discussion and comparison of state of the art models: Hui Fang, Tao Tao, ChengXiang Zhai, Diagnostic Evaluation of Information Retrieval Models.
  2. Broad review of different retrieval models: ChengXiang Zhai, Statistical Language Models for Information Retrieval

Lesson 1.5: Vector Space Model — Basic Idea

Similarity-based models: f(q,d) = similarity(q,d)

We can see that d2 is the most similar document to Query q

Lesson 1.6: Vector Space Retrieval Model — Simplest Instantiation

Dimension(Bag of words): Each words defined in our document

Vector Placement: Bit Vector(present:1, absent:0)

Similar Instantiation: Dot Product

重要的VSM相似公式

Guiding Questions

  • What does a computer have to do in order to understand a natural language sentence?
  • What is ambiguity?
  • Why is natural language processing (NLP) difficult for computers?
  • What is bag-of-words representation? Why do modern search engines use this simple representation of text?
  • What are the two modes of text information access? Which mode does a web search engine such as Google support?
  • When is browsing more useful than querying to help a user find relevant information?
  • Why is a text retrieval task defined as a ranking task?
  • What is a retrieval model?
  • What are the two assumptions made by the Probability Ranking Principle?
  • What is the Vector Space Retrieval Model? How does it work?
  • How do we define the dimensions of the Vector Space Model? What does “bag of words” representation mean?
  • What does the retrieval function intuitively capture when we instantiate a vector space model with bag of words representation and bit representation for documents and queries?

--

--

圓圓小熊(Maruマル)
圓圓小熊(Maruマル)

Written by 圓圓小熊(Maruマル)

在日本東京打拼的軟體工程師。喜歡滑雪,旅遊,吃美食,寫程式,參加Meetup,以及比價。最近沈迷寫Chatbot,Django以及把玩AWS。

No responses yet