Text Retrieval and Search Engines Week 1
This is the note I took from Text Retrieval and Search Engines of Coursera. You can access the course by applying to Coursera.
Week 1
Key Concepts:
- Part of Speech tagging, syntactic analysis, semantic analysis, and ambiguity
- “Bag of words” representation
- Push, pull, querying, browsing
- Probability ranking principle
- Relevance
- Vector space model
- Dot product
- Bit vector representation
Recommended Readings:
- N. J. Belkin and W. B. Croft. 1992. Information filtering and information retrieval: Two sides of the same coin? Commun. ACM 35, 12 (Dec. 1992), 29–38.
- C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM Book Series, Morgan & Claypool Publishers, 2016. Chapters 1–6.
Motivation To Learn Text Mining?
- contains people’s opinion
- encodes human knowledge
- offers opportunity for discovering knowledge
Examples of Text Information System Applications
- Search
- Filtering/Recommendation
- Categorization
- Mining/Extraction
- Others
Pre-Course Quiz
Basic probability, linear algebra, computer science
Goals and Objectives
- Explain some basic concepts in natural language processing, text information access.
- Explain why text retrieval is often defined as a ranking problem.
- Explain the basic idea of the vector space retrieval model and how to instantiate it with the simplest bit-vector representation.
Lesson1.1 Natural Language Content Analysis
NLP Challenges
- Word-level Ambiguity (design: V/N, root: multiple meaning)
- Syntactic Ambiguity (natural language processing:Modification, A man saw a boy with a telescope: PP Attachment)
- Anaphora resolution: John persuaded Bill to buy a TV for himself.(Who?)
- Presupposition: “He has quit smoking” => imply that he smoked before
NLP for Text Retrieval
- Must be general robust & efficient => Shallow NLP
- “Bag of words” representation tends to be sufficient for most search tasks
- deep NLP is needed for complex search
Additional Reading: Chris Manning and Hinrich Schütze foundations of statistical natural language processing
Lesson1.2 Text Access
How can a text information system help users get access to the relevant text data?
Pull vs Push
Pull Mode(search engines): Users type a query and the search engine reply
- Querying: user enters a query, system return relevant documents, works well when the user knows what keywords to use
- Browsing: user navigates into relevant information by following a path enabled by the structures on the documents, works well when the user wants to explore information, doesn’t know what keywords to use, or cant”t conveniently enter a query
Additional Reading: Belkin, N.J. and Croft, B.W. (1992) Information Filtering and Information Retrieval: Two Sides of the Same Coin? Communications of the ACM, 35, 29–38.
Push Mode(recommender systems): News monitor news streams and push article to you(When you search information from search engine)
Lesson1.3 Text Retrieval Problem
What is Text Retrieval?
Text Retrieval vs Database Retrieval
╔═════════════╦════════════════════════╦════════════════════════╗
║ ║ Text Retrieval ║ Database Retrieval ║
╠═════════════╬════════════════════════╬════════════════════════╣
║ Information ║ Unstructured,Ambiguous ║ Structured,Well-defined║
║ Query ║ Ambiguous,Incomplete ║ Well-defined,Complete ║
║ Answers ║ Relevant Documents ║ Matched Records ║
╚═════════════╩════════════════════════╩════════════════════════╝
Text Retrieval needs empirical evaluation(users defined)
Formal Formulation of TR
Documents(ex:tweets)
Collection(ex:web)
How To Computer R’(q)
Document Selection vs Ranking
Problems of Document Selection
- “Over-constrained” query -> no relevant documents to return
- “Under-constrained” query -> over delivery
- Hard to find the right position between these two extremes above
- All relevant documents may not be equally relevant(Prioritization needed)
Theoretical Justification for RANKING
- Probability Ranking Principle: Returning a ranked list of documents in descending order of probability that a document is relevant to the query is the optimal strategy under the following two assumptions:
1. The utility of a document(to a user) is independent of the utility of any other document2. A user would browse the results sequentially
(remove redundancy is possibly required)
Additional Reading:
- S.E. RobertsonThe probability ranking principle in information retrieval,Journal of Documentation, 33, 4, 294–304, Dec 77
- C. J. Van Rijsbergen, Information Retrieval, 2nd Edition(MUST READ)!!!!!
Lesson 1.4: Overview of Text Retrieval Methods
How to Design a Ranking Function
Many Different Retrieval Models
Common Ideas in Retrieval Models
- Bag of words representation
- Term Frequency(TF) and Document Frequency(DF) of words
- Document Length
Additional Reading:
- For detailed discussion and comparison of state of the art models: Hui Fang, Tao Tao, ChengXiang Zhai, Diagnostic Evaluation of Information Retrieval Models.
- Broad review of different retrieval models: ChengXiang Zhai, Statistical Language Models for Information Retrieval
Lesson 1.5: Vector Space Model — Basic Idea
Similarity-based models: f(q,d) = similarity(q,d)
Lesson 1.6: Vector Space Retrieval Model — Simplest Instantiation
Dimension(Bag of words): Each words defined in our document
Vector Placement: Bit Vector(present:1, absent:0)
Similar Instantiation: Dot Product
Guiding Questions
- What does a computer have to do in order to understand a natural language sentence?
- What is ambiguity?
- Why is natural language processing (NLP) difficult for computers?
- What is bag-of-words representation? Why do modern search engines use this simple representation of text?
- What are the two modes of text information access? Which mode does a web search engine such as Google support?
- When is browsing more useful than querying to help a user find relevant information?
- Why is a text retrieval task defined as a ranking task?
- What is a retrieval model?
- What are the two assumptions made by the Probability Ranking Principle?
- What is the Vector Space Retrieval Model? How does it work?
- How do we define the dimensions of the Vector Space Model? What does “bag of words” representation mean?
- What does the retrieval function intuitively capture when we instantiate a vector space model with bag of words representation and bit representation for documents and queries?