Web Search
Challenges | Application |
---|---|
scalability | parallel indexing & searching (MapReduce) |
low-quality information and spam | spam detection & robust ranking |
dynamics of the web | -- |
Opportunities | Application |
---|---|
many additional heuristics can be leveraged to improve search accuracy | -- |
rich link information, layout, etc. | link analysis & multi-feature ranking |
Web → Crawler → Indexer ↔ Retriever ↔ Browser ← User
Major Crawling Strategies:
- Breadth-First
- Parallel Crawling
- Variation: focused crawling
  - Targets a subset of pages
  - Typically given a query
- Incremental / Repeated Crawling
  - Needs to minimize resource overhead
  - Can learn from past experience (e.g., pages updated daily vs. monthly)
  - Targets: 1) frequently updated pages, 2) frequently accessed pages
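The breadth-first strategy above can be sketched as a FIFO queue over a frontier of URLs. This is a minimal illustration, not a production crawler: the `web` dict and `get_links` callback are hypothetical stand-ins for real HTTP fetching and link extraction.

```python
from collections import deque

def bfs_crawl(seed, get_links, max_pages=10):
    """Breadth-first crawl: visit pages in FIFO order starting from a seed URL."""
    frontier = deque([seed])   # FIFO queue gives breadth-first order
    seen = {seed}              # avoid re-queueing already-discovered URLs
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Toy in-memory "web" standing in for HTTP fetching + link extraction.
web = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
}
order = bfs_crawl("a", lambda u: web.get(u, []))  # → ["a", "b", "c", "d", "e"]
```

A parallel crawler would partition the frontier across workers; focused crawling would filter or prioritize `frontier` by relevance to a query instead of plain FIFO order.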
Link Analysis
Ranking Algorithms for Web Search
Standard IR models aren't sufficient for the web: information needs differ, documents carry additional information (links, layout, anchor text), and information quality varies widely.
Major Extensions:
- Exploiting links to improve scoring
- Exploiting clickthroughs for massive implicit feedback