1. INTRODUCTION
2. RETRIEVAL STRATEGIES
  2.1 Vector Space Model
  2.2 Probabilistic Retrieval Strategies
  2.3 Language Models
  2.4 Inference Networks
  2.5 Extended Boolean Retrieval
  2.6 Latent Semantic Indexing
  2.7 Neural Networks
  2.8 Genetic Algorithms
  2.9 Fuzzy Set Retrieval
  2.10 Summary
  2.11 Exercises
3. RETRIEVAL UTILITIES
  3.1 Relevance Feedback
  3.2 Clustering
  3.3 Passage-based Retrieval
  3.4 N-grams
  3.5 Regression Analysis
  3.6 Thesauri
  3.7 Semantic Networks
  3.8 Parsing
  3.9 Summary
  3.10 Exercises
4. CROSS-LANGUAGE INFORMATION RETRIEVAL
  4.1 Introduction
  4.2 Crossing the Language Barrier
  4.3 Cross-Language Retrieval Strategies
  4.4 Cross-Language Utilities
  4.5 Summary
  4.6 Exercises
5. EFFICIENCY
  5.1 Inverted Index
  5.2 Query Processing
  5.3 Signature Files
  5.4 Duplicate Document Detection
  5.5 Summary
  5.6 Exercises
6. INTEGRATING STRUCTURED DATA AND TEXT
  6.1 Review of the Relational Model
  6.2 A Historical Progression
  6.3 Information Retrieval as a Relational Application
  6.4 Semi-Structured Search using a Relational Schema
  6.5 Multi-dimensional Data Model
  6.6 Mediators
  6.7 Summary
  6.8 Exercises
7. PARALLEL INFORMATION RETRIEVAL
  7.1 Parallel Text Scanning
  7.2 Parallel Indexing
  7.3 Clustering and Classification
  7.4 Large Parallel Systems
  7.5 Summary
  7.6 Exercises
8. DISTRIBUTED INFORMATION RETRIEVAL
  8.1 A Theoretical Model of Distributed Retrieval
  8.2 Web Search
  8.3 Result Fusion
  8.4 Peer-to-Peer Information Systems
  8.5 Other Architectures
  8.6 Summary
  8.7 Exercises
9. SUMMARY AND FUTURE DIRECTIONS
References
Index
3.4.1 D'Amore and Mah

Initial information retrieval research focused on n-grams as presented in [D'Amore and Mah, 1985]. The motivation behind their work was the fact that it is difficult to develop mathematical models for terms, since the potential for a term that has not been seen before is infinite. With n-grams, only a fixed number of n-grams can exist for a given value of n. A mathematical model was developed to estimate the noise in indexing and to determine appropriate document similarity measures.

D'Amore and Mah's method replaces terms with n-grams in the vector space model. The only remaining issue is computing the weights for each n-gram. Instead of simply using n-gram frequencies, a scaling method is used to normalize the length of the document. D'Amore and Mah's contention was that a large document contains more n-grams than a small document, so it should be scaled based on its length.

To compute the weights for a given n-gram, D'Amore and Mah estimated the number of occurrences of an n-gram in a document. The first simplifying assumption was that n-grams occur with equal likelihood and follow a binomial distribution; hence, the n-gram "ABC" was no more likely to occur than "DEE". The Zipfian distribution that is widely accepted for terms does not hold for n-grams. D'Amore and Mah noted that n-grams are not equally likely to occur, but the removal of frequently occurring terms from the document collection resulted in n-grams that follow a more nearly binomial distribution than the terms.

D'Amore and Mah computed the expected number of occurrences of an n-gram in a particular document. This is the product of the number of n-grams in the document (the document length) and the probability that the n-gram occurs. The n-gram's probability of occurrence is computed as the ratio of its number of occurrences to the total number of n-grams in the document. D'Amore and Mah continued their application of the bino …
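The steps above can be sketched in a few lines of Python. This is only an illustration of the general idea, not D'Amore and Mah's actual implementation: the helper names are invented, whitespace handling is an assumption, and the n-gram probability here is estimated over the whole collection (one plausible reading of the expected-occurrence computation).

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams of a string.
    Assumption: lowercase the text and drop whitespace before slicing."""
    text = "".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_weights(doc, n=3):
    """Length-normalized weights: each n-gram's frequency divided by the
    total number of n-grams in the document, so long documents are not
    favored simply for containing more n-grams."""
    grams = char_ngrams(doc, n)
    counts = Counter(grams)
    total = len(grams)
    return {g: c / total for g, c in counts.items()}

def expected_occurrences(gram, doc, collection, n=3):
    """Expected count of `gram` in `doc` under a binomial model:
    (number of n-grams in the document) * P(gram), with P(gram)
    estimated from relative frequency across the collection."""
    all_grams = [g for d in collection for g in char_ngrams(d, n)]
    p = all_grams.count(gram) / len(all_grams)
    return len(char_ngrams(doc, n)) * p
```

For example, `ngram_weights("banana")` yields a weight of 0.5 for "ana" (two of the four trigrams), and `expected_occurrences` compares that observed rate against what the binomial model predicts from the collection-wide rate.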