Abstract: Experiments often result in the need to compare an observation against a reference, where observation and reference are selections made from some specified domain. The goal is to determine how close the observation is to the ideal result represented by the reference, so that, all other things being equal, systems that achieve outputs closer to the ideal reference can be preferred for deployment. Both observation and reference might be sets of items, or might be ordered sequences (rankings) of items. There are thus four possible combinations between sets and rankings. Three of those possibilities are already familiar to IR researchers, and have received detailed exploration. Here we consider the fourth combination, that of comparing an observation set relative to a reference ranking. We introduce a new measurement that we call rank-biased recall to cover this scenario, and demonstrate its usefulness with a case study from multi-phase ranking. We also present a new top-weighted “ranking compared to ranking” measurement, and show that it represents a complementary assessment to the previous rank-biased overlap mechanism, and possesses distinctive characteristics.
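To make the idea concrete, here is a minimal sketch of such a set-versus-ranking score, assuming the geometric rank discounting that rank-biased metrics share; the persistence parameter phi and the exact normalization are illustrative assumptions, not the paper's definition:

def rank_biased_recall(reference, observed, phi=0.8):
    # Each reference rank i (0-based here) carries weight (1 - phi) * phi**i,
    # so early reference items dominate. The score is the share of that
    # geometric weight recovered by the observed set.
    observed = set(observed)
    return sum((1 - phi) * phi ** i
               for i, item in enumerate(reference)
               if item in observed)

# A set covering the top of the reference ranking scores higher than
# one covering its tail:
print(rank_biased_recall("abcdef", {"a", "b"}))  # ~0.36
print(rank_biased_recall("abcdef", {"e", "f"}))  # ~0.15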
Abstract: A lot of recent work has focused on sparse learned indexes that use deep neural architectures to significantly improve retrieval quality while keeping the efficiency benefits of the inverted index. While such sparse learned structures achieve effectiveness far beyond those of traditional inverted index-based rankers, there is still a gap in effectiveness to the best dense retrievers, or even to sparse methods that leverage more expensive optimizations such as query expansion and query term weighting. We focus on narrowing this gap by revisiting and optimizing DeepImpact, a sparse retrieval approach that uses DocT5Query for document expansion followed by a BERT language model to learn impact scores for document terms. We first reinvestigate the expansion process and find that the recently proposed Doc2Query-- query filtration does not enhance retrieval quality when used with DeepImpact. Instead, substituting T5 with a fine-tuned Llama 2 model for query prediction results in a considerable improvement. Subsequently, we study training strategies that have proven effective for other models, in particular the use of hard negatives, distillation, and pre-trained CoCondenser model initialization. Our results substantially narrow the effectiveness gap with the most effective versions of SPLADE.
Abstract: Learned sparse retrieval systems aim to combine the effectiveness of contextualized language models with the scalability of conventional data structures such as inverted indexes. Nevertheless, the indexes generated by these systems exhibit significant deviations from the ones that use traditional retrieval models, leading to a discrepancy in the performance of existing query optimizations that were specifically developed for traditional structures. These disparities arise from structural variations in query and document statistics, including sub-word tokenization, leading to longer queries, smaller vocabularies, and different score distributions within posting lists. This paper introduces Block-Max Pruning (BMP), an innovative dynamic pruning strategy tailored for indexes arising in learned sparse retrieval environments. BMP employs a block filtering mechanism to divide the document space into small, consecutive document ranges, which are then aggregated and sorted on the fly, and fully processed only as necessary, guided by a defined safe early termination criterion or based on approximate retrieval requirements. Through rigorous experimentation, we show that BMP substantially outperforms existing dynamic pruning strategies, offering unparalleled efficiency in safe retrieval contexts and improved trade-offs between precision and efficiency in approximate retrieval tasks.
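A minimal sketch of the block-filtering idea, assuming toy in-memory structures (block_max[t][b] is term t's maximum score within document range b, and postings[t][b] its scored postings there); the real BMP operates over compressed index structures:

import heapq

def bmp_topk(query_terms, block_max, postings, num_blocks, k):
    # Upper-bound each document range by summing per-term block maxima,
    # then visit ranges in decreasing upper-bound order.
    ub = [sum(block_max[t][b] for t in query_terms) for b in range(num_blocks)]
    heap = []  # min-heap of (score, doc) holding the current top-k
    for b in sorted(range(num_blocks), key=lambda x: -ub[x]):
        if len(heap) == k and ub[b] <= heap[0][0]:
            break  # safe early termination: no remaining range can improve top-k
        scores = {}
        for t in query_terms:
            for doc, s in postings[t].get(b, []):
                scores[doc] = scores.get(doc, 0.0) + s
        for doc, s in scores.items():
            if len(heap) < k:
                heapq.heappush(heap, (s, doc))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, doc))
    return sorted(heap, reverse=True)

Relaxing the termination test (for example, breaking once ub[b] falls below a fraction of the heap minimum) yields the approximate-retrieval trade-off the abstract mentions.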
Abstract: We explore leveraging corpus-specific vocabularies that improve both efficiency and effectiveness of learned sparse retrieval systems. We find that pre-training the underlying BERT model on the target corpus, specifically targeting different vocabulary sizes incorporated into the document expansion process, improves retrieval quality by up to 12% while in some scenarios decreasing latency by up to 50%. Our experiments show that adopting corpus-specific vocabulary and increasing vocabulary size decreases average postings list length which in turn reduces latency. Ablation studies show interesting interactions between custom vocabularies, document expansion techniques, and sparsification objectives of sparse models. Both effectiveness and efficiency improvements transfer to different retrieval approaches such as uniCOIL and SPLADE and offer a simple yet effective approach to providing new efficiency-effectiveness trade-offs for learned sparse retrieval systems.
Abstract: Recent work has shown that neural retrieval models excel in text ranking tasks in a supervised setting when given large amounts of manually labeled training data. However, it remains an open question how one might train effective unsupervised retrieval models that are unequivocally better than lexical-matching baselines such as BM25. While some progress has been made in unsupervised dense retrieval models within a bi-encoder architecture, unsupervised sparse retrieval models remain unexplored. In this work, we propose BM26, to our knowledge the first such model. BM26 is trained in an unsupervised manner without the need for any human relevance judgments but keeps the same retrieval paradigm as BM25. Evaluations across multiple modern test collections, including MS MARCO, NaturalQuestions, TriviaQA, and BEIR, show that BM26 generally outperforms Contriever, the current state-of-the-art unsupervised dense retriever, and performs on par with BM25. We further demonstrate two promising avenues to enhance existing lexical retrieval systems using such unsupervised lexical retrieval models: 1) enhancing BM25 by hybridizing it with BM26 via simple vector concatenation (a combination we dub "BM51"), which yields a significant and consistent improvement over BM25 alone while remaining fully unsupervised; and 2) enhancing supervised lexical retrieval models through improved initialization with BM26, as demonstrated with the SPLADE model, generating significant improvements in both in-domain and zero-shot retrieval effectiveness.
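Because the two models occupy disjoint term spaces, the concatenation-based hybrid reduces to adding the two inner products; a toy sketch (any relative weighting between the two spaces is an assumption the abstract does not specify):

def sparse_dot(q, d):
    # Sparse vectors represented as {term: weight} dicts.
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def bm51_score(q_bm25, d_bm25, q_bm26, d_bm26):
    # Concatenating vectors over disjoint vocabularies means the hybrid
    # inner product decomposes into the sum of the two component scores.
    return sparse_dot(q_bm25, d_bm25) + sparse_dot(q_bm26, d_bm26)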
Abstract: Large-scale search engines have become a fundamental tool to efficiently access information on the Web. Typically, users expect answers in sub-second time frames, which demands highly efficient algorithms to traverse the data structures to return the top-k results. Although many top-k algorithms avoid processing all postings for all query terms, no single algorithm is the fastest on every query: the fastest algorithm on average does not necessarily perform best when evaluated on a per-query basis. To overcome this challenge, we propose to combine different state-of-the-art disjunctive top-k query processing algorithms to minimize the execution time by selecting the most promising one for each query. We model the selection step as a classification problem in a machine-learning setup. We conduct extensive experimentation and compare the results against state-of-the-art baselines using standard document collections and query sets. On ClueWeb12, our proposal shows a speed-up of up to 1.20x for non-blocked index organizations and 1.19x for block-based ones. Moreover, tail latencies are reduced, with improvements proportional to those observed on average and a dramatic decrease in latency variance. Given these findings, the proposed approach can be easily applied to existing search infrastructures to speed up query processing and reduce resource consumption, positively impacting providers' operative costs.
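In sketch form, the selection step amounts to a standard classifier over cheap per-query features; the feature set and model family below are illustrative assumptions, not the paper's exact setup:

from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-query features: term count, shortest and longest
# posting-list length, and the sum of per-term score upper bounds.
X_train = [[2, 1.2e4, 9.0e6, 14.2],
           [5, 3.0e2, 2.1e5, 31.0],
           [3, 8.5e3, 4.0e6, 18.7],
           [4, 6.0e3, 7.2e6, 25.1]]
y_train = ["MaxScore", "WAND", "BMW", "MaxScore"]  # fastest algorithm observed offline
clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print(clf.predict([[4, 5.0e3, 1.0e6, 22.5]]))  # pick a processor per query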
Abstract: Novel inverted index-based learned sparse ranking models provide more effective, but less efficient, retrieval performance compared to traditional ranking models like BM25. In this paper, we introduce a technique we call postings clipping to improve the query efficiency of learned representations. Our technique amplifies the benefit of dynamic pruning query processing techniques by accounting for changes in term importance distributions of learned ranking models. The new clipping mechanism accelerates top-k retrieval by up to 9.6X without any loss in effectiveness.
Abstract: Neural information retrieval architectures based on transformers such as BERT are able to significantly improve system effectiveness over traditional sparse models such as BM25. Though highly effective, these neural approaches are very expensive to run, making them difficult to deploy under strict latency constraints. To address this limitation, recent studies have proposed new families of learned sparse models that try to match the effectiveness of learned dense models, while leveraging the traditional inverted index data structure for efficiency. Current learned sparse models learn the weights of terms in documents and, sometimes, queries; however, they exploit different vocabulary structures, document expansion techniques, and query expansion strategies, which can make them slower than traditional sparse models such as BM25. In this work, we propose a novel indexing and query processing technique that exploits a traditional sparse model's "guidance" to efficiently traverse the index, allowing the more effective learned model to execute fewer scoring operations. Our experiments show that our guided processing heuristic is able to boost the efficiency of the underlying learned sparse model by a factor of four without any measurable loss of effectiveness.
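A two-pass caricature of the guidance idea (the actual technique interleaves both models within a single index traversal; indexes here are toy term -> {doc: weight} dicts, and the pool factor mu is an assumed knob):

def guided_topk(query_terms, bm25_index, impact_index, k, mu=4):
    # Pass 1: cheap BM25 scoring selects a candidate pool of size mu * k.
    bm25 = {}
    for t in query_terms:
        for doc, w in bm25_index.get(t, {}).items():
            bm25[doc] = bm25.get(doc, 0.0) + w
    pool = sorted(bm25, key=bm25.get, reverse=True)[:mu * k]
    # Pass 2: only pooled documents pay for the learned-impact scoring.
    scored = [(sum(impact_index.get(t, {}).get(d, 0.0) for t in query_terms), d)
              for d in pool]
    return sorted(scored, reverse=True)[:k]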
PDF CiteAbstract: While current search engines use highly complex ranking functions with hundreds of features, they often perform an initial candidate generation step that uses a very simple ranking function to identify a limited set of promising candidates. A common approach is to use a disjunctive top-k query for this step. There are many methods for disjunctive top-k computation, but they tend to be slow for the required values of k, which are in the hundreds to thousands. We propose a new approach to safe disjunctive top-k computation that, somewhat counterintuitively, uses precomputed conjunctions of inverted lists to speed up disjunctive queries. The approach is based on a generalization of the well-known MaxScore algorithm, and utilizes recent improvements in threshold estimation techniques as well as new ideas to obtain significant improvements in performance. Our algorithms are implemented as an extension of the PISA framework for search-engine query processing, and available as open-source to support replication and follow-up work.
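For context, the generalization builds on MaxScore's essential/non-essential split, sketched below; the paper's precomputed conjunctions and threshold estimation sit on top of this core step (theta is the current top-k threshold):

def split_essential(term_upper_bounds, theta):
    # Order terms by upper bound; the longest prefix whose bounds sum to
    # <= theta is non-essential: a document matching only those terms can
    # never enter the top-k, so new candidates are drawn solely from the
    # remaining essential lists.
    terms = sorted(term_upper_bounds, key=term_upper_bounds.get)
    prefix_sum, non_essential = 0.0, []
    for t in terms:
        if prefix_sum + term_upper_bounds[t] > theta:
            break
        prefix_sum += term_upper_bounds[t]
        non_essential.append(t)
    essential = terms[len(non_essential):]
    return essential, non_essential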
Abstract: Text retrieval using bags of words is typically formulated as inner products between vector representations of queries and documents, realized in query evaluation algorithms that traverse postings in an inverted index. Viewed in database terms, this captures a tight coupling between the 'logical' aspects of ranking (i.e., term weighting) and the 'physical' aspects of ranking (query evaluation). We argue that explicitly decoupling these two aspects offers a framework for thinking about the relationship between sparse retrieval techniques and the rapidly growing literature on dense retrieval techniques.
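The point is easiest to see in code: the same inner product (logical) can be realized by very different traversals (physical). A toy sketch:

def score_forward(query, doc_vector):
    # Logical view: ranking is just an inner product of sparse vectors,
    # regardless of where the weights came from (BM25, learned impacts, ...).
    return sum(w * doc_vector.get(t, 0.0) for t, w in query.items())

def score_all_inverted(query, inverted_index, num_docs):
    # Physical view: the same inner products, computed by traversing
    # postings term-at-a-time and accumulating partial scores per document.
    acc = [0.0] * num_docs
    for t, w in query.items():
        for doc, dw in inverted_index.get(t, []):
            acc[doc] += w * dw
    return acc

Swapping the term-weighting scheme changes neither function's structure, which is exactly the decoupling the abstract argues for.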
Abstract: Neural information retrieval systems typically use a cascading pipeline, in which a first-stage model retrieves a candidate set of documents and one or more subsequent stages re-rank this set using contextualized language models such as BERT. In this paper, we propose DeepImpact, a new document term-weighting scheme suitable for efficient retrieval using a standard inverted index. Compared to existing methods, DeepImpact improves impact-score modeling and tackles the vocabulary-mismatch problem. In particular, DeepImpact leverages DocT5Query to enrich the document collection and, using a contextualized language model, directly estimates the semantic importance of tokens in a document, producing a single-value representation for each token in each document. Our experiments show that DeepImpact significantly outperforms prior first-stage retrieval approaches by up to 17% on effectiveness metrics w.r.t. DocT5Query, and, when deployed in a re-ranking scenario, can reach the same effectiveness of state-of-the-art approaches with up to 5.1x speedup in efficiency.
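At query time, single-value impacts make scoring additive over matching terms; a toy sketch (the 8-bit quantization is a common impact-index choice assumed here, not a detail this abstract states):

def quantize(impact, max_impact, bits=8):
    # Map a real-valued learned impact onto a small integer for the index.
    return round(impact / max_impact * (2 ** bits - 1))

def impact_score(query_terms, doc_impacts):
    # doc_impacts: token -> single quantized impact for this document
    # (the abstract's 'single-value representation for each token').
    return sum(doc_impacts.get(t, 0) for t in set(query_terms))

print(impact_score(["neural", "index"], {"neural": 212, "index": 64, "the": 1}))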
Abstract: A lot of research has focused on the efficiency of search engine query processing, and in particular on disjunctive top-k queries that return the highest scoring k results that contain at least one of the query terms. Disjunctive top-k queries over simple ranking functions are commonly used to retrieve an initial set of candidate results that are then reranked by more complex, often machine-learned rankers. Many optimized top-k algorithms have been proposed, including MaxScore, WAND, BMW, and JASS. While the fastest methods achieve impressive results on top-10 and top-100 queries, they tend to become much slower for the larger k commonly used for candidate generation. In this paper, we focus on disjunctive top-k queries for larger k. We propose new algorithms that achieve much faster query processing for values of k up to thousands or tens of thousands. Our algorithms build on top of the live-block filtering approach of Dimopoulos et al., and exploit the SIMD capabilities of modern CPUs. We also perform a detailed experimental comparison of our methods with the fastest known approaches, and release a full implementation of our methods and of the underlying live-block mechanism, which will allow others to design and experiment with additional methods under the live-block approach.
Abstract: Feature engineering is a fundamental but poorly documented component in learning-to-rank (LTR) search engines. Such features are commonly used to construct learning models for web and product search engines, recommender systems, and question-answering tasks. In each of these domains, there is a growing interest in the creation of open-access test collections that promote reproducible research. However, there are still few open source software packages capable of extracting high-quality machine learning features from large text collections. Instead, most feature-based LTR research relies on "canned" test collections, which often do not expose critical details about the underlying collection or implementation details of the features extracted. Both of these are crucial to collection creation and deployment of a search engine into production. As a result, such experiments are rarely reproducible with new features or collections, and are of little help to companies wishing to deploy LTR systems. In this paper, we introduce Fxt, an open-source framework to perform efficient and scalable feature extraction. Fxt can easily be integrated into complex, high-performance software applications to help solve a wide variety of textual machine learning problems. To demonstrate the software's utility, we build and document a reproducible feature extraction pipeline and show how to recreate several common LTR experiments using the ClueWeb09B collection. Researchers and practitioners can benefit from Fxt to extend their machine learning pipelines for various text-based retrieval tasks, and learn how some static document features and query-specific features are implemented.
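To make the flavor of such features concrete, here are toy versions of two classic ones; this is not Fxt's API, whose real feature set and interfaces live in the framework itself:

import math

def ltr_features(query_terms, doc_tokens, df, num_docs, avg_doc_len,
                 k1=0.9, b=0.4):
    # A BM25 score and a query-coverage ratio, two staples of LTR models.
    tf = {t: doc_tokens.count(t) for t in query_terms}
    bm25 = sum(
        math.log(1 + (num_docs - df[t] + 0.5) / (df[t] + 0.5))
        * tf[t] * (k1 + 1)
        / (tf[t] + k1 * (1 - b + b * len(doc_tokens) / avg_doc_len))
        for t in query_terms if df.get(t) and tf[t])
    coverage = sum(1 for t in query_terms if tf[t]) / len(query_terms)
    return {"bm25": bm25, "query_coverage": coverage,
            "doc_len": len(doc_tokens)}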
Abstract: In the top-k threshold estimation problem, given a query q, the goal is to estimate the score of the result at rank k. A good estimate of this score can result in significant performance improvements for several query processing scenarios, including selective search, index tiering, and widely used disjunctive query processing algorithms such as MaxScore, WAND, and BMW. Several approaches have been proposed, including parametric approaches, methods using random sampling, and a recent approach based on machine learning. However, previous work fails to perform any experimental comparison between these approaches. In this paper, we address this issue by reimplementing four major approaches and comparing them in terms of estimation error, running time, likelihood of an overestimate, and end-to-end performance when applied to common classes of disjunctive top-k query processing algorithms.
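As one concrete member of the compared families, the sampling approach can be stated in a few lines (a sketch: score the query over a uniform document sample and map rank k down into the sample):

def sampled_threshold(sample_scores, k, sample_rate):
    # Rank k in the full collection corresponds to roughly rank
    # k * sample_rate within a uniform sample, so the score at that
    # sample rank estimates the true top-k threshold.
    j = max(1, round(k * sample_rate))
    ranked = sorted(sample_scores, reverse=True)
    return ranked[j - 1] if j <= len(ranked) else 0.0

Underestimates merely cost some pruning opportunity, while overestimates can cause safe algorithms to miss true results, which is why the likelihood of an overestimate is one of the comparison criteria above.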
Abstract: There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions around index structures, building wrappers that allow one system to directly read the indexes of another. The second involves sharing indexes across systems via a data exchange specification that we have developed, called the Common Index File Format (CIFF). We demonstrate the first approach with the Java systems Anserini and Terrier, and the second approach with Anserini, JASSv2, OldDog, PISA, and Terrier. Together, these systems provide a wide range of implementations and features, with different research goals. Overall, we recommend CIFF as a low-effort approach to support independent innovation while enabling the types of fair evaluations that are critical for driving the field forward.
Abstract: An inverted index is the basic data structure used in most current large-scale information retrieval systems. It can be modeled as a collection of sorted sequences of integers. Many compression techniques for inverted indexes have been studied in the past, with some of them reaching tremendous decompression speeds through the use of SIMD instructions available on modern CPUs. While there has been some work on query processing algorithms for Graphics Processing Units (GPUs), little of it has focused on how to efficiently access compressed index structures, and we see some potential for significant improvements in decompression speed. In this paper, we describe and implement two encoding schemes for index decompression on GPU architectures. Their formats and decoding algorithms are adapted from existing CPU-based compression methods to exploit the execution model and memory hierarchy offered by GPUs. We show that our solutions, GPU-BP and GPU-VByte, achieve significant speedups over their already carefully optimized CPU counterparts.
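For reference, a scalar VByte decoder under one common convention (7 payload bits per byte, high bit marking the final byte of each integer); the GPU variants keep the format idea but restructure the layout so that many integers decode in parallel:

def vbyte_decode(data):
    # Accumulate 7 bits per byte; a set high bit terminates the integer.
    out, cur, shift = [], 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            out.append(cur)
            cur, shift = 0, 0
        else:
            shift += 7
    return out

print(vbyte_decode(bytes([0x85, 0x01, 0x81])))  # [5, 129]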
Abstract: Over the past few decades, the IR community has been making a continuous effort to improve the efficiency of search in large collections of documents. Query processing is still one of the main bottlenecks in large-scale search systems. The top-k document retrieval problem, which can be defined as reporting the k most relevant documents from a collection for a given query, can be extremely expensive, as it involves scoring large amounts of documents. In this work, we investigate the top-k document retrieval problem from several angles with the aim of improving the efficiency of this task in large-scale search systems. Finally, we briefly describe our initial findings and conclude by proposing future directions to follow.
Abstract: Performant Indexes and Search for Academia (PISA) is an experimental search engine that focuses on efficient implementations of state-of-the-art representations and algorithms for text retrieval. In this work, we outline our effort in creating a replicable search run from PISA for the 2019 Open Source Information Retrieval Replicability Challenge, which encourages the information retrieval community to produce replicable systems through the use of a containerized, Docker-based infrastructure. We also discuss the origins, current functionality, and future direction and challenges for the PISA system.
Abstract: In the last two decades, the IR community has seen numerous advances in top-k query processing and inverted index compression techniques. While newly proposed techniques are typically evaluated against a few baselines, these evaluations are often very limited, and we feel that there is no clear overall picture of the best choices of algorithms and compression methods. In this paper, we attempt to address this issue by evaluating a number of state-of-the-art index compression methods and safe disjunctive DAAT query processing algorithms. Our goal is to understand how much index compression performance impacts overall query processing speeds, how the choice of query processing algorithm depends on the compression method used, and how performance is impacted by document reordering techniques and the number of results returned, keeping in mind that current search engines typically use sets of hundreds or thousands of candidates for further reranking.
Abstract: Document reordering is an important but often overlooked preprocessing stage in index construction. Reordering document identifiers in graphs and inverted indexes has been shown to reduce storage costs and improve processing efficiency in the resulting indexes. However, surprisingly few document reordering algorithms are publicly available despite their importance. A new reordering algorithm derived from recursive graph bisection was recently proposed by Dhulipala et al., and shown to be highly effective and efficient when compared against other state-of-the-art reordering strategies. In this work, we present a reproducibility study of this new algorithm. We describe both the implementation challenges faced, as well as the performance characteristics of our clean-room reimplementation. We show that we are able to successfully reproduce the core results of the original paper, and show that the algorithm generalizes to other collections and indexing frameworks. Furthermore, we make our implementation publicly available to help promote further research in this space.
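A toy illustration of the objective such reordering minimizes (not the bisection algorithm itself): clustering similar documents under nearby identifiers shrinks the delta-gaps that the index actually encodes.

import math

def gap_cost(posting_lists):
    # Total log2(gap) bits needed to delta-encode sorted docid lists --
    # the quantity recursive graph bisection drives down.
    cost = 0.0
    for docids in posting_lists:
        prev = 0
        for d in sorted(docids):
            cost += math.log2(d - prev)
            prev = d
    return cost

scattered = [[1, 500, 999], [2, 501, 998]]  # similar docs far apart
clustered = [[1, 2, 3], [4, 5, 6]]          # the same lists after reordering
print(gap_cost(scattered), gap_cost(clustered))  # ~36.9 vs 2.0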
Abstract: One of the major problems for modern search engines is to keep up with the tremendous growth in the size of the web and the number of queries submitted by users. The amount of data being generated today can only be processed and managed with specialized technologies. BlockMaxWAND and the more recent Variable BlockMaxWAND represent the most advanced query processing algorithms that make use of dynamic pruning techniques, which allow them to retrieve the top k most relevant documents for a given query without any degradation in ranking effectiveness. In this paper, we describe a new technique for the BlockMaxWAND family of query processing algorithms, which improves block skipping in order to increase its efficiency. We show that our optimization is able to improve query processing speed on short queries by up to 37% with negligible additional space overhead.
Abstract: Identifying the stance of a news article body with respect to a certain headline is the first step to automated fake news detection. In this paper, we introduce a 2-stage ensemble model to solve the stance detection task. By using only hand-crafted features as input to a gradient boosting classifier, we are able to achieve a score of 9161.5 out of 11651.25 (78.63%) on the official Fake News Challenge (Stage 1) dataset. We identify the most useful features for detecting fake news and discuss how sampling techniques can be used to improve recall accuracy on a highly imbalanced dataset.
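A minimal sketch of the pipeline shape (hand-crafted overlap features feeding a gradient-boosted classifier); the toy features and data below are placeholders, not the paper's feature set:

from sklearn.ensemble import GradientBoostingClassifier

def overlap_features(headline, body):
    h, b = set(headline.lower().split()), set(body.lower().split())
    return [len(h & b) / max(len(h), 1),      # headline-term overlap
            len(h & b) / max(len(h | b), 1),  # Jaccard similarity
            len(body.split())]                # body length

train_pairs = [("markets fall", "markets fell sharply across the board"),
               ("aliens land in ohio", "officials confirmed the story is false"),
               ("new vaccine approved", "regulators approved the new vaccine today"),
               ("team wins title", "the match ended in a surprise defeat")]
train_labels = ["agree", "disagree", "agree", "disagree"]
X = [overlap_features(h, b) for h, b in train_pairs]
clf = GradientBoostingClassifier().fit(X, train_labels)
print(clf.predict([overlap_features("markets fall", "stocks fell again")]))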
Abstract: Query processing is one of the main bottlenecks in large-scale search engines. Retrieving the top k most relevant documents for a given query can be extremely expensive, as it involves scoring large amounts of documents. Several dynamic pruning techniques have been introduced in the literature to tackle this problem, such as BlockMaxWAND, which splits the inverted index into constant-sized blocks and stores the maximum document-term scores per block; this information can be used during query execution to safely skip low-score documents, producing many-fold speedups over exhaustive methods. We introduce a refinement for BlockMaxWAND that uses variable-sized blocks, rather than constant-sized ones. We set up the problem of deciding the block partitioning as an optimization problem which maximizes how accurately the block upper bounds represent the underlying scores, and describe an efficient algorithm to find an approximate solution, with provable approximation guarantees. Through an extensive experimental analysis we show that our method significantly outperforms the state of the art by roughly a factor of 2×. We also introduce a compressed data structure to represent the additional block information, providing a compression ratio of roughly 50%, while incurring only a small speed degradation, no more than 10% with respect to its uncompressed counterpart.
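In sketch form, the partitioning trades block-bound tightness against per-block space; the greedy sweep below only illustrates the objective (the paper gives an approximation algorithm with provable guarantees):

def partition_blocks(scores, space_penalty):
    # Block cost = sum of (block_max - score) over its members, i.e. how
    # loosely the stored upper bound fits the underlying scores; opening
    # a new block is worthwhile once that slack exceeds 'space_penalty',
    # the assumed charge for the extra block metadata.
    blocks, current = [], []
    for s in scores:
        trial = current + [s]
        slack = sum(max(trial) - x for x in trial)
        if current and slack > space_penalty:
            blocks.append(current)
            current = [s]
        else:
            current = trial
    if current:
        blocks.append(current)
    return blocks

print(partition_blocks([9, 8, 9, 1, 2, 1, 7, 8], space_penalty=4))
# [[9, 8, 9], [1, 2, 1], [7, 8]] -- block sizes adapt to the score shape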