arXiv:2007.07846v1 [cs.IR] 14 Jul 2020

Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset

Edwin Zhang,1 Nikhil Gupta,1 Raphael Tang,1 Xiao Han,1 Ronak Pradeep,1 Kuang Lu,2 Yue Zhang,2 Rodrigo Nogueira,1 Kyunghyun Cho,3,4 Hui Fang,2 and Jimmy Lin1

1 University of Waterloo  2 University of Delaware  3 New York University  4 CIFAR Associate Fellow

Abstract

We present Covidex, a search engine that exploits the latest neural ranking models to provide information access to the COVID-19 Open Research Dataset curated by the Allen Institute for AI. Our system has been online and serving users since late March 2020. The Covidex is the user application component of our three-pronged strategy to develop technologies for helping domain experts tackle the ongoing global pandemic. In addition, we provide robust and easy-to-use keyword search infrastructure that exploits mature fusion-based methods as well as standalone neural ranking models that can be incorporated into other applications. These techniques have been evaluated in the ongoing TREC-COVID challenge: Our infrastructure and baselines have been adopted by many participants, including some of the highest-scoring runs in rounds 1, 2, and 3. In round 3, we report the highest-scoring run that takes advantage of previous training data and the second-highest fully automatic run.

1 Introduction

As a response to the worldwide COVID-19 pandemic, on March 13, 2020, the Allen Institute for AI (AI2) released the COVID-19 Open Research Dataset (CORD-19).1 With regular updates since the initial release (first weekly, then daily), the corpus contains around 188,000 scientific articles (as of July 12, 2020), most with full text, about COVID-19 and coronavirus-related research more broadly (for example, SARS and MERS). These articles are gathered from a variety of sources, including PubMed, a curated list of articles from the WHO, as well as preprints from arXiv, bioRxiv, and medRxiv. The goal of the effort is “to mobilize researchers to apply recent advances

1 www.semanticscholar.org/cord19

in natural language processing to generate new insights in support of the fight against this infectious disease.” We responded to this call to arms.

As motivation, we believe that information access capabilities (search, question answering, etc.) can be applied to provide users with high-quality information from the scientific literature, to inform evidence-based decision making and to support insight generation. Examples include public health officials assessing the efficacy of wearing face masks, clinicians conducting meta-analyses to update care guidelines based on emerging studies, and virologists probing the genetic structure of COVID-19 in search of vaccines. We hope to contribute to these efforts via a three-pronged strategy:

1. Despite significant advances in the application of neural architectures to text ranking, keyword search (e.g., with “bag of words” queries) remains an important core technology. Building on top of our Anserini IR toolkit (Yang et al., 2018), we have released robust and easy-to-use open-source keyword search infrastructure that the broader community can build on.

2. Leveraging our own infrastructure, we explored the use of sequence-to-sequence transformer models for text ranking, combined with a simple classification-based feedback approach to exploit existing relevance judgments. We have also open sourced all these models, which can be integrated into other systems.

3. Finally, we package the previous two components into Covidex, an end-to-end search engine and browsing interface deployed at covidex.ai, initially described in Zhang et al. (2020a).

All three efforts have been successful. In the ongoing TREC-COVID challenge, our infrastructure and baselines have been adopted by many teams, which in some cases have submitted runs that scored higher than our own submissions. This illustrates the success of our infrastructure-building efforts (1). In the latest round 3 results, we report the highest-scoring run that exploits relevance judgments in a user feedback setting and the second-highest fully automatic run, affirming the quality of our own ranking models (2). Finally, usage statistics offer some evidence for the success of our deployed Covidex search engine (3).

2 Ranking Components

Multi-stage search architectures represent the most common design for modern search engines, with work in academia dating back over a decade (Matveeva et al., 2006; Wang et al., 2011; Asadi and Lin, 2013). Known production deployments of this design include the Bing web search engine (Pedersen, 2010) as well as Alibaba’s e-commerce search engine (Liu et al., 2017).

The idea behind multi-stage ranking is straightforward: instead of a monolithic ranker, ranking is decomposed into a series of stages. Typically, the pipeline begins with an initial retrieval stage, most often using bag-of-words queries against an inverted index. One or more subsequent stages rerank and refine the candidate set successively until the final results are presented to the user. The multi-stage design provides a clean interface between keyword search, neural reranking models, and the user application.

This section details individual components in our architecture. We describe later how these building blocks are assembled in the deployed system (Section 3) and for TREC-COVID (Section 4.2).

2.1 Keyword Search

In our design, initial retrieval is performed by the Anserini IR toolkit (Yang et al., 2017, 2018),2 which we have been developing for several years and which powers a number of our previous systems that incorporate various neural architectures (Yang et al., 2019; Yilmaz et al., 2019). Anserini represents an effort to better align real-world search applications with academic information retrieval research: under the covers, it builds on the popular and widely-deployed open-source Lucene search library, on top of which we provide a number of missing features for conducting research on modern IR test collections.

2 anserini.io

Anserini provides an abstraction for document collections, and comes with a variety of adaptors for different corpora and formats: web pages in WARC containers, XML documents in tarballs, JSON objects in text files, etc. Providing keyword search capabilities over CORD-19 required only writing an adaptor for the corpus that allows Anserini to ingest the documents.

An issue that immediately arose with CORD-19 concerns the granularity of indexing, i.e., what should we consider to be a “document” as the “atomic unit” of indexing and retrieval? One complication is that the corpus contains a mix of articles that vary widely in length, not only in terms of natural variations (scientific articles of varying lengths, book chapters, etc.), but also because the full text is not available for some articles. It is well known in the IR literature, dating back several decades (e.g., Singhal et al. 1996), that length normalization plays an important role in retrieval effectiveness.

Guided by previous work on searching full-text articles (Lin, 2009), we explored three separate indexing schemes:

• An index comprised of only titles and abstracts.

• An index comprised of each full-text article as a single, individual document; articles without full text contained only titles and abstracts.

• A paragraph-level index structured as follows: each full-text article is segmented into paragraphs and for each paragraph, we created a “document” comprising the title, abstract, and that paragraph. The title and abstract alone comprised an additional “document”. Thus, a full-text article with n paragraphs yields n + 1 separate retrieval units in the index.

To be consistent with standard IR parlance, we call each of these retrieval units a document, in a generic sense, despite their composite structure. Following best practice, documents are ranked using BM25 (Robertson et al., 1994). The relative effectiveness of each indexing scheme, however, is an empirical question.
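To make the paragraph-level scheme concrete, the following sketch (with an illustrative record structure, not the exact CORD-19 schema) shows how a full-text article with n paragraphs yields n + 1 retrieval units:

```python
# Sketch of the paragraph-level indexing scheme: an article with n full-text
# paragraphs yields n + 1 retrieval units. The record structure and field
# names are illustrative, not the exact CORD-19 schema.

def paragraph_docs(doc_id, title, abstract, paragraphs):
    """Yield (docid, text) pairs for one article."""
    # The title and abstract alone form one retrieval unit.
    yield doc_id, f"{title}\n{abstract}"
    # Each paragraph is paired with the title and abstract to form another unit.
    for i, paragraph in enumerate(paragraphs):
        yield f"{doc_id}.{i:05d}", f"{title}\n{abstract}\n{paragraph}"

# A full-text article with 3 paragraphs produces 4 retrieval units.
units = list(paragraph_docs("xqhn0vbp", "A title", "An abstract ...",
                            ["Para 1 ...", "Para 2 ...", "Para 3 ..."]))
assert len(units) == 4
```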

With the paragraph index, a query is likely to retrieve multiple paragraphs from the same underlying article; since the final task is to rank articles, we take the highest-scoring paragraph across all retrieved results to produce a final ranking. Furthermore, we can combine these multiple representations to capture different ranking signals using fusion techniques, which further improves effectiveness; see Section 4.2 for details.
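A minimal sketch of this max-passage aggregation, assuming paragraph identifiers of the form article_id.paragraph_number as in the previous sketch:

```python
# Sketch of collapsing paragraph-level hits into an article ranking by keeping
# the highest-scoring paragraph per article. Assumes paragraph docids of the
# form <article_id>.<paragraph_number>, as in the previous sketch.

def max_passage(hits):
    """hits: list of (docid, score) pairs from the paragraph index."""
    best = {}
    for docid, score in hits:
        article_id = docid.split(".")[0]  # strip the paragraph suffix, if any
        if article_id not in best or score > best[article_id]:
            best[article_id] = score
    # Rank articles by their best paragraph score.
    return sorted(best.items(), key=lambda x: x[1], reverse=True)

ranking = max_passage([("a1.00000", 7.2), ("a1.00003", 9.1), ("a2", 8.4)])
# -> [("a1", 9.1), ("a2", 8.4)]
```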

Since Anserini is built on top of Lucene, which is implemented in Java, it is designed to run on the Java Virtual Machine (JVM). However, TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019), the two most popular neural network toolkits today, use Python as their main language. More broadly, with its diverse and mature ecosystem, Python has emerged as the language of choice for most data scientists today. Anticipating this gap, we have been working on Pyserini,3 Python bindings for Anserini, since late 2019 (Yilmaz et al., 2020). Pyserini is released as a well-documented, easy-to-use Python module distributed via PyPI and easily installable via pip.4

3 pyserini.io
4 pypi.org/project/pyserini/

Putting everything together, we provide the community keyword search infrastructure by sharing code, indexes, as well as baseline runs. First, all our code is available open source. Second, we share regularly updated pre-built versions of CORD-19 indexes, so that users can replicate our results with minimal effort. Finally, we provide baseline runs for TREC-COVID that can be directly incorporated into other participants’ submissions.
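As an illustration of how the shared infrastructure can be used, the sketch below runs a BM25 query over a pre-built index with Pyserini; the index path is a placeholder, and the import path and class name follow Pyserini releases from roughly this period (they may differ in later releases):

```python
# Sketch of BM25 keyword search over one of the pre-built CORD-19 indexes with
# Pyserini (pip install pyserini). The index path is a placeholder; the class
# name reflects Pyserini releases from around this time.
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher("indexes/cord19-paragraph")  # placeholder path
hits = searcher.search("serological tests that detect antibodies of COVID-19", k=10)

for rank, hit in enumerate(hits, start=1):
    print(f"{rank:2d} {hit.docid:20s} {hit.score:.4f}")
```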

2.2 Rerankers

In our infrastructure, the output of Pyserini is fed to rerankers that aim to improve ranking quality. We describe three different approaches: two are based on neural architectures, and the third exploits relevance judgments in a feedback setting using a classification approach.

monoT5. Despite the success of BERT for document ranking (Dai and Callan, 2019; MacAvaney et al., 2019; Yilmaz et al., 2019), there is evidence that ranking with sequence-to-sequence models can achieve even better effectiveness, particularly in zero-shot and other settings with limited training data (Nogueira et al., 2020), such as for TREC-COVID. Our “base” reranker, called monoT5, is based on T5 (Raffel et al., 2019).

Given a query q and a set of candidate documents D from Pyserini, for each d ∈ D we construct the following input sequence to feed into our model:

Query: q Document: d Relevant: (1)

The model is fine-tuned to produce either “true” or “false” depending on whether the document is relevant or not to the query. That is, “true” and “false” are the ground truth predictions in the sequence-to-sequence task, what we call the “target words”.

At inference time, to compute probabilities for each query–document pair, we apply softmax only to the logits of the “true” and “false” tokens. We rerank the candidate documents according to the probabilities assigned to the “true” token. See Nogueira et al. (2020) for additional details about this logit normalization trick and the effects of different target words.
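The sketch below illustrates this scoring trick with the Hugging Face transformers library; the checkpoint name is an assumption (one of our released monoT5 models on the model hub), and PyGaggle packages equivalent logic in a ready-made reranker:

```python
# Sketch of the monoT5 scoring trick: feed the "Query: ... Document: ...
# Relevant:" sequence, take the logits of the first generated token, and apply
# a softmax restricted to the "true"/"false" target tokens.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained(
    "castorini/monot5-base-msmarco").eval()  # assumed checkpoint name

TRUE_ID = tokenizer.encode("true")[0]    # id of the "true" target token
FALSE_ID = tokenizer.encode("false")[0]  # id of the "false" target token

def monot5_score(query: str, document: str) -> float:
    """Return P("true"), i.e., the estimated relevance probability."""
    inputs = tokenizer(f"Query: {query} Document: {document} Relevant:",
                       return_tensors="pt", truncation=True, max_length=512)
    # One decoding step from the decoder start token gives the logits of the
    # first output position.
    decoder_input_ids = torch.full(
        (1, 1), model.config.decoder_start_token_id, dtype=torch.long)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, 0]
    # Softmax over just the two target-word logits.
    probs = torch.softmax(logits[[TRUE_ID, FALSE_ID]], dim=0)
    return probs[0].item()
```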

Since in the beginning we did not have training data specific to COVID-19, we fine-tuned our model on the MS MARCO passage dataset (Nguyen et al., 2016), which comprises 8.8M passages obtained from the top 10 results retrieved by the Bing search engine (based on around 1M queries). The training set contains approximately 500k pairs of query and relevant documents, where each query has one relevant passage on average; non-relevant documents for training are also provided as part of the training data. Nogueira et al. (2020) and Yilmaz et al. (2019) have both previously demonstrated that models trained on MS MARCO can be directly applied to other document ranking tasks.

We fine-tuned our monoT5 model with a constant learning rate of 10^-3 for 10k iterations with class-balanced batches of size 128. We used a maximum of 512 input tokens and one output token (i.e., either “true” or “false”, as described above). In the MS MARCO passage dataset, none of the inputs required truncation when using this length limit. Training variants based on T5-base and T5-3B took approximately 4 and 40 hours, respectively, on a single Google TPU v3-8.

At inference time, since output from Pyserini is usually longer than the length restrictions of the model, it is not possible to feed the entire text into our model at once. To address this issue, we first segment each document into spans by applying a sliding window of 10 sentences with a stride of 5. We obtain a probability of relevance for each span by performing inference on it independently, and then select the highest probability among the spans as the relevance score of the document.
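A minimal sketch of this segmentation and max-span aggregation, with a naive sentence splitter as a placeholder and the pointwise scorer passed in as a function (e.g., the monoT5 scorer sketched above):

```python
# Sketch of inference-time segmentation: a sliding window of 10 sentences with
# a stride of 5; each span is scored independently and the document receives
# the maximum span score. score_fn is any pointwise scorer.

def score_long_document(query, text, score_fn, window=10, stride=5):
    sentences = [s.strip() for s in text.split(".") if s.strip()]  # naive splitter
    if not sentences:
        return 0.0
    spans = []
    for start in range(0, len(sentences), stride):
        spans.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # the last window already covers the end of the document
    return max(score_fn(query, span) for span in spans)
```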

duoT5. A pairwise reranker estimates the probability p_{i,j} that candidate d_i is more relevant than d_j for query q, where i ≠ j. Nogueira et al. (2019) demonstrated that a pairwise BERT reranker running on the output of a pointwise BERT reranker yields statistically significant improvements in ranking metrics. We applied the same intuition to T5 in a pairwise reranker called duoT5, which takes as input the sequence:

Query: q Document0: d_i Document1: d_j Relevant:

where d_i and d_j are unique pairs of candidates from the set D. The model is fine-tuned to predict “true” if candidate d_i is more relevant than d_j to query q and “false” otherwise. We fine-tuned duoT5 using the same hyperparameters as monoT5.

At inference time, we use the top 50 highest-scoring documents according to monoT5 as our candidates {d_i}. We then obtain probabilities p_{i,j} of d_i being more relevant than d_j for all unique candidate pairs {d_i, d_j}, i ≠ j. Finally, we compute a single score s_i for candidate d_i as follows:

s_i = \sum_{j \in J_i} (p_{i,j} + (1 - p_{j,i}))    (2)

where J_i = {0 ≤ j < 50, j ≠ i}.
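A minimal sketch of the aggregation in Equation (2), operating on a dictionary of pairwise probabilities:

```python
# Sketch of the aggregation in Equation (2): given pairwise probabilities
# p[(i, j)] that candidate d_i is more relevant than d_j, each candidate's
# score sums p_{i,j} and (1 - p_{j,i}) over all other candidates j.

def duo_aggregate(p):
    """p: dict mapping (i, j), i != j, to the probability that d_i beats d_j."""
    indices = {i for i, _ in p} | {j for _, j in p}
    scores = {i: sum(p[(i, j)] + (1.0 - p[(j, i)]) for j in indices if j != i)
              for i in indices}
    # Final ranking: candidates sorted by aggregated score s_i.
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

ranking = duo_aggregate({(0, 1): 0.9, (1, 0): 0.2, (0, 2): 0.6,
                         (2, 0): 0.3, (1, 2): 0.4, (2, 1): 0.5})
```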

Based on exploratory studies on the MS MARCO passage dataset, this setting leads to the most stable and effective rankings.

Relevance Feedback. The setup of TREC-COVID (see Section 4.1) provides a feedback setting where systems can exploit a limited number of relevance judgments on a per-query basis. How do we take advantage of such training data? Despite work on fine-tuning transformers in a few-shot setting (Zhang et al., 2020b; Lee et al., 2020), we were wary of the dangers of overfitting on limited data, particularly since there is little guidance on relevance feedback using transformers in the literature. Instead, we implemented a robust approach that treats relevance feedback as a document classification problem using simple linear classifiers, described in Yu et al. (2019) and Lin (2019).

The approach is conceptually simple: for each query, we train a linear classifier (logistic regression) that attempts to distinguish relevant from non-relevant documents for that query. The classifier operates on sparse bag-of-words representations using tf–idf term weighting. At inference time, each candidate document is fed to the classifier, and the classifier score is then linearly interpolated with the original candidate document score to produce a final score. We describe the input source documents in Section 4.2.
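A minimal sketch of this feedback classifier using scikit-learn; score normalization details are omitted, and the 0.5 mixing weight matches the setting reported in Section 4.2:

```python
# Sketch of the per-topic feedback classifier: fit logistic regression on
# tf-idf representations of the judged documents, then interpolate classifier
# probabilities with the original retrieval scores (any normalization of the
# retrieval scores is omitted here).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def feedback_rerank(judged_texts, judged_labels, cand_texts, cand_scores, alpha=0.5):
    """judged_labels: 1 = relevant, 0 = non-relevant; cand_scores: original scores."""
    vectorizer = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(judged_texts), judged_labels)

    probs = clf.predict_proba(vectorizer.transform(cand_texts))[:, 1]
    # Linear interpolation of classifier score and original retrieval score.
    return [alpha * p + (1 - alpha) * s for p, s in zip(probs, cand_scores)]
```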

All components above have also been open sourced. The two neural reranking modules are available in PyGaggle,5 which is our recently developed neural ranking library designed to work with Pyserini. Our classification-based approach to feedback is implemented in Pyserini directly. These components are available for integration into any system.

5 pygaggle.ai

Figure 1: Screenshot of the Covidex.

3 The Covidex

Beyond sharing our keyword search infrastructure and reranking models, we’ve built the Covidex as an operational search engine to demonstrate our capabilities to domain experts who are not interested in individual components. As deployed, we use the paragraph index and monoT5-base as the reranker. An additional highlighting module based on BioBERT is described in Zhang et al. (2020a). To decrease end-to-end latency, we rerank only the top 96 documents per query and truncate reranker input to a maximum of 256 tokens.

The Covidex is built using the FastAPI Python web framework, where all incoming API requests are handled by a service that performs searching, reranking, and text highlighting. Search is performed with Pyserini (Section 2.1), and the results are then reranked with PyGaggle (Section 2.2). The frontend (which is also open source) is built with React to support the use of modular, declarative JavaScript components,6 taking advantage of its vast ecosystem.

6 reactjs.org
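A simplified sketch of such a backend endpoint is shown below; the Pyserini and PyGaggle class names are illustrative of their APIs around this time and may differ across versions, the index path is a placeholder, and the highlighting step is omitted:

```python
# Simplified sketch of the backend: a FastAPI endpoint that retrieves
# candidates with Pyserini and reranks them with a PyGaggle monoT5 reranker.
from fastapi import FastAPI
from pyserini.search import SimpleSearcher
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

app = FastAPI()
searcher = SimpleSearcher("indexes/cord19-paragraph")  # placeholder path
reranker = MonoT5()  # loads a default monoT5 checkpoint

@app.get("/api/search")
def search(query: str, k: int = 96):
    hits = searcher.search(query, k=k)
    texts = [Text(searcher.doc(hit.docid).raw(), {"docid": hit.docid}, 0)
             for hit in hits]
    reranked = reranker.rerank(Query(query), texts)
    reranked.sort(key=lambda t: t.score, reverse=True)
    return [{"docid": t.metadata["docid"], "score": t.score} for t in reranked]
```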

A screenshot of our system is shown in Figure 1. Covidex provides standard search capabilities, either based on keyword queries or natural-language input. Users can click “Show more” to reveal the abstract as well as excerpts from the full text, where potentially relevant passages are highlighted. Clicking on the title brings the user to the article’s source on the publisher’s site. In addition, we have implemented a faceted browsing feature. From CORD-19, we were able to easily expose facets corresponding to dates, authors, journals, and sources. Navigating by year, for example, allows a user to focus on older coronavirus research (e.g., on SARS) or the latest research on COVID-19, and a combination of the journal and source facets allows a user to differentiate between preprints and the peer-reviewed literature, and between venues with different reputations.

The system is currently deployed across a small cluster of servers, each with two NVIDIA V100 GPUs, as our pipeline requires neural network inference at query time. Each server runs the complete software stack in a simple replicated setup (no partitioning). On top of this, we leverage Cloudflare as a simple load balancer, which uses a round-robin scheme to dispatch requests across the different servers. The end-to-end latency for a typical query is around two seconds.

The first implementation of our system was deployed in late March, and we have been incrementally adding features since. Based on Cloudflare statistics, our site receives around two hundred unique visitors per day and serves more than one thousand requests each day. Of course, usage statistics were (up to several times) higher when we first launched due to publicity on social media. However, the figures cited above represent a “steady state” that has held up over the past few months, in the absence of any deliberate promotion.

4 TREC-COVID

Reliable, large-scale evaluations of text retrieval methods are a costly endeavour, typically beyond the resources of individual research groups. Fortunately, the community-wide TREC-COVID challenge sponsored by the U.S. National Institute of Standards and Technology (NIST) provides a forum for evaluating our techniques.

4.1 Evaluation Overview

The TREC-COVID challenge, which began in mid-April and is still ongoing, provides an opportunity for researchers to study methods for quickly standing up information access systems, both in response to the current pandemic and to prepare for similar future events.

Both out of logistic necessity in evaluation design and because the body of scientific literature is rapidly expanding, TREC-COVID is organized into a series of “rounds”, each of which uses the CORD-19 collection at a snapshot in time. For a particular round, participating teams develop systems that return results for a number of information needs, called “topics”, for example, “serological tests that detect antibodies of COVID-19”. These results comprise a run or a submission. NIST then gathers, organizes, and evaluates these runs using a standard pooling methodology (Voorhees, 2002).

The product of each round is a collection of relevance judgments, which are annotations by domain experts about the relevance of documents with respect to topics. On average, there are around 300 judgments (both positive and negative) per topic from each round. These relevance judgments are used to evaluate the effectiveness of systems (populating a leaderboard) and can also be used to train machine-learning models in future rounds. Runs that take advantage of these relevance judgments are known as “feedback runs”, in contrast to “automatic” runs that do not. A third category, “manual” runs, can involve human input, but we did not submit any such runs.

Currently, TREC-COVID has completed round 3 and is in the middle of round 4. We present evaluation results from rounds 1, 2, and 3, since results from round 4 are not yet available. Each round contains a number of topics that are persistent (i.e., carried over from previous rounds) as well as new topics. To avoid retrieving duplicate documents, the evaluation adopts a residual collection methodology, where judged documents (either relevant or not) from previous rounds are automatically removed from consideration. Thus, for each topic, future rounds only evaluate documents that have not been examined before (either newly published articles or articles that were never retrieved). Note that due to the evaluation methodology, scores across rounds are not comparable.
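A minimal sketch of applying this residual-collection rule to a run before evaluation (data structures hypothetical):

```python
# Sketch of the residual-collection rule: any document already judged for a
# topic in previous rounds (relevant or not) is removed from that topic's
# ranked list. Data structures here are hypothetical.

def residual_run(run, judged):
    """run: {topic: [docid, ...]}; judged: {topic: set of judged docids}."""
    return {topic: [d for d in docids if d not in judged.get(topic, set())]
            for topic, docids in run.items()}
```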

4.2 Results

A selection of results from TREC-COVID is shown in Table 1, where we report standard metrics computed by NIST. We submitted runs under team “covidex” (for neural models) and team “anserini” (for our bag-of-words baselines).

Team              Run                           Type       nDCG@10  P@5     mAP

Round 1: 30 topics
sabir             sabir.meta.docs               automatic  0.6080   0.7800  0.3128
GUIR_S2           run2†                         automatic  0.6032   0.6867  0.2601
covidex           T5R1 (= monoT5)               automatic  0.5223   0.6467  0.2838

Round 2: 35 topics
mpiid5            mpiid5_run3†                  manual     0.6893   0.8514  0.3380
CMT               SparseDenseSciBert†           feedback   0.6772   0.7600  0.3115
GUIR_S2           GUIR_S2_run1†                 automatic  0.6251   0.7486  0.2842
covidex           covidex.t5 (= monoT5)         automatic  0.6250   0.7314  0.2880
anserini          r2.fusion2                    automatic  0.5553   0.6800  0.2725
anserini          r2.fusion1                    automatic  0.4827   0.6114  0.2418

Round 3: 40 topics
covidex           r3.t5_lr                      feedback   0.7740   0.8600  0.3333
BioinformaticsUA  BioInfo-run1                  feedback   0.7715   0.8650  0.3188
SFDC              SFDC-fus12-enc23-tf3†         automatic  0.6867   0.7800  0.3160
covidex           r3.duot5 (= monoT5 + duoT5)   automatic  0.6626   0.7700  0.2676
covidex           r3.monot5 (= monoT5)          automatic  0.6596   0.7800  0.2635
anserini          r3.fusion2                    automatic  0.6100   0.7150  0.2641
anserini          r3.fusion1                    automatic  0.5359   0.6100  0.2293

Table 1: Selected TREC-COVID results. Our submissions are under teams “covidex” and “anserini”. All runs notated with † incorporate our infrastructure components in some way.

In Round 1, there were 143 runs from 56 teams. Our best run T5R1 used BM25 for first-stage retrieval using the paragraph index followed by our monoT5-3B reranker, trained on MS MARCO (as described in Section 2.2). The best automatic neural run was run2 from team GUIR_S2 (MacAvaney et al., 2020), which was built on Anserini. This run placed second behind the best automatic run, sabir.meta.docs, which interestingly was based on the vector-space model.

While we did make meaningful infrastructure contributions (e.g., Anserini provided the keyword search results that fed the neural ranking models of team GUIR_S2), our own run T5R1 was substantially behind the top-scoring runs. A post-hoc experiment with round 1 relevance judgments showed that using the paragraph index did not turn out to be the best choice: simply replacing it with the abstract index (but retaining the monoT5-3B reranker) improved nDCG@10 from 0.5223 to 0.5702.7

7 Despite this finding, we suspect that there may be evaluation artifacts at play here, because our impressions from the deployed system suggest that results from the paragraph index are better. Thus, the deployed Covidex still uses paragraph indexes.

We learned two important lessons from the results of round 1:

1. The effectiveness of simple rank fusion techniques that can exploit diverse ranking signals by combining multiple ranked lists. Many teams adopted such techniques (including the top-scoring run), which proved both robust and effective. This is not a new observation in information retrieval, but is once again affirmed by TREC-COVID.

2. The importance of building the “right” query representations for keyword search. Each TREC-COVID topic contains three fields: query, question, and narrative. The query field describes the information need using a few keywords, similar to what a user would type into a web search engine. The question field phrases the information need as a well-formed natural language question, and the narrative field contains additional details in a short paragraph. The query field may be missing important keywords, but the other two fields often contain too many “noisy” terms unrelated to the information need.

Thus, it makes sense to leverage information from multiple fields in constructing keyword queries, but to do so selectively. Based on results from round 1, the following query generation technique proved to be effective: when constructing the keyword query for a given topic, we take the non-stopwords from the query field and further expand them with terms belonging to named entities extracted from the question field using ScispaCy (Neumann et al., 2019).
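A minimal sketch of this query generation step; the ScispaCy model name is one of its released biomedical models, and the stopword list is a small placeholder:

```python
# Sketch of the query generation technique: keep the non-stopwords from the
# topic's query field and append terms from named entities that ScispaCy
# extracts from the question field.
import spacy

nlp = spacy.load("en_core_sci_sm")  # ScispaCy biomedical model
STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "to", "for",
             "what", "which", "how", "are", "is", "do", "does"}

def generate_query(query_field: str, question_field: str) -> str:
    terms = [t for t in query_field.lower().split() if t not in STOPWORDS]
    entities = [ent.text.lower() for ent in nlp(question_field).ents]
    return " ".join(terms + entities)

print(generate_query("coronavirus response to weather changes",
                     "How does the coronavirus respond to changes in the weather?"))
```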

We saw these two lessons as an opportunity to further contribute community infrastructure, and starting in round 2 we made two fusion runs from Anserini freely available: fusion1 and fusion2.


In both runs, we combined rankings from the abstract, full-text, and paragraph indexes via reciprocal rank fusion (RRF) (Cormack et al., 2009). The runs differed in their treatment of the query representation. The run fusion1 simply took the query field from the topics as the basis for keyword search, while run fusion2 incorporated the query generator described above to augment the query representation with key phrases. These runs were made available before the deadline so that other teams could use them, and indeed many took advantage of them.
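A minimal sketch of RRF over ranked document lists, using k = 60 as in Cormack et al. (2009):

```python
# Sketch of reciprocal rank fusion over ranked lists from the abstract,
# full-text, and paragraph indexes: each document accumulates 1 / (k + rank)
# from every list that retrieves it.

def reciprocal_rank_fusion(ranked_lists, k=60):
    """ranked_lists: iterable of docid lists, best first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, docid in enumerate(ranking, start=1):
            scores[docid] = scores.get(docid, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

fused = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d1", "d4"]])
```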

In Round 2, there were 136 runs from 51 teams. Our two Anserini baseline fusion runs are shown as r2.fusion1 and r2.fusion2 in Table 1. Comparing these two fusion baselines, we see that our query generation approach yields a large gain in effectiveness. Ablation studies further confirmed that ranking signals from the different indexes do contribute to the overall higher effectiveness of the rank fusion runs. That is, the effectiveness of the fusion results is higher than results from any of the individual indexes.

Our covidex.t5 run takes r2.fusion1 and r2.fusion2, reranks both with monoT5-3B, and then combines (with RRF) the outputs of both. The monoT5-3B model was fine-tuned on MS MARCO, then fine-tuned (again) on a medical subset of MS MARCO (MacAvaney et al., 2020). This run essentially tied with the best automatic run, GUIR_S2_run1, which scored just 0.0001 higher.

As additional context, Table 1 shows the best “manual” and “feedback” runs from round 2 (mpiid5_run3 and SparseDenseSciBert, respectively), which were also the top two runs overall. These results show that manual and feedback techniques can achieve quite a bit of gain over fully automatic techniques. Both of these runs and four out of the five top teams in round 2 took advantage of the fusion baselines we provided, which demonstrates our impact not only in developing effective ranking models, but also in serving the community by providing infrastructure.

In Round 3, there were 79 runs from 31 teams. Our Anserini fusion baselines, r3.fusion1 and r3.fusion2, remained the same from the previous round and continued to provide strong baselines.

Our run r3.duot5 represents the first deployment of our monoT5 and duoT5 multi-stage reranking pipeline (see Section 2.2), which uses a fusion of the fusion runs as the first-stage candidates, reranked by monoT5 and then duoT5. From Table 1, we see that duoT5 does indeed improve over just using monoT5 (run r3.monot5), albeit the gains are small (but we found that the duoT5 run has more unjudged documents). The r3.duot5 run ranks second among all teams under the “automatic” condition, about two points behind team SFDC. However, according to Esteva et al. (2020), their general approach incorporates Anserini fusion runs, which bolsters our case that we are providing valuable infrastructure for the community.

Our own feedback run r3.t5_lr implements the classification-based feedback technique (see Section 2.2) with monoT5 results as the input source documents (with a mixing weight of 0.5 to combine monoT5 scores with classifier scores). This was the highest-scoring run across all submissions (all categories), just a bit ahead of BioInfo-run1.

5 Conclusions

Our project has three goals: build community infrastructure, advance the state of the art in neural ranking, and provide a useful application. We believe that our efforts can contribute to the fight against this global pandemic. Beyond COVID-19, the capabilities we’ve developed can be applied to analyzing the scientific literature more broadly.

6 Acknowledgments

This research was supported in part by the Canada First Research Excellence Fund, the Natural Sciences and Engineering Research Council (NSERC) of Canada, CIFAR AI & COVID-19 Catalyst Funding 2019–2020, and Microsoft AI for Good COVID-19 Grant. We’d like to thank Kyle Lo from AI2 for helpful discussions and Colin Raffel from Google for his assistance with T5.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16), pages 265–283.

Nima Asadi and Jimmy Lin. 2013. Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures. In Proceedings of the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013), pages 997–1000, Dublin, Ireland.

Gordon V. Cormack, Charles L. A. Clarke, and Stefan Büttcher. 2009. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), pages 758–759, Boston, Massachusetts.

Zhuyun Dai and Jamie Callan. 2019. Deeper text understanding for IR with contextual neural language modeling. In Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), pages 985–988, Paris, France.

Andre Esteva, Anuprit Kale, Romain Paulus, Kazuma Hashimoto, Wenpeng Yin, Dragomir Radev, and Richard Socher. 2020. CO-Search: COVID-19 information retrieval with semantic search, question answering, and abstractive summarization. arXiv:2006.09595.

Cheolhyoung Lee, Kyunghyun Cho, and Wanmo Kang. 2020. Mixout: Effective regularization to finetune large-scale pretrained language models. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020).

Jimmy Lin. 2009. Is searching full text more effective than searching abstracts? BMC Bioinformatics, 10:46.

Jimmy Lin. 2019. The simplest thing that can possibly work: Pseudo-relevance feedback using text classification. arXiv:1904.08861.

Shichen Liu, Fei Xiao, Wenwu Ou, and Luo Si. 2017. Cascade ranking for operational e-commerce search. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2017), pages 1557–1565, Halifax, Nova Scotia, Canada.

Sean MacAvaney, Arman Cohan, and Nazli Goharian. 2020. SLEDGE: A simple yet effective baseline for coronavirus scientific knowledge search. arXiv:2005.02365.

Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. 2019. CEDR: Contextualized embeddings for document ranking. In Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), pages 1101–1104, Paris, France.

Irina Matveeva, Chris Burges, Timo Burkard, Andy Laucius, and Leon Wong. 2006. High accuracy retrieval with multiple nested ranker. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), pages 437–444, Seattle, Washington.

Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv:1902.07669.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human-generated machine reading comprehension dataset. arXiv:1611.09268.

Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. 2020. Document ranking with a pretrained sequence-to-sequence model. arXiv:2003.06713.

Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-stage document ranking with BERT. arXiv:1910.14424.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.

Jan Pedersen. 2010. Query understanding at Bing. In Industry Track Keynote at the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2010), Geneva, Switzerland.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683.

Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. 1994. Okapi at TREC-3. In Proceedings of the 3rd Text REtrieval Conference (TREC-3), pages 109–126, Gaithersburg, Maryland.

Amit Singhal, Chris Buckley, and Mandar Mitra. 1996. Pivoted document length normalization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1996), pages 21–29, Zurich, Switzerland.

Ellen M. Voorhees. 2002. The philosophy of information retrieval evaluation. In Evaluation of Cross-Language Information Retrieval Systems: Second Workshop of the Cross-Language Evaluation Forum, Lecture Notes in Computer Science Volume 2406, pages 355–370.

Lidan Wang, Jimmy Lin, and Donald Metzler. 2011. A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), pages 105–114, Beijing, China.


Peilin Yang, Hui Fang, and Jimmy Lin. 2017. Anserini: Enabling the use of Lucene for information retrieval research. In Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), pages 1253–1256, Tokyo, Japan.

Peilin Yang, Hui Fang, and Jimmy Lin. 2018. Anserini: Reproducible ranking baselines using Lucene. Journal of Data and Information Quality, 10(4):Article 16.

Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. End-to-end open-domain question answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 72–77, Minneapolis, Minnesota.

Zeynep Akkalyoncu Yilmaz, Charles L. A. Clarke, and Jimmy Lin. 2020. A lightweight environment for learning experimental IR research practices. In Proceedings of the 43rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020).

Zeynep Akkalyoncu Yilmaz, Wei Yang, Haotian Zhang, and Jimmy Lin. 2019. Cross-domain modeling of sentence-level evidence for document retrieval. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3481–3487, Hong Kong, China.

Ruifan Yu, Yuhao Xie, and Jimmy Lin. 2019. Simple techniques for cross-collection relevance feedback. In Proceedings of the 41st European Conference on Information Retrieval, Part I (ECIR 2019), pages 397–409, Cologne, Germany.

Edwin Zhang, Nikhil Gupta, Rodrigo Nogueira, Kyunghyun Cho, and Jimmy Lin. 2020a. Rapidly deploying a neural search engine for the COVID-19 Open Research Dataset: Preliminary thoughts and lessons learned. arXiv:2004.05125.

Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, and Yoav Artzi. 2020b. Revisiting few-sample BERT fine-tuning. arXiv:2006.05987.