[arXiv] https://arxiv.org/abs/2310.20501

[Abstract] Recently, the emergence of large language models (LLMs) has revolutionized the paradigm of information retrieval (IR) applications, especially in web search. With their remarkable capability to generate human-like text, LLMs have produced an enormous amount of text on the Internet. As a result, IR systems in the LLM era face a new challenge: the indexed documents are now not only written by humans but also automatically generated by LLMs. How these LLM-generated documents influence IR systems is a pressing and still unexplored question. In this work, we conduct a quantitative evaluation of different IR models in scenarios where both human-written and LLM-generated texts are involved. Surprisingly, our findings indicate that neural retrieval models tend to rank LLM-generated documents higher. We refer to this bias of neural retrieval models toward LLM-generated text as source bias. Moreover, we discover that this bias is not confined to first-stage neural retrievers but extends to second-stage neural re-rankers. We then provide an in-depth analysis from the perspective of text compression and observe that neural models can better understand the semantic information of LLM-generated text, which is further substantiated by our theoretical analysis. We also discuss the potential severe concerns stemming from the observed source bias and hope our findings can serve as a critical wake-up call to the IR community and beyond. To facilitate future exploration of IR in the LLM era, our two newly constructed benchmarks and code will be made available at https://github.com/KID-22/LLM4IR-Bias.
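
To make the core measurement concrete, here is a minimal sketch of how one might probe for source bias: score a query against a human-written passage and an LLM-generated counterpart with an off-the-shelf dense retriever and compare the scores. The model name, the toy passages, and the score gap below are illustrative assumptions, not the paper's actual benchmarks or evaluation metric.

```python
# Minimal sketch (assumed setup, not the paper's code): compare how a dense
# retriever scores a human-written passage vs. an LLM-generated counterpart.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "What causes auroras in the night sky?"
human_doc = (
    "Auroras appear when charged particles from the sun collide with gases "
    "in Earth's upper atmosphere, producing shifting curtains of light."
)
llm_doc = (
    "Auroras are produced when solar charged particles interact with atmospheric "
    "gases near the poles, exciting them and causing them to emit colorful light."
)

# Encode and score with cosine similarity, a common relevance proxy for dense retrieval.
q_emb, h_emb, g_emb = model.encode([query, human_doc, llm_doc], convert_to_tensor=True)
human_score = util.cos_sim(q_emb, h_emb).item()
llm_score = util.cos_sim(q_emb, g_emb).item()

# A positive gap means the retriever prefers the LLM-generated version of the
# same content, i.e., the kind of preference the paper calls source bias.
print(f"human: {human_score:.4f}  llm: {llm_score:.4f}  gap: {llm_score - human_score:+.4f}")
```

The paper aggregates this kind of comparison over full benchmarks with standard ranking metrics rather than single passage pairs; the snippet only illustrates the intuition behind the measurement.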

[Main Findings]

https://preview.redd.it/m3l5vvmggpxb1.png?width=893&format=png&auto=webp&s=3140d873d3e7be582ae405cb2adee03d80b16190

https://preview.redd.it/jdebc1rigpxb1.png?width=914&format=png&auto=webp&s=82f725d77010c4e17e0c558d888a6e0c943ae23d

https://preview.redd.it/bgvjv9qjgpxb1.png?width=851&format=png&auto=webp&s=3a7220e892be0cd558fd63a7a9d8e8ba5adb7da4