📌📖 LLM Citations, the right way

With the rise of Large Language Models (LLMs) in Question-Answering (QA) systems tackling high-stakes tasks, conventional citation methods need an upgrade. This post explores how regular expressions (regex) can transform citation, using OpenAI's function-calling capability together with the Pydantic library.


If you just want to look at the code, you can check it out here.

The Existing State of Citation in LLMs

At present, citations in LLMs follow a simple rule: matching the IDs of retrieved text chunks with the generated outputs. This can be done in one of two ways:

  1. Identifying Sources: Information is directly linked to sources. For instance, "The author attended uwaterloo and was the president of the data science club" (Source: 2, 5, 23).
  2. In-text Citations: Source information is embedded within the text. Example: "The author attended uwaterloo[2,4] and was the president of the data science club[23]."
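As a rough sketch of the second format, the `[2,4]`-style markers can be parsed back out with a small regex (the function name and marker format here are illustrative assumptions, not any library's API):

```python
import re


def extract_citations(answer: str) -> list[tuple[str, list[int]]]:
    """Split an answer into (claim, source_ids) pairs based on [1,2]-style markers."""
    pairs = []
    # Non-greedy match: grab the text leading up to each bracketed ID list.
    for match in re.finditer(r"(.+?)\[(\d+(?:\s*,\s*\d+)*)\]", answer):
        claim = match.group(1).strip()
        ids = [int(i) for i in match.group(2).split(",")]
        pairs.append((claim, ids))
    return pairs


answer = ("The author attended uwaterloo[2,4] and was the "
          "president of the data science club[23].")
print(extract_citations(answer))
```

Each claim is now tied to its chunk IDs, but verifying it still means reading the full text of chunks 2, 4, and 23.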

Although this approach creates a direct link between the cited information and its source, verifying a claim still requires scanning the entire source chunk, which makes the process cumbersome.


Citation Challenges: Chunk Size Matters

Relying solely on chunk IDs for citation can lead to problems. The fidelity of a citation depends directly on the chunk size: a large chunk might span an entire page just to support a single fact, while a small chunk might lack the context required to extract the information at all.

As we increase the chunk size to preserve context, we end up with massive chunks, and fact-checking becomes difficult because the cited document is disproportionately larger than the information it supports.

A New Direction: Citations with Quotes

To enhance the reliability of citations, one can use direct quotes from the source text, reducing the chances of misinterpretation. Combining Pydantic models with OpenAI function calls streamlines this process.


To understand this process, let's follow the example available here:

  1. Define question-answer and fact classes, ensuring every fact has linked evidence. When possible, request direct quotes.
  2. Instead of using the actual substring_quote as the citation, apply a regular expression to the chunk string. This introduces flexibility to accommodate string variations. The span of the actual quote in the original context can then be identified.
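The two steps above can be sketched roughly as follows. The `Fact` and `substring_quote` names follow the example linked above, but the exact definitions here are an approximation rather than the repo's code:

```python
import re
from typing import Iterator

from pydantic import BaseModel


class Fact(BaseModel):
    """A generated statement plus the direct quote(s) that support it."""
    fact: str
    substring_quote: list[str]

    def get_spans(self, context: str) -> Iterator[tuple[int, int]]:
        # Rather than requiring an exact substring match, turn each quote
        # into a whitespace-tolerant regex, so minor formatting differences
        # (line breaks, double spaces) in the chunk still match.
        for quote in self.substring_quote:
            pattern = r"\s+".join(re.escape(word) for word in quote.split())
            match = re.search(pattern, context)
            if match:
                yield match.span()


class QuestionAnswer(BaseModel):
    """Schema sent to the OpenAI function call: every fact carries evidence."""
    question: str
    answer: list[Fact]


context = ("The author attended the University of Waterloo,\n"
           "where he was president of the data science club.")
fact = Fact(fact="The author went to uwaterloo.",
            substring_quote=["attended the University of\nWaterloo"])
for start, end in fact.get_spans(context):
    print(context[start:end])  # the exact evidence snippet in the chunk
```

Note how the quote in `substring_quote` contains a line break that the chunk does not; the regex still recovers the correct span.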

For contextual questions, this approach yields Pydantic objects that link every fact to the exact substring of the original chunk. As a result, we can cite specific snippets instead of whole pages, effectively capturing the evidence for each generation.
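With those character spans in hand, surfacing the evidence is straightforward. For example, a small helper (hypothetical, written here for illustration) can highlight the cited snippets inside the original chunk:

```python
def highlight(context: str, spans: list[tuple[int, int]]) -> str:
    """Wrap each cited (start, end) span in ** markers for display."""
    out, last = [], 0
    for start, end in sorted(spans):
        out.append(context[last:start])        # uncited text before the span
        out.append("**" + context[start:end] + "**")  # the cited evidence
        last = end
    out.append(context[last:])                 # remaining uncited text
    return "".join(out)


chunk = "The author attended uwaterloo and led the data science club."
print(highlight(chunk, [(11, 29)]))
```

A reviewer then only needs to read the bolded snippets, not the whole chunk.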


Implications of Quote Citations

This novel approach to citation offers two significant benefits:

  1. Fact-checking becomes easier, as we only need to compare evidence with specific snippets rather than scanning through large chunks of text.
  2. With a dataset of question-citation pairs, we can fine-tune our models to rank chunks by their likelihood of being cited, moving beyond the constraint of raw similarity.

By leveraging regex, we've opened a new chapter in the citation process within LLMs, enhancing the system's reliability and ease-of-use. The combination of OpenAI functionality with Pydantic offers a unique way to handle these structured data challenges.

Embrace this refined approach to citation by checking out the repo, and follow @jxnlco on Twitter for more insights and updates!