📌📖 LLM Citations, the right way
With the rise of Large Language Models (LLMs) in Question-Answering (QA) systems tackling high-stakes tasks, conventional citation methods need an upgrade. This post explores how regular expressions (regex) can transform citation, using OpenAI's function-calling capability together with the Pydantic library.

If you just want to look at the code, you can check it out here
The Existing State of Citation in LLMs
At present, citations in LLMs follow a simple rule: matching the IDs of retrieved text chunks with the generated outputs. This can be done in one of two ways:
- Identifying Sources: Information is directly linked to sources. For instance, "The author attended uwaterloo and was the president of the data science club" (Source: 2, 5, 23).
- In-text Citations: Source information is embedded within the text. Example: "The author attended uwaterloo[2,4] and was the president of the data science club[23]."
Although this approach creates a direct link between the cited information and its source, verifying a claim still requires scanning the entire source text, which makes the process cumbersome.
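
To make the ID matching concrete, here is a minimal sketch of parsing in-text citations with a regular expression; the `extract_citation_ids` helper and the bracketed `[2,4]` format are illustrative assumptions, not part of any particular library:

```python
import re
from typing import List

def extract_citation_ids(answer: str) -> List[int]:
    # Find bracketed groups like "[2,4]" and collect the chunk IDs inside.
    ids: List[int] = []
    for group in re.findall(r"\[([\d,\s]+)\]", answer):
        ids.extend(int(i) for i in group.split(","))
    return ids

answer = "The author attended uwaterloo[2,4] and was the president of the data science club[23]."
print(extract_citation_ids(answer))  # [2, 4, 23]
```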

Citation Challenges: Chunk Size Matters
Relying solely on chunk IDs for citation can lead to problems. The fidelity of a citation depends directly on the chunk size: a large chunk might span an entire page just to support a single fact, while a smaller chunk might lack the context required for information extraction.
If we increase the chunk size to preserve context, we end up with massive chunks, and fact-checking becomes difficult because the cited document is disproportionately large relative to the information it supports.
A New Direction: Citations with Quotes
To make citations more reliable, we can use direct quotes from the source text, reducing the chance of misinterpretation. Combining Pydantic with OpenAI function calls streamlines this process.

To understand this process, let's follow the example available here:
- Define question-answer and fact classes, ensuring every fact has linked evidence. When possible, request direct quotes.
- Instead of using the actual `substring_quote` verbatim as the citation, apply a regular expression to the chunk string. This introduces flexibility to accommodate small string variations, and lets us identify the span of the actual quote in the original context (see the sketch below).
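
A minimal sketch of what these two steps might look like, assuming Pydantic models and the third-party `regex` library (which, unlike the standard `re` module, supports fuzzy matching via the `{e<=n}` syntax). The field names, descriptions, and the `max_errs` tolerance are illustrative choices, not the repo's exact code:

```python
from typing import List, Optional, Tuple

import regex  # third-party library: pip install regex
from pydantic import BaseModel, Field


def fuzzy_span(quote: str, context: str, max_errs: int = 5) -> Optional[Tuple[int, int]]:
    # Search for the quote in the context, allowing up to `max_errs`
    # insertions/deletions/substitutions so slightly varied quotes still
    # match. Returns the (start, end) span of the first acceptable match.
    for errs in range(max_errs + 1):
        match = regex.search(f"({regex.escape(quote)}){{e<={errs}}}", context)
        if match is not None:
            return match.span(1)
    return None


class Fact(BaseModel):
    fact: str = Field(..., description="A single statement made in the answer")
    substring_quote: List[str] = Field(
        ..., description="Direct quotes from the context that support the fact"
    )

    def get_spans(self, context: str) -> List[Tuple[int, int]]:
        # Map every quote back to its character span in the original chunk.
        spans = []
        for quote in self.substring_quote:
            span = fuzzy_span(quote, context)
            if span is not None:
                spans.append(span)
        return spans


class QuestionAnswer(BaseModel):
    question: str = Field(..., description="The question that was asked")
    answer: List[Fact] = Field(
        ..., description="Facts answering the question, each backed by quotes"
    )
```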

When answering questions over retrieved context, this approach yields Pydantic objects that link each fact to the exact substring of the original chunk. As a result, we can cite specific snippets instead of whole pages, effectively capturing the evidence for each generation.
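
For example, once a model response has been parsed into these objects (e.g., via OpenAI function calling), mapping each fact back to its evidence is just a loop over spans; the context and values below are made up for illustration:

```python
context = (
    "The author studied at uwaterloo, where they also served as "
    "president of the data science club."
)

qa = QuestionAnswer(
    question="Where did the author study?",
    answer=[
        Fact(
            fact="The author attended uwaterloo",
            substring_quote=["studied at uwaterloo"],
        )
    ],
)

for fact in qa.answer:
    for start, end in fact.get_spans(context):
        print(f"{fact.fact!r} <- context[{start}:{end}] = {context[start:end]!r}")
```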

Implications of Quote Citations
This novel approach to citation offers two significant benefits:
- Fact-checking becomes easier as we only need to compare evidence with specific snippets, rather than scanning through large chunks of text.
- With a dataset of question-citation pairs, we can fine-tune our models to rank chunks based on their likelihood of being cited, moving beyond the constraint of raw similarity.
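
As a sketch of that second point, every verified span can become a labeled example for a citation-aware ranker; continuing the sketch above, the pair format and the `build_citation_pairs` helper are hypothetical:

```python
from typing import Dict, List

def build_citation_pairs(qa: QuestionAnswer, context: str) -> List[Dict[str, str]]:
    # Each (question, cited snippet) pair becomes a positive example for
    # training a ranker that scores chunks by how likely they are to be cited.
    pairs = []
    for fact in qa.answer:
        for start, end in fact.get_spans(context):
            pairs.append({"question": qa.question, "cited_snippet": context[start:end]})
    return pairs
```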
By leveraging regex, we've opened a new chapter in the citation process within LLMs, enhancing the system's reliability and ease of use. The combination of OpenAI's function calling with Pydantic offers a unique way to handle these structured data challenges.
Embrace this refined approach to citation by checking out the repo, and follow @jxnlco on Twitter for more insights and updates!