A newly developed system assists human fact-checkers in confirming the accuracy of responses produced by large language models (LLMs). By streamlining validation, it helps users spot mistakes in a model's output more quickly, cutting verification time by about 20 percent.
While large language models display remarkable abilities, they are not without flaws. These AI systems sometimes generate false or unsupported information, a phenomenon often referred to as “hallucination.”
Due to this hallucination issue, responses from an LLM typically undergo verification by human fact-checkers, especially in critical fields like health care or finance. Unfortunately, the validation process often involves reviewing lengthy documents referenced by the model, which can be tedious and prone to errors, potentially discouraging users from utilizing generative AI in the first place.
To assist human validators, a team of researchers from MIT has developed an intuitive system named SymGen, allowing users to verify LLM outputs much more efficiently. This tool enables an LLM to produce responses complete with citations that directly indicate the relevant section in a source document, such as a specific cell in a database.
When users hover over highlighted text in the model's response, they can see the data the model drew on for that particular word or phrase. Unhighlighted portions show users which phrases require extra attention to verify.
“We empower users to focus selectively on the text portions that may require more attention. Ultimately, SymGen boosts users’ confidence in a model’s responses by making it easier to confirm the accuracy of the information,” explains Shannon Shen, an electrical engineering and computer science graduate student and co-lead author of the paper on SymGen.
In a user study, Shen and his colleagues observed that SymGen improved the verification speed by approximately 20 percent compared to traditional methods. By streamlining the validation process, SymGen may assist users in detecting inaccuracies in LLM outputs across various real-world applications, including generating clinical documentation and summarizing financial market reports.
Shen is joined in this work by co-lead author and fellow EECS graduate student Lucas Torroba Hennigen; fellow EECS graduate student Aniruddha “Ani” Nrusimha; Bernhard Gapp, president of the Good Data Initiative; and senior authors David Sontag, EECS professor and a member of the MIT Jameel Clinic and the leader of the Clinical Machine Learning Group at the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Yoon Kim, an assistant professor of EECS affiliated with CSAIL. This research was presented at the Conference on Language Modeling.
Symbolic references
Many LLMs are built to produce citations that reference external documents alongside their language-based responses to facilitate validation. However, Shen notes that these verification methods are often implemented without considering the effort needed for users to wade through numerous citations.
“The goal of generative AI is to minimize the time users spend on tasks. If you end up investing hours reviewing documents to ensure the accuracy of the model’s outputs, the practicality of these generative responses diminishes,” Shen comments.
The researchers tackled the validation challenge by focusing on the humans responsible for the task.
A typical SymGen user will first supply the LLM with data for reference, for example, a table containing basketball game statistics. Then, instead of immediately asking the model to generate a summary using the data, the researchers introduce an intermediate step. They prompt the model to create its response in a symbolic manner.
When given this prompt, the model cites data by naming the specific cell of the data table that supports each part of its response. For example, if the model wants to mention “Portland Trailblazers,” it would write the name of the cell that contains those words rather than the words themselves.
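As a concrete illustration, here is a minimal sketch, in Python, of what such a symbolic intermediate response might look like. The table contents, the cell names, and the curly-brace placeholder syntax are hypothetical stand-ins chosen for this example; they are not SymGen's actual format.

```python
# A hypothetical source table for one basketball game, keyed by cell name.
# The cell names, values, and placeholder syntax are illustrative only.
game_data = {
    "home_team": "Portland Trailblazers",
    "away_team": "Utah Jazz",
    "home_points": "115",
    "away_points": "107",
}

# In the intermediate symbolic step, the model refers to table cells by name
# instead of writing the values out directly.
symbolic_response = "The {home_team} beat the {away_team} {home_points}-{away_points}."
```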
“This intermediate symbolic representation allows for precise references. We can pinpoint exactly where every segment of text in the output corresponds to in the data,” explains Torroba Hennigen.
SymGen then resolves each reference using a rule-based method that extracts the corresponding text from the data table and includes it in the model’s response.
“With this approach, we ensure that the citations are verbatim copies, significantly reducing the chance of inaccuracies in the sections corresponding to actual data,” Shen adds.
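A rule-based resolution step along these lines could look like the sketch below, which reuses the hypothetical `game_data` table and placeholder syntax from the earlier example. It is a simplified stand-in for SymGen's actual implementation: each placeholder is replaced with a verbatim copy of the cell's text, and the character span of each copied piece is recorded so an interface could highlight it and link it back to its source cell.

```python
import re
from typing import Dict, List, Tuple

def resolve_references(symbolic_response: str,
                       table: Dict[str, str]) -> Tuple[str, List[Tuple[str, int, int]]]:
    """Replace {cell_name} placeholders with verbatim text from the table.

    Returns the resolved response plus (cell_name, start, end) spans that a
    front end could highlight and link back to the source data.
    """
    resolved_parts: List[str] = []
    spans: List[Tuple[str, int, int]] = []
    cursor = 0    # position in the symbolic string
    out_len = 0   # length of the resolved string built so far

    for match in re.finditer(r"\{([^{}]+)\}", symbolic_response):
        # Copy the literal text between placeholders unchanged.
        literal = symbolic_response[cursor:match.start()]
        resolved_parts.append(literal)
        out_len += len(literal)

        # Substitute the placeholder with the exact cell contents.
        cell_name = match.group(1)
        cell_text = table[cell_name]
        spans.append((cell_name, out_len, out_len + len(cell_text)))
        resolved_parts.append(cell_text)
        out_len += len(cell_text)

        cursor = match.end()

    resolved_parts.append(symbolic_response[cursor:])
    return "".join(resolved_parts), spans

resolved, spans = resolve_references(symbolic_response, game_data)
# resolved -> "The Portland Trailblazers beat the Utah Jazz 115-107."
# spans marks exactly which characters were copied verbatim from which cell.
```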
Streamlining validation
The model is able to produce these symbolic responses because of how it has been trained. Large language models are trained on vast amounts of internet data, where some information is represented in a “placeholder format” with codes standing in for specific values.
When SymGen prompts the model for a symbolic response, it mirrors this structured format.
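For illustration, a prompt in this spirit might resemble the hypothetical template below; the wording, the way the table is serialized, and the instruction to emit curly-brace placeholders are assumptions made for this sketch, not SymGen's actual prompt.

```python
# Hypothetical prompt template asking the LLM to answer with cell-name
# placeholders instead of literal values; not SymGen's actual prompt.
PROMPT_TEMPLATE = """You are given a data table whose cells are named.

Table:
{table_rows}

Write a short summary of the game. Whenever you state a value that appears
in the table, do not write the value itself; instead write the cell's name
in curly braces, e.g. {{home_team}}.
"""

def build_prompt(table: dict) -> str:
    # Serialize the table as "cell_name: value" lines for the prompt.
    rows = "\n".join(f"{name}: {value}" for name, value in table.items())
    return PROMPT_TEMPLATE.format(table_rows=rows)
```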
“We’re intentionally designing the prompt to harness the LLM’s strengths,” says Shen.
In user testing, most participants indicated that SymGen simplified the process of verifying LLM-generated text. They managed to validate the model’s responses around 20 percent faster than with conventional verification methods.
However, SymGen is only as reliable as its source data: the LLM could cite an incorrect variable, and a human verifier might not notice the error.
Additionally, users must provide source data in an organized format, such as a table, as SymGen currently functions only with tabular data.
Looking ahead, the researchers are working to enhance SymGen to accommodate various text and data types. This upgrade could enable it to assist in validating sections of AI-generated legal document summaries, for instance. They also plan to evaluate SymGen with medical professionals to explore how it might uncover errors in AI-generated clinical summaries.