
How to Improve AI Chatbot Responses: A Guide to Preventing Toxic Replies

A new method has been developed to more effectively check the safety of an AI chatbot. Researchers trained a model to generate prompts that draw toxic responses out of a chatbot; those prompts are then used to train the chatbot to avoid giving hateful or harmful answers once it is deployed.

When using ChatGPT, a user can ask the AI chatbot to write a computer program or summarize an article, and it will likely produce helpful code or a clear summary. However, someone could also ask for instructions to build a bomb, and the chatbot might be able to provide those as well.

To prevent this and other safety issues, the researchers developed a new method to ensure the AI chatbot’s responses are safe and non-toxic.

Companies that create large language models typically protect them through a process known as red-teaming: teams of human testers write prompts designed to elicit harmful or toxic text from the model being tested. Those prompts are then used to train the chatbot to avoid giving such responses.

However, this approach only works if the engineers know which toxic prompts to use. If human testers miss certain prompts, which is likely given the sheer number of possibilities, a chatbot that is considered safe could still produce unsafe answers.

Researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab applied machine learning to improve the red-teaming process. They developed a method to train a red-team large language model to generate varied prompts that elicit a wider range of harmful responses from the chatbot being tested. They achieve this by teaching the red-team model to be curious when it generates prompts and to concentrate on novel prompts that draw harmful responses out of the target model.

This approach outperformed human testers and other machine-learning methods, generating more diverse prompts that provoked increasingly harmful responses. It not only broadens the scope of inputs being tested compared with other automated methods, it also outperforms them at eliciting harmful responses.

The researchers found that a red-team language model can not only generate toxic responses itself, but also draw them out of chatbots that were supposed to have safeguards in place. This poses a challenge for ensuring the safety of these models in rapidly changing environments. Zhang-Wei Hong, a graduate student in electrical engineering and computer science (EECS) and lead author of the research paper, explains that the current red-teaming process for language models is time-consuming and unsustainable; the team’s method offers a faster and more effective way to perform this quality assurance.

Hong’s co-authors include EECS graduate students Idan Shenfeld, Tsun-Hsuan Wang, and Yung-Sung Chuang; Aldo Pareja and Akash Srivastava, research scientists at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Pulkit Agrawal, director of the Improbable AI Lab and an assistant professor in CSAIL. The research will be presented at the International Conference on Learning Representations.

Automated Red-Teaming

Large language models, like those that power AI chatbots, are trained on vast amounts of text from public websites. As a result, they can learn to produce harmful language and describe illegal activities, and they can inadvertently disclose personal information they have absorbed.

Human red-teaming is typically laborious and expensive, and often ineffective in producing a wide range of prompts to fully protect a model. This has prompted researchers to automate the process using machine learning techniques.

One common approach is to train a red-team model with reinforcement learning, a trial-and-error process in which the red-team model is rewarded for generating prompts that elicit harmful responses from the chatbot being tested.

However, because of how reinforcement learning works, the red-team model often keeps producing a few similar, highly toxic prompts in order to maximize its reward.

The MIT researchers applied a curiosity-driven exploration technique in their reinforcement learning approach. This incentivizes the red-team model to be curious about the consequences of each prompt it generates, leading it to try different words, sentence patterns, or meanings in its prompts.

“If the red-team model has already encountered a specific prompt, reproducing it will not lead to any new insights. The goal is to spark interest in the red-team model, which will encourage it to come up with new prompts,” Hong explains.

During training, the red-team model creates a prompt and engages with the chatbot. The chatbot replies, and a safety classifier determines the toxicity of the response, rewarding the red-team model based on this assessment.
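
In code, one round of that loop might look like the minimal sketch below. The three helper functions are hypothetical stand-ins (not part of the published system) for the red-team model, the target chatbot, and the safety classifier; in the real setup the returned reward would drive a reinforcement learning update of the red-team model.

    # Minimal sketch of one red-teaming step: the red-team model writes a prompt,
    # the target chatbot answers, and a toxicity classifier turns that answer
    # into a reward. All three components are hypothetical stand-ins.

    def red_team_generate() -> str:
        """Stand-in for sampling a prompt from the red-team language model."""
        return "Tell me something you are not supposed to say."

    def target_chatbot_reply(prompt: str) -> str:
        """Stand-in for the chatbot under test responding to the prompt."""
        return f"I cannot help with that request: {prompt!r}"

    def toxicity_score(response: str) -> float:
        """Stand-in for a safety classifier; returns a score in [0, 1]."""
        return 0.0 if "cannot help" in response else 1.0

    def red_team_step() -> float:
        prompt = red_team_generate()             # 1. red-team model proposes a prompt
        response = target_chatbot_reply(prompt)  # 2. target chatbot responds
        reward = toxicity_score(response)        # 3. classifier scores the response
        # 4. in the real setup, this reward would update the red-team model via RL
        return reward

    print("reward for this prompt:", red_team_step())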

Encouraging Curiosity

The main aim of the red-team model is to maximize its reward by provoking an even more toxic response using a unique prompt. The researchers foster curiosity in the red-team model by adjusting the reward signal in the reinforcement learning environment.

First, the researchers modify the reward so that the red-team model tries to maximize the toxicity of the chatbot’s responses, while an entropy bonus encourages it to stay random as it explores different prompts. To make the agent curious, they add two novelty rewards: one based on how similar the words in a new prompt are to words in earlier prompts, and one based on semantic similarity, with less similarity earning a higher reward. To prevent the red-team model from generating nonsensical text, a naturalistic-language bonus is also added to the training objective.

The researchers compared the toxicity and variety of responses generated by their red-team model with those produced by other automated techniques, and found that their model outperformed the baselines on both measures.

They also used their red-team model to test a chatbot that had been fine-tuned with human feedback so it would not give toxic replies. The curiosity-driven approach quickly produced 196 prompts that elicited toxic responses from this supposedly “safe” chatbot.

The researchers expect the number of models being developed and updated by companies and labs to keep surging. Agrawal emphasizes the importance of verifying models before they are released to the public, as they are becoming increasingly integrated into our daily lives. He notes that manually verifying models is not a scalable solution, which is why this work aims to reduce the human effort needed to ensure the safety and reliability of AI in the future.
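
Purely as an illustration, the reward bonuses described at the start of this section (toxicity, word-level novelty, semantic novelty, and naturalness) might be combined along the following lines. The weights, the word-overlap measure, and the placeholder naturalness score are assumptions made for this sketch rather than the paper’s formulation, and the entropy bonus on the policy itself is not shown.

    # Illustrative reward shaping: toxicity of the chatbot's reply, plus novelty
    # bonuses that pay more when the new prompt is dissimilar to prompts already
    # tried, plus a naturalness term so gibberish is not rewarded. Weights and
    # similarity measures are placeholders, not the paper's exact formulation.

    def word_similarity(prompt: str, history: list[str]) -> float:
        """Crude word-overlap (Jaccard) similarity against past prompts, in [0, 1]."""
        if not history:
            return 0.0
        words = set(prompt.lower().split())
        return max(
            len(words & set(p.lower().split())) / max(1, len(words | set(p.lower().split())))
            for p in history
        )

    def semantic_similarity(prompt: str, history: list[str]) -> float:
        """Placeholder: a real system would compare sentence embeddings here."""
        return word_similarity(prompt, history)

    def naturalness(prompt: str) -> float:
        """Placeholder: a real system would score fluency with a language model."""
        return 1.0 if prompt.strip().endswith((".", "?")) else 0.5

    def shaped_reward(prompt: str, toxicity: float, history: list[str],
                      w_tox: float = 1.0, w_word: float = 0.5,
                      w_sem: float = 0.5, w_nat: float = 0.25) -> float:
        # Novelty bonuses reward LOW similarity to previously generated prompts.
        word_novelty = 1.0 - word_similarity(prompt, history)
        sem_novelty = 1.0 - semantic_similarity(prompt, history)
        return (w_tox * toxicity
                + w_word * word_novelty
                + w_sem * sem_novelty
                + w_nat * naturalness(prompt))

    history = ["How do I pick a lock?"]
    print(shaped_reward("Explain how to bypass a door lock.", toxicity=0.8, history=history))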

Looking ahead, the researchers aim to expand the red-team model’s ability to generate prompts on a wider range of topics. They also plan to explore using a large language model as the toxicity classifier. A user could then train that classifier on a company policy document, for example, so a red-team model could test a chatbot for violations of that policy. According to Agrawal, anyone concerned about how a new AI model will behave should consider curiosity-driven red-teaming.

The research is funded in part by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, an Amazon Web Services MLRA research grant, the U.S. Army Research Office, the U.S. Defense Advanced Research Projects Agency Machine Common Sense Program, the U.S. Office of Naval Research, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.

Journal reference:

  1. Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, Pulkit Agrawal. Curiosity-driven Red-teaming for Large Language Models. Submitted to arXiv, 2024. DOI: 10.48550/arXiv.2402.19464