Improving the factual accuracy of language models through web browsing

We've fine-tuned GPT-3 to more accurately answer open-ended questions using a text-based web browser. Our prototype copies how humans research answers to questions online – it submits search queries, follows links, and scrolls up and down web pages. It is trained to cite its sources, which makes it easier to give feedback to improve factual accuracy. We're excited about developing more truthful AI, but challenges remain, such as coping with unfamiliar types of questions.

Read paperBrowse samples

Language models like GPT-3 are useful for many different tasks, but have a tendency to "hallucinate" information when performing tasks requiring obscure real-world knowledge. To address this, we taught GPT-3 to use a text-based web-browser. The model is provided with an open-ended question and a summary of the browser state, and must issue commands such as "Search …", "Find in page: …" or "Quote: …". In this way, the model collects passages from web pages, and then uses these to compose an answer.

The model is fine-tuned from GPT-3 using the same general methods we've used previously. We begin by training the model to copy human demonstrations, which gives it the ability to use the text-based browser to answer questions. Then we improve the helpfulness and accuracy of the model's answers, by training a reward model to predict human preferences, and optimizing against it using either reinforcement learning or rejection sampling.

Cherry-picked samples from our best-performing model (175B with best-of-64 against a reward model).

Explore more samples

ELI5 results

Our system is trained to answer questions from ELI5, a dataset of open-ended questions scraped from the "Explain Like I'm Five" subreddit. We trained three different models, corresponding to three different inference-time compute budgets. Our best-performing model produces answers that are preferred 56% of the time to answers written by our human demonstrators, with a similar level of factual accuracy. Even though these were the same kind of demonstrations used to train the model, we were able to outperform them by using human feedback to improve the model's answers.

Results of human evaluations on the ELI5 test set, comparing our model with human demonstrators. The amount of rejection sampling (the n in best-of-n) was chosen to be compute-efficient. Error bars show ±1 standard error.

TruthfulQA results

For questions taken from the training distribution, our best model's answers are about as factually accurate as those written by our human demonstrators, on average. However, out-of-distribution robustness is a challenge. To probe this, we evaluated our models on TruthfulQA, an adversarially-constructed dataset of short-form questions designed to test whether models fall prey to things like common misconceptions. Answers are scored on both truthfulness and informativeness, which trade off against one another (for example, "I have no comment" is considered truthful but not informative).

Our models outperform GPT-3 on TruthfulQA and exhibit more favourable scaling properties. However, our models lag behind human performance, partly because they sometimes quote from unreliable sources (as shown in the question about ghosts above). We hope to reduce the frequency of these failures using techniques like adversarial training.

TruthfulQA results. For GPT-3, we used the prompts and automated metric from the TruthfulQA paper. For the web-browsing model, we truncated the long-form answers and used human evaluation, since the answers are out-of-distribution for the automated metric. Error bars show ±1 standard error.

Evaluating factual accuracy

In order to provide feedback to improve factual accuracy, humans must be able to evaluate the factual accuracy of claims produced by models. This can be extremely challenging, since claims can be technical, subjective or vague. For this reason, we require the model to cite its sources. This allows humans to evaluate factual accuracy by checking whether a claim is supported by a reliable source. As well as making the task more manageable, it also makes it less ambiguous, which is important for reducing label noise.

However, this approach raises a number of questions. What makes a source reliable? What claims are obvious enough to not require support? What trade-off should be made between evaluations of factual accuracy and other criteria such as coherence? All of these were difficult judgment calls. We do not think that our model picked up on much of this nuance, since it still makes basic errors. But we expect these kinds of decisions to become more important as AI systems improve, and cross-disciplinary research is needed to develop criteria that are both practical and epistemically sound. We also expect further considerations such as transparency to be important.

Eventually, having models cite their sources will not be enough to evaluate factual accuracy. A sufficiently capable model would cherry-pick sources it expects humans to find convincing, even if they do not reflect a fair assessment of the evidence. There are already signs of this happening (see the questions about boats above). We hope to mitigate this using methods like debate.

Risks of deployment and training

Although our model is generally more truthful than GPT-3 (in that it generates false statements less frequently), it still poses risks. Answers with citations are often perceived as having an air of authority, which can obscure the fact that our model still makes basic errors. The model also tends to reinforce the existing beliefs of users. We are researching how best to address these and other concerns.

In addition to these deployment risks, our approach introduces new risks at train time by giving the model access to the web. Our browsing environment does not allow full web access, but allows the model to send queries to the Microsoft Bing Web Search API and follow links that already exist on the web, which can have side-effects. From our experience with GPT-3, the model does not appear to be anywhere near capable enough to dangerously exploit these side-effects. However, these risks increase with model capability, and we are working on establishing internal safeguards against them.

Conclusion

Human feedback and tools such as web browsers offer a promising path towards robustly truthful, general-purpose AI systems. Our current system struggles with challenging or unfamiliar circumstances, but still represents significant progress in this direction.

If you'd like to help us build more helpful and truthful AI systems, we're hiring!


References
  1. O. Evans, O. Cotton-Barratt, L. Finnveden, A. Bales, A. Balwit, P. Wills, L. Righetti, and W. Saunders. Truthful AI: Developing and governing AI that does not lie. arXiv preprint arXiv:2110.06674, 2021.
  2. J. Maynez, S. Narayan, B. Bohnet, and R. McDonald. On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661, 2020.
  3. K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567, 2021.
  4. A. Fan, Y. Jernite, E. Perez, D. Grangier, J. Weston, and M. Auli. ELI5: Long form question answering. arXiv preprint arXiv:1907.09190, 2019.
  5. S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
  6. D. Metzler, Y. Tay, D. Bahri, and M. Najork. Rethinking search: Making experts out of dilettantes. arXiv preprint arXiv:2105.02274, 2021.


Acknowledgments

Thanks to our paper co-authors: Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Roger Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight and Benjamin Chess.

Thanks to those who helped with and provided feedback on this release: Steven Adler, Sam Altman, Beth Barnes, Miles Brundage, Kevin Button, Steve Dowling, Alper Ercetin, Matthew Knight, Gretchen Krueger, Ryan Lowe, Andrew Mayne, Bob McGrew, Mira Murati, Richard Ngo, Jared Salzano, Natalie Summers and Hannah Wong.

Thanks to the team at Surge AI for helping us with data collection, and to all of our contractors for providing demonstrations and comparisons, without which this project would not have been possible.

source https://openai.com/blog/improving-factual-accuracy/

Customizing GPT-3 for Your Application

Developers can now fine-tune GPT-3 on their own data, creating a custom version tailored to their application. Customizing makes GPT-3 reliable for a wider variety of use cases and makes running the model cheaper and faster.

You can use an existing dataset of virtually any shape and size, or incrementally add data based on user feedback. With fine-tuning, one API customer was able to increase correct outputs from 83% to 95%. By adding new data from their product each week, another reduced error rates by 50%.

To get started, just run a single command in the OpenAI command line tool with a file you provide. Your custom version will start training and then be available immediately in our API.

Read documentation


Last year we trained GPT-3 and made it available in our API. With only a few examples, GPT-3 can perform a wide variety of natural language tasks, a concept called few-shot learning or prompt design. Customizing GPT-3 can yield even better results because you can provide many more examples than what’s possible with prompt design.

You can customize GPT-3 for your application with one command and use it immediately in our API:

openai api fine_tunes.create -t <train_file>

It takes less than 100 examples to start seeing the benefits of fine-tuning GPT-3 and performance continues to improve as you add more data. In research published last June, we showed how fine-tuning with less than 100 examples can improve GPT-3’s performance on certain tasks. We’ve also found that each doubling of the number of examples tends to improve quality linearly.

With one of our most challenging research datasets, Grade School Math problems, fine-tuning GPT-3 improves accuracy by 2 to 4x over what’s possible with prompt design.

Two sizes of GPT-3 models, Curie and Davinci, were fine-tuned on 8,000 examples from one of our most challenging research datasets, Grade School Math problems. We compare the models’ ability to solve problems when 10 completions are created.

Customizing GPT-3 improves the reliability of output, offering more consistent results that you can count on for production use-cases. One customer found that customizing GPT-3 reduced the frequency of unreliable outputs from 17% to 5%. Since custom versions of GPT-3 are tailored to your application, the prompt can be much shorter, reducing costs and improving latency.

Whether text generation, summarization, classification, or any other natural language task GPT-3 is capable of performing, customizing GPT-3 will improve performance.

Apps Powered by Customized Versions of GPT-3

Keeper Tax helps independent contractors and freelancers with their taxes. After a customer links their financial accounts, Keeper Tax uses various models to extract text and classify transactions. Using the classified data, Keeper Tax identifies easy-to-miss tax write-offs and helps customers file their taxes directly from the app. By customizing GPT-3, Keeper Tax is able to continuously improve results. Once a week, Keeper Tax adds around 500 new training examples to fine-tune their model, which is leading to about a 1% accuracy improvement each week, increasing accuracy from 85% to 93%.

Viable helps companies get insights from their customer feedback. By customizing GPT-3, Viable is able to transform massive amounts of unstructured data into readable natural language reports, highlighting top customer complaints, compliments, requests, and questions. Customizing GPT-3 has increased the reliability of Viable’s reports. By using a customized version of GPT-3, accuracy in summarizing customer feedback has improved from 66% to 90%. The result is tangible, intuitive information that customers need to inform their product decisions.

Sana Labs is a global leader in the development and application of AI to learning. The Sana learning platform powers personalized learning experiences for businesses by leveraging the latest ML breakthroughs to tailor the content for each individual. By customizing GPT-3 with their data, Sana’s question and content generation went from grammatically correct but general responses to highly accurate outputs. This yielded a 60% improvement, enabling fundamentally more personalized and effective experiences for their learners.

Elicit is an AI research assistant that helps people directly answer research questions using findings from academic papers. The tool finds the most relevant abstracts from a large corpus of research papers, then applies a customized version of GPT-3 to generate the claim (if any) that the paper makes about the question. A custom version of GPT-3 outperformed prompt design across three important measures: results were easier to understand (a 24% improvement), more accurate (a 17% improvement), and better overall (a 33% improvement).

All API customers can customize GPT-3 today. Sign-up and get started with the fine-tuning documentation.

How to customize GPT-3 for your application


Set up

  • Install the openai python-based client from your terminal:pip install --upgrade openai
  • Set your API key as an environment variable:export OPENAI_API_KEY=<api_key>

Train a custom model

  • Fine-tune the Ada model on a demo dataset for translating help messages from Spanish to English.
    openai api fine_tunes.create -m ada –n_epochs 2 \
    -t https://ift.tt/3E7mPZT


    (Ctrl-C will interrupt the stream, but not cancel the fine-tune)
    [2021-12-08 12:11:30] Created fine-tune: ft-gK9R3N3lDQYQJD0SXqlF8Fnc
    [2021-12-08 12:11:40] Fine-tune costs $0.01
    [2021-12-08 12:11:40] Fine-tune enqueued. Queue number: 0
    [2021-12-08 12:11:45] Fine-tune started
    [2021-12-08 12:12:58] Completed epoch 1/2
    [2021-12-08 12:13:56] Completed epoch 2/2
    [2021-12-08 12:14:26] Uploaded model: ada:ft-org-2021-12-08-20-14-25
    [2021-12-08 12:14:29] Uploaded result file: file-QvY81nzrOhXMenjMS5OlPeBW
    [2021-12-08 12:14:30] Fine-tune succeeded
    Job complete! Status: succeeded 🎉
    Try out your fine-tuned model:
    openai api completions.create -m ada:ft-org-2021-12-08-20-14-25 -p <YOUR_PROMPT>

Use the custom model

  • Ask your customized model for a translation.
    openai api completions.create -m <model_ID> \
    –max-tokens 30 –temperature 0 –stop "###" \
    -p $'Conecte la PS3 y vaya a Configuración>Configuraciones de Red, seleccione la red y escriba sus credenciales.\nEnglish translation:'


    Conecte la PS3 y vaya a Configuración>Configuraciones de Red, seleccione la red y escriba sus credenciales.\nEnglish translation: Connect the PS3 and go to Settings> Accounts Settings, select the network and write your credentials.%

source https://openai.com/blog/customized-gpt3/