How to Transcribe and Caption 10X Faster with AI from Verbit

Want to automatically transcribe your media content and make it searchable to increase site traffic?

Verbit is an AI-powered transcription and captioning platform that can help you do just that.

The company’s AI produces captions and transcripts of live and recorded content to make media more accessible to consumers and boost SEO.

We spoke with Michal Roche, Director of Marketing at Verbit, to learn more about this AI-powered marketing solution.


Aligning Language Models to Follow Instructions

We’ve trained language models that are much better at following user intentions than GPT-3 while also making them more truthful and less toxic, using techniques developed through our alignment research. These InstructGPT models, which are trained with humans in the loop, are now deployed as the default language models on our API.

Read Paper | View Model Card
InstructGPT is better than GPT-3 at following English instructions. Like GPT-3, it can respond to tasks defined implicitly via a prompt, without an explicit instruction, but it can give wrong or misleading outputs when the instruction assumes a premise that is not true. When given a sensitive prompt or instruction, InstructGPT is less likely to produce biased or toxic outputs than GPT-3. At the same time, because it is trained to follow instructions, it can be susceptible to misuse.

GPT-3 models aren’t trained to follow user instructions. Our InstructGPT models (highlighted) generate much more helpful outputs in response to user instructions.

The OpenAI API is powered by GPT-3 language models which can be coaxed to perform natural language tasks using carefully engineered text prompts. But these models can also generate outputs that are untruthful, toxic, or reflect harmful sentiments. This is in part because GPT-3 is trained to predict the next word on a large dataset of Internet text, rather than to safely perform the language task that the user wants. In other words, these models aren’t aligned with their users.

To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API,[1] our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.

The resulting InstructGPT models are much better at following instructions than GPT-3. They also make up facts less often, and show small decreases in toxic output generation. Our labelers prefer outputs from our 1.3B InstructGPT model over outputs from a 175B GPT-3 model, even though the InstructGPT model has more than 100x fewer parameters. At the same time, we show that we don’t have to compromise on GPT-3’s capabilities, as measured by our model’s performance on academic NLP evaluations.

These InstructGPT models, which have been in beta on the API for more than a year, are now the default language models accessible on our API.[2] We believe that fine-tuning language models with humans in the loop is a powerful tool for improving their safety and reliability, and we will continue to push in this direction.

This is the first time our alignment research, which we’ve been pursuing for several years, has been applied to our product. Our work is also related to recent research that fine-tunes language models to follow instructions using academic NLP datasets, notably FLAN and T0. A key motivation for our work is to increase helpfulness and truthfulness while mitigating the harms and biases of language models. Some of our previous research in this direction found that we can reduce harmful outputs by fine-tuning on a small curated dataset of human demonstrations. Other research has focused on filtering the pre-training dataset, safety-specific control tokens, or steering model generations. We are exploring these ideas and others in our ongoing alignment research.

Results

We first evaluate how well outputs from InstructGPT follow user instructions, by having labelers compare its outputs to those from GPT-3. We find that InstructGPT models are significantly preferred on prompts submitted to both the InstructGPT and GPT-3 models on the API. This holds true when we add a prefix to the GPT-3 prompt so that it enters an “instruction-following mode.”

Quality ratings of model outputs on a 1–7 scale (y-axis), for various model sizes (x-axis), on prompts submitted to InstructGPT models on our API. InstructGPT outputs are given much higher scores by our labelers than outputs from GPT-3 with a few-shot prompt and without, as well as models fine-tuned with supervised learning. We find similar results for prompts submitted to GPT-3 models on the API.

To measure the safety of our models, we primarily use a suite of existing metrics on publicly available datasets. Compared to GPT-3, InstructGPT produces fewer imitative falsehoods (according to TruthfulQA) and is less toxic (according to RealToxicityPrompts). We also conduct human evaluations on our API prompt distribution, and find that InstructGPT makes up facts ("hallucinates") less often, and generates more appropriate outputs.[3]

Dataset                                GPT      Supervised Fine-Tuning   InstructGPT
RealToxicity                           0.233    0.199                    0.196
TruthfulQA                             0.224    0.206                    0.413
API: Hallucinations                    0.414    0.078                    0.172
API: Customer Assistant Appropriate    0.811    0.880                    0.902

Evaluating InstructGPT for toxicity, truthfulness, and appropriateness. Lower scores are better for toxicity and hallucinations, and higher scores are better for TruthfulQA and appropriateness. Hallucinations and appropriateness are measured on our API prompt distribution. Results are combined across model sizes.

Finally, we find that InstructGPT outputs are preferred to those from FLAN and T0 on our customer distribution. This indicates that the data used to train FLAN and T0, mostly academic NLP tasks, is not fully representative of how deployed language models are used in practice.

Methods

To train InstructGPT models, our core technique is reinforcement learning from human feedback (RLHF), a method we helped pioneer in our earlier alignment research. This technique uses human preferences as a reward signal to fine-tune our models, which is important as the safety and alignment problems we are aiming to solve are complex and subjective, and aren’t fully captured by simple automatic metrics.

We first collect a dataset of human-written demonstrations on prompts submitted to our API, and use this to train our supervised learning baselines. Next, we collect a dataset of human-labeled comparisons between two model outputs on a larger set of API prompts. We then train a reward model (RM) on this dataset to predict which output our labelers would prefer. Finally, we use this RM as a reward function and fine-tune our GPT-3 policy to maximize this reward using the PPO algorithm.
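
For illustration, here is a minimal sketch of the pairwise comparison loss commonly used to train a reward model of this kind. The tiny PyTorch model and the random token data below are placeholders, not OpenAI’s actual architecture or training setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: maps a tokenized (prompt, completion) pair to a scalar reward."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, token_ids):
        pooled = self.embed(token_ids).mean(dim=1)  # mean-pool token embeddings
        return self.head(pooled).squeeze(-1)        # one scalar reward per example

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder comparison data: for each prompt, labelers preferred the
# "chosen" completion over the "rejected" one.
chosen = torch.randint(0, 1000, (8, 32))
rejected = torch.randint(0, 1000, (8, 32))

for step in range(10):
    r_chosen, r_rejected = model(chosen), model(rejected)
    # Pairwise ranking loss: push the reward of the preferred completion
    # above the reward of the dispreferred one.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()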

One way of thinking about this process is that it “unlocks” capabilities GPT-3 already had but that were difficult to elicit through prompt engineering alone: our training procedure has only a limited ability to teach the model new capabilities beyond what it learned during pretraining, since fine-tuning uses less than 2% of the compute and data of pretraining.

A limitation of this approach is that it introduces an “alignment tax”: aligning the models only on customer tasks can make their performance worse on some other academic NLP tasks. This is undesirable since, if our alignment techniques make models worse on tasks that people care about, they’re less likely to be adopted in practice. We’ve found a simple algorithmic change that minimizes this alignment tax: during RL fine-tuning we mix in a small fraction of the original data used to train GPT-3, and train on this data using the normal log likelihood maximization.[4] This roughly maintains performance on safety and human preferences, while mitigating performance decreases on academic tasks, and in several cases even surpassing the GPT-3 baseline.
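
Schematically, the mixed objective could look something like the sketch below; ppo_loss, the pretraining minibatch, and the mixing weight pretrain_coef are illustrative placeholders rather than the exact formulation from the paper.

import torch.nn.functional as F

def mixed_objective(ppo_loss, pretrain_logits, pretrain_targets, pretrain_coef=0.1):
    # ppo_loss: scalar policy loss from the PPO update on API prompts.
    # pretrain_logits / pretrain_targets: model outputs and next-token targets
    # on a small minibatch of the original GPT-3 pretraining data.
    # pretrain_coef: hypothetical weight on the pretraining term.
    lm_loss = F.cross_entropy(
        pretrain_logits.view(-1, pretrain_logits.size(-1)),
        pretrain_targets.view(-1))
    # Minimizing cross-entropy on pretraining text is the usual
    # log-likelihood maximization mentioned above.
    return ppo_loss + pretrain_coef * lm_loss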

Generalizing to broader preferences

Our procedure aligns our models’ behavior with the preferences of our labelers, who directly produce the data used to train our models, and us researchers, who provide guidance to labelers through written instructions, direct feedback on specific examples, and informal conversations. It is also influenced by our customers and the preferences implicit in our API policies. We selected labelers who performed well on a screening test for aptitude in identifying and responding to sensitive prompts. However, these different sources of influence on the data do not guarantee our models are aligned to the preferences of any broader group.

We conducted two experiments to investigate this. First, we evaluated GPT-3 and InstructGPT using held-out labelers[5] who did not produce any of the training data, and found that these labelers prefer outputs from the InstructGPT models at about the same rate as our training labelers. Second, we trained reward models on data from a subset of our labelers, and found that they generalize well to predicting the preferences of a different subset of labelers. This suggests that our models haven’t simply overfit to the preferences of our training labelers. However, more work is needed to study how these models perform on broader groups of users, and how they perform on inputs where humans disagree about the desired behavior.

Limitations

Despite making significant progress, our InstructGPT models are far from fully aligned or fully safe; they still generate toxic or biased outputs, make up facts, and generate sexual and violent content without explicit prompting. But the safety of a machine learning system depends not only on the behavior of the underlying models, but also on how these models are deployed. To support the safety of our API, we will continue to review potential applications before they go live, provide content filters for detecting unsafe completions, and monitor for misuse.

A byproduct of training our models to follow user instructions is that they may become more susceptible to misuse if instructed to produce unsafe outputs. Solving this requires our models to refuse certain instructions; doing this reliably is an important open research problem that we are excited to tackle.

Further, in many cases aligning to the average labeler preference may not be desirable. For example, when generating text that disproportionately affects a minority group, the preferences of that group should be weighted more heavily. Right now, InstructGPT is trained to follow instructions in English; thus, it is biased towards the cultural values of English-speaking people. We are conducting research into understanding the differences and disagreements between labelers’ preferences so we can condition our models on the values of more specific populations. More generally, aligning model outputs to the values of specific humans introduces difficult choices with societal implications, and ultimately we must establish responsible, inclusive processes for making these decisions.

Next steps

This is the first application of our alignment research to our product. Our results show that these techniques are effective at significantly improving the alignment of general-purpose AI systems with human intentions. However, this is just the beginning: we will keep pushing these techniques to improve the alignment of our current and future models towards language tools that are safe and helpful to humans.

If you’re interested in these research directions, we’re hiring!





References
  1. Christiano, P., Leike, J., Brown, T.B., Martic, M., Legg, S. and Amodei, D., 2017. Deep reinforcement learning from human preferences. arXiv preprint arXiv:1706.03741.
  2. Stiennon, N., Ouyang, L., Wu, J., Ziegler, D.M., Lowe, R., Voss, C., Radford, A., Amodei, D. and Christiano, P., 2020. Learning to summarize with human feedback. arXiv preprint arXiv:2009.01325.
  3. Wu, J., Ouyang, L., Ziegler, D.M., Stiennon, N., Lowe, R., Leike, J. and Christiano, P., 2021. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.
  4. Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M. and Le, Q.V., 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
  5. Sanh, V., Webson, A., Raffel, C., Bach, S.H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T.L., Raja, A. and Dey, M., 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
  6. Bender, E.M., Gebru, T., McMillan-Major, A. and Shmitchell, S., 2021, March. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?🦜. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610-623).
  7. Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E. and Brynjolfsson, E., 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
  8. Kenton, Z., Everitt, T., Weidinger, L., Gabriel, I., Mikulik, V. and Irving, G., 2021. Alignment of Language Agents. arXiv preprint arXiv:2103.14659.
  9. Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A. and Kenton, Z., 2021. Ethical and social risks of harm from Language Models. arXiv preprint arXiv:2112.04359.
  10. Tamkin, A., Brundage, M., Clark, J. and Ganguli, D., 2021. Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models. arXiv preprint arXiv:2102.02503.
  11. Solaiman, I. and Dennison, C., 2021. Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets. arXiv preprint arXiv:2106.10328.
  12. Ngo, H., Raterink, C., Araújo, J.G., Zhang, I., Chen, C., Morisot, A. and Frosst, N., 2021. Mitigating harm in language models with conditional-likelihood filtration. arXiv preprint arXiv:2108.07790.
  13. Xu, J., Ju, D., Li, M., Boureau, Y.L., Weston, J. and Dinan, E., 2020. Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079.
  14. Keskar, N.S., McCann, B., Varshney, L.R., Xiong, C. and Socher, R., 2019. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
  15. Krause, B., Gotmare, A.D., McCann, B., Keskar, N.S., Joty, S., Socher, R. and Rajani, N.F., 2020. Gedi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367.
  16. Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J. and Liu, R., 2019. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.
  17. Lin, S., Hilton, J. and Evans, O., 2021. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
  18. Gehman, S., Gururangan, S., Sap, M., Choi, Y. and Smith, N.A., 2020. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462.
  19. Rudinger, R., Naradowsky, J., Leonard, B. and Van Durme, B., 2018. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301.
  20. Nangia, N., Vania, C., Bhalerao, R. and Bowman, S.R., 2020. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. arXiv preprint arXiv:2010.00133.


Acknowledgments

We'd like to thank our paper co-authors: Long Ouyang, Jeff Wu, Roger Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, and Paul Christiano, along with everyone who provided feedback on the paper and blog post. We'd also like to thank the Comms team for their guidance and assistance, including Steve Dowling, Hannah Wong, Elie Georges, Alper Ercetin, Jared Salzano, Allan Diego, and Justin Jay Wang. Finally, we'd like to thank our labelers, without whom this project would not have been possible.


Footnotes

  1. We only use prompts submitted through the Playground to an earlier version of the InstructGPT models that was deployed in January 2021. Our human annotators remove personally identifiable information from all prompts before adding them to the training set. ↩︎

  2. The InstructGPT models deployed in the API are updated versions trained using the same human feedback data. They use a similar but slightly different training method that we will describe in a forthcoming publication. ↩︎

  3. We also measure several other dimensions of potentially harmful outputs on our API distribution: whether the outputs contain sexual or violent content, denigrate a protected class, or encourage abuse. We find that InstructGPT doesn’t improve significantly over GPT-3 on these metrics; the incidence rate is equally low for both models. ↩︎

  4. We found this approach more effective than simply increasing the KL coefficient. ↩︎

  5. These labelers are sourced from Scale AI and Upwork, similarly to our training labelers, but do not undergo a screening test. ↩︎

source https://openai.com/blog/instruction-following/

How to Predict Marketing Performance with No-Code AI from Faraday

Want to actually use AI to attract, engage, and retain customers, without having to code?

Faraday is a no-code AI platform that makes consumer predictions at scale.

The tool’s machine learning builds predictive models that marketers can use to spot cross-sell and up-sell opportunities and to predict churn.

We spoke with Faraday’s VP of Marketing, Danielle Rand, to learn more about how this AI-powered solution works.


Introducing Text and Code Embeddings in the OpenAI API

We are introducing embeddings, a new endpoint in the OpenAI API that makes it easy to perform natural language and code tasks like semantic search, clustering, topic modeling, and classification. Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts. Our embeddings outperform top models in 3 standard benchmarks, including a 20% relative improvement in code search.

Read documentation | Read paper

Embeddings are useful for working with natural language and code, because they can be readily consumed and compared by other machine learning models and algorithms like clustering or search.


Embeddings that are numerically similar are also semantically similar. For example, the embedding vector of “canine companions say” will be more similar to the embedding vector of “woof” than that of “meow.”


The new endpoint uses neural network models, which are descendants of GPT-3, to map text and code to a vector representation—“embedding” them in a high-dimensional space. Each dimension captures some aspect of the input.

The new /embeddings endpoint in the OpenAI API provides text and code embeddings with a few lines of code:

import openai

# Request an embedding for one piece of text with the Davinci text-similarity model.
response = openai.Embedding.create(
    input="canine companions say",
    engine="text-similarity-davinci-001")

We’re releasing three families of embedding models, each tuned to perform well on different functionalities: text similarity, text search, and code search. The models take either text or code as input and return an embedding vector.

Text similarity: Captures semantic similarity between pieces of text.
  Models: text-similarity-{ada, babbage, curie, davinci}-001
  Use cases: Clustering, regression, anomaly detection, visualization

Text search: Semantic information retrieval over documents.
  Models: text-search-{ada, babbage, curie, davinci}-{query, doc}-001
  Use cases: Search, context relevance, information retrieval

Code search: Find relevant code with a query in natural language.
  Models: code-search-{ada, babbage}-{code, text}-001
  Use cases: Code search and relevance

Text Similarity Models

Text similarity models provide embeddings that capture the semantic similarity of pieces of text. These models are useful for many tasks including clustering, data visualization, and classification.

The following interactive visualization shows embeddings of text samples from the DBpedia dataset:


Embeddings from the text-similarity-babbage-001 model, applied to the DBpedia dataset. We randomly selected 100 samples from the dataset covering 5 categories, and computed the embeddings via the /embeddings endpoint. The different categories show up as 5 clear clusters in the embedding space. To visualize the embedding space, we reduced the embedding dimensionality from 2048 to 3 using PCA. The code for visualizing the embedding space in three dimensions is available here.
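
A minimal sketch of that reduction step with scikit-learn, assuming the embeddings form a (100, 2048) NumPy array of vectors from the /embeddings endpoint (random placeholders are used below):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

embeddings = np.random.randn(100, 2048)      # placeholder for real embedding vectors
labels = np.random.randint(0, 5, size=100)   # placeholder for the 5 DBpedia categories

# Project the 2048-dimensional embeddings down to 3 dimensions for plotting.
coords = PCA(n_components=3).fit_transform(embeddings)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=labels)
plt.show()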

To compare the similarity of two pieces of text, you simply use the dot product on the text embeddings. The result is a “similarity score”, sometimes called “cosine similarity,” between 0 and 1, where a higher number means more similarity. In most applications, the embeddings can be pre-computed, and then the dot product comparison is extremely fast to carry out.

import openai, numpy as np

# Embed two related pieces of text in a single request.
resp = openai.Embedding.create(
    input=["feline friends go", "meow"],
    engine="text-similarity-davinci-001")

embedding_a = resp['data'][0]['embedding']
embedding_b = resp['data'][1]['embedding']

# The dot product of the two embedding vectors is the similarity score described above.
similarity_score = np.dot(embedding_a, embedding_b)

One popular use of embeddings is to use them as features in machine learning tasks, such as classification. In machine learning literature, when using a linear classifier, this classification task is called a “linear probe.” Our text similarity models achieve new state-of-the-art results on linear probe classification in SentEval (Conneau et al., 2018), a commonly used benchmark for evaluating embedding quality.
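
As a rough sketch, a linear probe amounts to fitting a simple linear classifier on frozen embedding vectors; the random arrays below stand in for real embeddings and labels from a SentEval task.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.random.randn(500, 2048)           # placeholder embedding vectors
y = np.random.randint(0, 2, size=500)    # placeholder class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The "probe" is just a linear classifier trained on top of the frozen embeddings.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("linear probe accuracy:", probe.score(X_test, y_test))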

Linear probe classification over 7 datasets
Previous SOTA (Guo et al., 2021): 90.2%
text-similarity-davinci-001: 92.2%

Text Search Models

Text search models provide embeddings that enable large-scale search tasks, like finding a relevant document among a collection of documents given a text query. Embeddings for the documents and the query are produced separately, and cosine similarity is then used to compare the similarity between the query and each document.

Embedding-based search can generalize better than word overlap techniques used in classical keyword search, because it captures the semantic meaning of text and is less sensitive to exact phrases or words. We evaluate the text search model’s performance on the BEIR (Thakur et al., 2021) search evaluation suite and obtain better search performance than previous methods. Our text search guide provides more details on using embeddings for search tasks.
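
As a sketch of this pattern with the API, the snippet below embeds a query and a few made-up documents using the text-search models from the table above and ranks the documents by cosine similarity; in a real application the document embeddings would be precomputed and stored.

import openai, numpy as np

documents = [
    "Return policy: items can be returned within 30 days of delivery.",
    "Our office is open Monday through Friday, 9am to 5pm.",
    "Contact support by email for billing questions.",
]

# Documents and queries use separate models from the text-search family.
doc_resp = openai.Embedding.create(
    input=documents,
    engine="text-search-curie-doc-001")
doc_embeddings = [item["embedding"] for item in doc_resp["data"]]

query_resp = openai.Embedding.create(
    input="how do I send something back?",
    engine="text-search-curie-query-001")
query_embedding = query_resp["data"][0]["embedding"]

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by similarity to the query.
scores = [cosine(query_embedding, d) for d in doc_embeddings]
for doc, score in sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {doc}")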

Code Search Models

Code search models provide code and text embeddings for code search tasks. Given a collection of code blocks, the task is to find the relevant code block for a natural language query. We evaluate the code search models on the CodeSearchNet (Husain et al., 2019) evaluation suite, where our embeddings achieve significantly better results than prior methods. Check out the code search guide to use embeddings for code search.

Average accuracy over 6 programming languages
Previous SOTA (Guo et al., 2021): 77.4%
code-search-babbage-{doc, query}-001: 93.5%

Examples of the Embeddings API in Action

JetBrains Research

JetBrains Research’s Astroparticle Physics Lab analyzes data like The Astronomer’s Telegram and NASA’s GCN Circulars, which are reports on astronomical events that can’t be parsed by traditional algorithms.

Powered by OpenAI’s embeddings of these astronomical reports, researchers are now able to search for events like “crab pulsar bursts” across multiple databases and publications. Embeddings also achieved 99.85% accuracy on data source classification through k-means clustering.
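
A rough sketch of that kind of clustering with scikit-learn, using random vectors as stand-ins for real embeddings of the astronomical reports:

import numpy as np
from sklearn.cluster import KMeans

report_embeddings = np.random.randn(200, 2048)   # placeholder embedding vectors

# Group the reports into two clusters; in practice each cluster can then be
# mapped to a data source (e.g., ATel vs. GCN) using a few labeled examples.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(report_embeddings)
print(kmeans.labels_[:10])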

FineTune Learning

FineTune Learning is a company building hybrid human-AI solutions for learning, like adaptive learning loops that help students reach academic standards.

OpenAI’s embeddings significantly improved the task of finding textbook content based on learning objectives. Achieving a top-5 accuracy of 89.1%, OpenAI’s text-search-curie embeddings model outperformed previous approaches like Sentence-BERT (64.5%). While human experts are still better, the FineTune team is now able to label entire textbooks in a matter of seconds, in contrast to the hours that it took the experts.

Comparison of our embeddings with Sentence-BERT, GPT-3 search, and human subject-matter experts for matching textbook content with learning objectives. We report accuracy@k, the fraction of cases in which the correct answer appears within the top-k predictions.

Fabius

Fabius helps companies turn customer conversations into structured insights that inform planning and prioritization. OpenAI’s embeddings allow companies to more easily find and tag customer call transcripts with feature requests.

For instance, customers might use words like “automated” or “easy to use” to ask for a better self-service platform. Previously, Fabius was using fuzzy keyword search to attempt to tag those transcripts with the self-service platform label. With OpenAI’s embeddings, they’re now able to find 2x more examples in general, and 6x–10x more examples for features with abstract use cases that don’t have a clear keyword customers might use.

All API customers can get started with the embeddings documentation for using embeddings in their applications.

Read documentation


Acknowledgments

Thanks to the following for their contributions to this release:

Tao Xu, Chris Hallacy, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Johannes Heidecke, Pranav Shyam, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson, Tabarak Khan, and Toki Sherbakov.

Thanks to the following for their feedback on this post: Tom Kleinpeter, Morgan Gallant, Sam Altman, Ilya Sutskever, Steve Dowling, Rachel Lim, Arun Vijayvergiya, Rajeev Nayak, Peter Welinder, Justin Jay Wang.


source https://openai.com/blog/introducing-text-and-code-embeddings/

How to Build a Smarter Content Marketing Program with Contrend

Want to read the minds of your audience?

Contrend can help you get complete visibility into audience content preferences and your competitors’ strategies.

The tool uses AI and machine learning to predict and recommend the content topics, formats, and images that your audience responds to best.

We spoke with Contrend CEO Peter Bakker and CTO Richard Jones to learn more about how this AI-powered marketing solution works.


AI for CMOs: How to Transform Talent, Tech, and Strategy

CMOs are increasingly turning to AI to solve their biggest challenges. 

That’s no surprise: AI has the potential to transform talent, tech, and strategy for marketing leaders. 

To learn how, we brought together some world-class CMOs and marketing experts to discuss how CMOs and marketers can use AI to unlock their full potential.
