Techniques for Training Large Neural Networks

Large neural networks are at the core of many recent advances in AI, but training them is a difficult engineering and research challenge which requires orchestrating a cluster of GPUs to perform a single synchronized calculation. As cluster and model sizes have grown, machine learning practitioners have developed an increasing variety of techniques to parallelize model training over many GPUs. At first glance, understanding these parallelism techniques may seem daunting, but with only a few assumptions about the structure of the computation these techniques become much more clear—at that point, you're just shuttling around opaque bits from A to B like a network switch shuttles around packets.

[Figure: four panels illustrating Data Parallelism, Pipeline Parallelism, Tensor Parallelism, and Expert Parallelism.]

An illustration of various parallelism strategies on a three-layer model. Each color refers to one layer and dashed lines separate different GPUs.

No Parallelism

Training a neural network is an iterative process. In every iteration, we do a pass forward through a model's layers to compute an output for each training example in a batch of data. Then another pass proceeds backward through the layers, propagating how much each parameter affects the final output by computing a gradient with respect to each parameter. The average gradient for the batch, the parameters, and some per-parameter optimization state is passed to an optimization algorithm, such as Adam, which computes the next iteration's parameters (which should have slightly better performance on your data) and new per-parameter optimization state. As the training iterates over batches of data, the model evolves to produce increasingly accurate outputs.
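As a concrete reference, here is a minimal sketch of one such iteration in PyTorch; the tiny model, random data, and hyperparameters below are stand-ins chosen only to make the loop runnable, not part of any particular training setup.

```python
import torch
import torch.nn as nn

# Stand-in model and synthetic data, just to make the iteration structure concrete.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam keeps per-parameter state

for step in range(100):
    inputs = torch.randn(16, 32)               # one batch of 16 training examples
    targets = torch.randint(0, 10, (16,))

    outputs = model(inputs)                    # forward pass through every layer
    loss = nn.functional.cross_entropy(outputs, targets)

    optimizer.zero_grad()
    loss.backward()                            # backward pass: gradient for each parameter
    optimizer.step()                           # update parameters and the optimizer's state
```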

Various parallelism techniques slice this training process across different dimensions, including:

  • Data parallelism—run different subsets of the batch on different GPUs;
  • Pipeline parallelism—run different layers of the model on different GPUs;
  • Tensor parallelism—break up the math for a single operation such as a matrix multiplication to be split across GPUs;
  • Mixture-of-Experts—process each example by only a fraction of each layer.

(In this post, we'll assume that you are using GPUs to train your neural networks, but the same ideas apply to those using any other neural network accelerator.)

Data Parallelism

Data Parallel training means copying the same parameters to multiple GPUs (often called "workers") and assigning different examples to each to be processed simultaneously. Data parallelism alone still requires that your model fits into a single GPU’s memory, but lets you utilize the compute of many GPUs at the cost of storing many duplicate copies of your parameters. That being said, there are strategies to increase the effective RAM available to your GPU, such as temporarily offloading parameters to CPU memory between usages.

As each data parallel worker updates its copy of the parameters, they need to coordinate to ensure that each worker continues to have similar parameters. The simplest approach is to introduce blocking communication between workers: (1) independently compute the gradient on each worker; (2) average the gradients across workers; and (3) independently compute the same new parameters on each worker. Step (2) is a blocking average which requires transferring quite a lot of data (proportional to the number of workers times the size of your parameters), which can hurt your training throughput. There are various asynchronous synchronization schemes to remove this overhead, but they hurt learning efficiency; in practice, people generally stick with the synchronous approach.
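A minimal sketch of this synchronous scheme using torch.distributed, assuming the process group has already been initialized with one process per GPU, and that the caller supplies the model, optimizer, and this worker's shard of the batch:

```python
import torch
import torch.distributed as dist

def data_parallel_step(model, optimizer, inputs, targets):
    """One synchronous data-parallel step; every worker runs this on its own slice of the batch."""
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()                                       # (1) compute gradients locally

    world_size = dist.get_world_size()
    for p in model.parameters():                          # (2) blocking average across workers
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    optimizer.step()                                      # (3) every worker applies the same update
    return loss
```

In practice, libraries such as torch.nn.parallel.DistributedDataParallel overlap this gradient communication with the backward pass instead of looping over parameters afterward, which hides much of the synchronization cost.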

Pipeline Parallelism

With Pipeline Parallel training, we partition sequential chunks of the model across GPUs. Each GPU holds only a fraction of parameters, and thus the same model consumes proportionally less memory per GPU.

It’s straightforward to split a large model into chunks of consecutive layers. However, there’s a sequential dependency between the inputs and outputs of layers, so a naive implementation can leave a worker idle for long stretches while it waits for the previous machine's output to use as its input. These chunks of waiting time are known as “bubbles,” and they waste computation that the idling machines could otherwise be doing.
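To make the naive setup concrete, here is a sketch that assumes two GPUs (cuda:0 and cuda:1) and splits a toy model into two consecutive stages; each input must finish stage 1 before stage 2 can start, which is exactly where the bubble comes from.

```python
import torch
import torch.nn as nn

# Two consecutive chunks of layers, each living on its own (assumed) GPU.
stage1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU()).to("cuda:0")
stage2 = nn.Sequential(nn.Linear(64, 10)).to("cuda:1")

def forward(x):
    h = stage1(x.to("cuda:0"))      # GPU 0 computes while GPU 1 idles...
    return stage2(h.to("cuda:1"))   # ...then GPU 1 computes while GPU 0 idles: the "bubble"

outputs = forward(torch.randn(16, 32))
```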

[Figure: naive pipeline parallelism schedule. Legend: Forward, Backward, Gradient update, Idle.]

Illustration of a naive pipeline parallelism setup where the model is vertically split into 4 partitions by layer. Worker 1 hosts model parameters of the first layer of the network (closest to the input), while worker 4 hosts layer 4 (which is closest to the output). “F”, “B”, and “U” represent forward, backward and update operations, respectively. The subscripts indicate on which worker an operation runs. Data is processed by one worker at a time due to the sequential dependency, leading to large “bubbles” of idle time.

We can reuse the ideas from data parallelism to reduce the cost of the bubble by having each worker only process a subset of data elements at one time, allowing us to cleverly overlap new computation with wait time. The core idea is to split one batch into multiple microbatches; each microbatch should be proportionally faster to process and each worker begins working on the next microbatch as soon as it’s available, thus expediting the pipeline execution. With enough microbatches the workers can be utilized most of the time with a minimal bubble at the beginning and end of the step. Gradients are averaged across microbatches, and updates to the parameters happen only once all microbatches have been completed.
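The gradient-accumulation half of this idea is easy to write down. Continuing the two-stage sketch above (and leaving the actual overlapped scheduling to a pipeline library), splitting a batch into microbatches might look like this:

```python
# Continuing the two-stage sketch above; one optimizer covers both stages' parameters.
optimizer = torch.optim.Adam(list(stage1.parameters()) + list(stage2.parameters()), lr=1e-3)

def pipeline_step(batch_inputs, batch_targets, n_microbatches=4):
    optimizer.zero_grad()
    for micro_in, micro_tgt in zip(batch_inputs.chunk(n_microbatches),
                                   batch_targets.chunk(n_microbatches)):
        outputs = forward(micro_in)                        # each microbatch crosses both stages
        loss = nn.functional.cross_entropy(outputs, micro_tgt.to(outputs.device))
        (loss / n_microbatches).backward()                 # gradients are averaged over microbatches
    optimizer.step()                                       # one parameter update per full batch

pipeline_step(torch.randn(64, 32), torch.randint(0, 10, (64,)))
```

A real scheduler (such as GPipe or PipeDream) feeds microbatch k+1 into stage 1 while stage 2 is still working on microbatch k; that overlap is what actually shrinks the bubble.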

The number of workers that the model is split over is commonly known as pipeline depth.

During the forward pass, each worker only needs to send the output of its chunk of layers (called activations) to the next worker; during the backward pass, it sends only the gradients on those activations to the previous worker. There’s a big design space for how to schedule these passes and how to aggregate the gradients across microbatches. GPipe has each worker process forward and backward passes consecutively and then aggregates gradients from multiple microbatches synchronously at the end. PipeDream instead schedules each worker to alternately process forward and backward passes.

[Figure: pipeline schedules for GPipe and PipeDream. Legend: Forward, Backward, Update, Idle.]

Comparison of GPipe and PipeDream pipelining schemes, using 4 microbatches per batch. Microbatches 1-8 correspond to two consecutive data batches. In the image, “(number)” indicates on which microbatch an operation is performed and the subscript marks the worker ID. Note that PipeDream gets more efficiency by performing some computations with stale parameters.

Tensor Parallelism

Pipeline parallelism splits a model “vertically” by layer. It's also possible to “horizontally” split certain operations within a layer, which is usually called Tensor Parallel training. For many modern models (such as the Transformer), the computation bottleneck is multiplying an activation batch matrix with a large weight matrix. Matrix multiplication can be thought of as dot products between pairs of rows and columns; it's possible to compute independent dot products on different GPUs, or to compute parts of each dot product on different GPUs and sum up the results. With either strategy, we can slice the weight matrix into even-sized “shards”, host each shard on a different GPU, and use that shard to compute the relevant part of the overall matrix product before later communicating to combine the results.
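As a sketch of both strategies on two assumed GPUs (the matrix sizes are arbitrary): column sharding computes disjoint slices of the output and concatenates them, while row sharding computes partial products and sums them.

```python
import torch

x = torch.randn(16, 512)        # activation batch
w = torch.randn(512, 1024)      # full weight matrix, kept here only to describe the sharding

# Column sharding: each GPU holds half of W's columns and produces half of the output features.
w_col0, w_col1 = w.chunk(2, dim=1)
y_cols = torch.cat([(x.to("cuda:0") @ w_col0.to("cuda:0")).cpu(),
                    (x.to("cuda:1") @ w_col1.to("cuda:1")).cpu()], dim=1)

# Row sharding: each GPU holds half of W's rows (and the matching slice of x); partial products are summed.
x0, x1 = x.chunk(2, dim=1)
w_row0, w_row1 = w.chunk(2, dim=0)
y_rows = (x0.to("cuda:0") @ w_row0.to("cuda:0")).cpu() + (x1.to("cuda:1") @ w_row1.to("cuda:1")).cpu()

# Both y_cols and y_rows equal x @ w up to floating-point rounding; in a real system the
# .cpu() copies would instead be all-gather / all-reduce collectives between the GPUs.
```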

One example is Megatron-LM, which parallelizes matrix multiplications within the Transformer's self-attention and MLP layers. PTD-P uses tensor, data, and pipeline parallelism; its pipeline schedule assigns multiple non-consecutive layers to each device, reducing bubble overhead at the cost of more network communication.

Sometimes the input to the network can be parallelized across a dimension with a high degree of parallel computation relative to cross-communication. Sequence parallelism is one such idea, where an input sequence is split across time into multiple sub-examples, proportionally decreasing peak memory consumption by allowing the computation to proceed with more granularly-sized examples.

Mixture-of-Experts (MoE)

With the Mixture-of-Experts (MoE) approach, only a fraction of the network is used to compute the output for any one input. One example approach is to have many sets of weights, with the network choosing which set to use via a gating mechanism at inference time. This enables many more parameters without increased computation cost. Each set of weights is referred to as an “expert,” in the hope that the network will learn to assign specialized computation and skills to each one. Different experts can be hosted on different GPUs, providing a clear way to scale up the number of GPUs used for a model.

Illustration of a mixture-of-experts (MoE) layer. Only 2 out of the n experts are selected by the gating network. (Image adapted from: Shazeer et al., 2017)

GShard scales an MoE Transformer up to 600 billion parameters with a scheme where only the MoE layers are split across multiple TPU devices and other layers are fully duplicated. Switch Transformer scales model size to trillions of parameters with even higher sparsity by routing one input to a single expert.
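A toy, single-device sketch of this top-k routing idea follows; the layer sizes, expert count, and top-2 choice are arbitrary, and real systems place different experts on different GPUs and dispatch tokens between them.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """A toy top-2 mixture-of-experts layer; real systems shard the experts across GPUs."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)   # gating network scores every expert per token
        self.top_k = top_k

    def forward(self, x):                           # x: (tokens, d_model)
        scores = torch.softmax(self.gate(x), dim=-1)
        weights, chosen = scores.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, k] == e            # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

layer = TinyMoELayer()
outputs = layer(torch.randn(10, 64))                # only 2 of the 8 experts run for each token
```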

Other Memory Saving Designs

There are many other computational strategies to make training increasingly large neural networks more tractable. For example:

  • To compute the gradient, you need to have saved the original activations, which can consume a lot of device RAM. Checkpointing (also known as activation recomputation) stores only a subset of the activations and recomputes the intermediate ones just-in-time during the backward pass. This saves a lot of memory at the computational cost of at most one additional full forward pass. One can also continually trade off between compute and memory cost with selective activation recomputation, which checkpoints the subsets of activations that are relatively more expensive to store but cheaper to compute. (A sketch combining checkpointing with mixed precision appears after this list.)

  • Mixed Precision Training means training models with lower-precision numbers (most commonly FP16). Modern accelerators can reach much higher FLOP counts with lower-precision numbers, and you also save on device RAM. With proper care, the resulting model loses almost no accuracy.

  • Offloading means temporarily moving unused data to the CPU or across different devices and reading it back later when needed. Naive implementations will slow down training a lot, but sophisticated implementations pre-fetch the data so that the device never needs to wait for it. One implementation of this idea is ZeRO, which splits the parameters, gradients, and optimizer states across all available hardware and materializes them as needed.

  • Memory Efficient Optimizers, such as Adafactor, have been proposed to reduce the memory footprint of the running state maintained by the optimizer.

  • Compression can also be used for storing intermediate results in the network. For example, Gist compresses activations that are saved for the backward pass; DALL·E compresses the gradients before synchronizing them.
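
As a sketch of how the first two of these ideas look in PyTorch, the snippet below combines activation checkpointing with automatic mixed precision; the two-part model and the data are placeholders, and a CUDA device is assumed.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Placeholder two-part model and synthetic data; a CUDA device is assumed.
part1 = nn.Sequential(nn.Linear(64, 256), nn.ReLU()).cuda()
part2 = nn.Sequential(nn.Linear(256, 10)).cuda()
optimizer = torch.optim.Adam(list(part1.parameters()) + list(part2.parameters()), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()            # rescales the loss so FP16 gradients don't underflow

inputs = torch.randn(32, 64, device="cuda")
targets = torch.randint(0, 10, (32,), device="cuda")

with torch.cuda.amp.autocast():                 # run the forward pass in lower precision
    # Activations inside the checkpointed segment are not kept; they are
    # recomputed during the backward pass, trading compute for memory.
    hidden = checkpoint(part1, inputs, use_reentrant=False)
    loss = nn.functional.cross_entropy(part2(hidden), targets)

optimizer.zero_grad()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```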


At OpenAI, we are training and improving large models, from the underlying infrastructure all the way to deploying them for real-world problems. If you’d like to put the ideas from this post into practice—especially relevant for our Scaling and Applied Research teams—we’re hiring!


Acknowledgments
Thanks to Nikolas Tezak, Sam Altman, Daniel Gackle, Ilya Sutskever, and Steve Dowling for feedback on drafts. Thanks to Justin Jay Wang, Bianca Martin, and Steve Dowling for communications and design.

source https://openai.com/blog/techniques-for-training-large-neural-networks/

Best Practices for Deploying Language Models

Cohere, OpenAI, and AI21 Labs have developed a preliminary set of best practices applicable to any organization developing or deploying large language models. Computers that can read and write are here, and they have the potential to fundamentally impact daily life. The future of human–machine interaction is full of possibility and promise, but any powerful technology needs careful deployment.

The joint statement below represents a step towards building a community to address the global challenges presented by AI progress, and we encourage other organizations who would like to participate to get in touch.

Joint Recommendation for Language Model Deployment

We’re recommending several key principles to help providers of large language models (LLMs) mitigate the risks of this technology in order to achieve its full promise to augment human capabilities.

While these principles were developed specifically based on our experience with providing LLMs through an API, we hope they will be useful regardless of release strategy (such as open-sourcing or use within a company). We expect these recommendations to change significantly over time because the commercial uses of LLMs and accompanying safety considerations are new and evolving. We are actively learning about and addressing LLM limitations and avenues for misuse, and will update these principles and practices in collaboration with the broader community over time.

We’re sharing these principles in hopes that other LLM providers may learn from and adopt them, and to advance public discussion on LLM development and deployment.

Prohibit misuse


Publish usage guidelines and terms of use of LLMs in a way that prohibits material harm to individuals, communities, and society such as through spam, fraud, or astroturfing. Usage guidelines should also specify domains where LLM use requires extra scrutiny and prohibit high-risk use-cases that aren’t appropriate, such as classifying people based on protected characteristics.


Build systems and infrastructure to enforce usage guidelines. This may include rate limits, content filtering, application approval prior to production access, monitoring for anomalous activity, and other mitigations.

Mitigate unintentional harm


Proactively mitigate harmful model behavior. Best practices include comprehensive model evaluation to properly assess limitations, minimizing potential sources of bias in training corpora, and techniques to minimize unsafe behavior such as through learning from human feedback.


Document known weaknesses and vulnerabilities, such as bias or ability to produce insecure code, as in some cases no degree of preventative action can completely eliminate the potential for unintended harm. Documentation should also include model and use-case-specific safety best practices.

Thoughtfully collaborate with stakeholders


Build teams with diverse backgrounds and solicit broad input. Diverse perspectives are needed to characterize and address how language models will operate in the diversity of the real world, where if unchecked they may reinforce biases or fail to work for some groups.


Publicly disclose lessons learned regarding LLM safety and misuse in order to enable widespread adoption and help with cross-industry iteration on best practices.


Treat all labor in the language model supply chain with respect. For example, providers should have high standards for the working conditions of those reviewing model outputs in-house and hold vendors to well-specified standards (e.g., ensuring labelers are able to opt out of a given task).

As LLM providers, publishing these principles represents a first step in collaboratively guiding safer large language model development and deployment. We are excited to continue working with each other and with other parties to identify other opportunities to reduce unintentional harms from and prevent malicious use of language models.

Support from other organizations

“While LLMs hold a lot of promise, they have significant inherent safety issues which need to be worked on. These best practices serve as an important step in minimizing the harms of these models and maximizing their potential benefits.”

—Anthropic

“As large language models (LLMs) have become increasingly powerful and expressive, risk mitigation becomes increasingly important. We welcome these and other efforts to proactively seek to mitigate harms and highlight to users areas requiring extra diligence. The principles outlined here are an important contribution to the global conversation.”

—John Bansemer, Director of the CyberAI Project and Senior Fellow, Center for Security and Emerging Technology (CSET)

“Google affirms the importance of comprehensive strategies in analyzing model and training data to mitigate the risks of harm, bias, and misrepresentation. It is a thoughtful step taken by these AI providers to promote the principles and documentation towards AI safety.”

—Google Cloud Platform (GCP)

“The safety of foundation models, such as large language models, is a growing social concern. We commend Cohere, OpenAI, and AI21 Labs for taking a first step to outline high-level principles for responsible development and deployment from the perspective of model developers. There is still much work to be done, and we believe it is essential to engage more voices from academia, industry, and civil society to develop more detailed principles and community norms. As we state in our recent blog post, it is not just the end result but the legitimacy of the process that matters.”

—Percy Liang, Director of the Stanford Center for Research on Foundation Models (CRFM)

Get involved

If you’re developing language models or are working to mitigate their risks, we’d love to talk with you. Please reach out at bestpractices@openai.com.

source https://openai.com/blog/best-practices-for-deploying-language-models/

Powering Next Generation Applications with OpenAI Codex

OpenAI Codex, a natural language-to-code system based on GPT-3, helps turn simple English instructions into over a dozen popular coding languages. Codex was released last August through our API and is the principal building block of GitHub Copilot.

Our motivation behind Codex is to supplement developers’ work and increase productivity. Codex helps computers to better understand people’s intent, which enables everyone to do more with computers. This is an integral part of our mission to build general-purpose AI that benefits all of humanity.

For enterprise customers, Microsoft’s Azure OpenAI Service provides developers with access to Codex and our other models, like GPT-3 and embeddings, along with enterprise-grade capabilities that are built into Microsoft Azure. At its Build conference today, Microsoft announced that Azure OpenAI Service—previously available by invitation only—is now available in a limited access preview. We’re already seeing new applications of Azure OpenAI Service across many industry verticals, from healthcare to financial services.

Applications and Industries

Since its release via our API, we’ve been working closely with developers to build on top of Codex. These applications utilize the system’s capabilities in a variety of categories including creativity, learning, productivity and problem solving.

Applications using Codex:

Powering Next Generation Applications with OpenAI Codex

GitHub Copilot is an AI pair programmer that provides suggestions for whole lines or entire functions right inside the code editor.

Through tight integration with Codex, GitHub Copilot can convert comments to code, autofill repetitive code, suggest tests and show alternatives.

Available for Visual Studio and Visual Studio Code, among other environments, GitHub Copilot works with a broad set of frameworks and languages, and for some programming languages suggests approximately 35% of the code generated by tens of thousands of developers who use it today.

Microsoft announced at its Build developer conference that GitHub Copilot will move to general availability this summer.

Powering Next Generation Applications with OpenAI Codex

Pygma aims to turn Figma designs into high-quality code.

Pygma utilizes Codex to turn Figma designs into different frontend frameworks and match the coding style and preferences of the developer. Codex enables Pygma to help developers instantly do tasks that previously could have taken hours.

“Codex has allowed me to integrate innovative features into my app with very little coding. As someone without a strong machine learning background, certain features like flexible code-tweaking would be incredibly difficult to build in-house. With Codex, it works almost out of the box.”

—Emile Paffard-Wray, Founder, Pygma

Powering Next Generation Applications with OpenAI Codex

Replit is a programming platform for any programming language that lets users collaborate live on projects, learn about code and share work with a community of learners and builders.

Replit leverages Codex to describe what a selection of code is doing in simple language, so everyone can get quality explanations and learning tools. Users can highlight selections of code and click “Explain Code” to use Codex to understand its functionality.

“Codex helps learners on Replit better understand code they encounter. We’ve only scratched the surface of what semantic code understanding can offer those who want to get from idea to working code quickly.”

—Amjad Masad, Founder, Replit

Powering Next Generation Applications with OpenAI Codex

Warp is a Rust-based terminal, reimagined from the ground up to help both individuals and teams be more productive in the command-line.

Terminal commands are typically difficult to remember, find, and construct. Users often have to leave the terminal and search the web for answers, and even then the results might not give them the right command to execute. Warp uses Codex to let users run a natural language command to search directly from within the terminal and get a result they can immediately use.

“Codex allows Warp to make the terminal more accessible and powerful. Developers search for entire commands using natural language rather than trying to remember them or assemble them piecemeal. Codex-powered command search has become one of our game-changing features.”

—Zach Lloyd, Founder, Warp

Powering Next Generation Applications with OpenAI Codex

Machinet helps professional Java developers write quality code by using Codex to generate intelligent unit test templates.

Machinet was able to accelerate their development several-fold by switching from building their own machine learning systems to using Codex. The flexibility of Codex makes it easy to add new features and capabilities, saving their users time and helping them be more productive.

“Codex is an amazing tool in our arsenal. Not only does it allow us to generate more meaningful code, but it has also helped us find a new design of product architecture and got us out of a local maximum.”

—Vladislav Yanchenko, Founder, Machinet

source https://openai.com/blog/codex-apps/