AI with an EQ

Editor’s Note: This content is sponsored by Marketing AI Institute partner RAD AI.

400 years ago this year, in 1622, the phrase “live and let live” was recorded by merchants who used it to govern trade throughout Europe, North Africa and Asia Minor.


AI-Written Critiques Help Humans Notice Flaws


We trained "critique-writing" models to describe flaws in summaries. Human evaluators find flaws in summaries much more often when shown our model’s critiques. Larger models are better at self-critiquing, with scale improving critique-writing more than summary-writing. This shows promise for using AI systems to assist human supervision of AI systems on difficult tasks.

Read paper | View dataset

We want to ensure that future AI systems performing very difficult tasks remain aligned with human intent. Many previous works on aligning language models rely on human evaluations as a training signal. However, humans struggle at evaluating very difficult tasks—for example, it is hard to spot every bug in a codebase or every factual error in a long essay. Models may then learn to give outputs that look good to humans but have errors we systematically fail to notice.

To mitigate this problem, we want to train AI assistants that help humans provide feedback on hard tasks. These assistants should point out flaws, help humans understand what’s going on, and answer their questions. An example of this is our past work on book summarization: reading the entire book is a lot of work, but humans assisted with chapter summaries have a much easier time evaluating a book summary.

As a proof of concept, we used supervised learning to train language models to write critiques of topic-based summaries of short stories, Wikipedia articles, and other texts from the internet. We use these models to assist human evaluators and study scaling properties of critique writing.
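
For concreteness, here is a minimal sketch of what a supervised training example for critique writing could look like. The field names, prompt format, and helper code are illustrative assumptions rather than the released dataset's actual schema.

```python
# Hypothetical sketch of a supervised training example for critique writing.
# Field names and formatting are illustrative assumptions, not the released
# dataset schema: the model is fine-tuned to map (text, question, summary)
# to a short natural-language critique.

from dataclasses import dataclass

@dataclass
class CritiqueExample:
    text: str      # source passage (e.g., a news article or short story)
    question: str  # topic the summary should address
    summary: str   # model- or human-written summary to be critiqued
    critique: str  # target output: a concrete flaw in the summary

example = CritiqueExample(
    text="New Jersey is in the crosshairs of a major winter storm ...",
    question="What does the article say about the storm's effects on daily life?",
    summary="Daily events are expected to be heavily disrupted ...",
    critique="The answer should make some mention of the expected snow.",
)

# A fine-tuning prompt/completion pair could then be assembled as:
prompt = (
    f"{example.text}\n\nQuestion: {example.question}\n\n"
    f"Summary: {example.summary}\n\nCritique:"
)
completion = " " + example.critique
```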

Experiments with AI assistance

We compare human ratings of AI-written summaries between a control group receiving no assistance and an assisted group who get to see 8 AI-written critiques. Summaries are picked from 3 different sources. Assisted humans find about 50% more flaws in summaries than unassisted raters, and they use model critiques directly for most of the flaws they find.

To see how useful our models are for evaluation assistance, we show labelers 8 model-written critiques of each summary, with a control group that receives no assistance. We use topic-based summaries from three sources: written by our models, written by humans, and written by humans deliberately to have important yet subtle flaws.

New Jersey is in the crosshairs of a major winter storm that could paralyze parts of New England and dump in excess of a foot of snow on the Garden State by Saturday. The forecast remains highly volatile and may change dramatically in the coming 24 hours.

Throughout the day, The Star-Ledger will provide updates here (newest on top) as new information comes in, watches and warnings are issued and the forecast changes.

10:30 P.M. Weather forecasters tonight reiterated warnings for drivers and residents that a potentially dangerous portion of the storm will be hitting much of central and northern New Jersey during Friday’s evening rush-hour. Major travel delays are expected late Friday and Friday night as rain turns into snow, the National Weather Service forecast said.

MORE SNOWSTORM UPDATES

• Friday, Feb. 8: N.J. snowstorm: Live updates on blizzard, traffic, flooding and more

• Saturday, Feb. 9: N.J. snowstorm update: Power outages, snow totals and other storm news

After periods of rain, heavy snow is expected to be falling in many places by late Friday afternoon, the forecast said. In some places north of Interstate 78, snow is expected to come down between 1 and 2 inches per hour. In counties like Sussex, Morris and Warren, expected snow accumulations range from 6 to 16 inches.

For many towns from Jackson in Ocean County to Somerville in Somerset County and out east to Long Beach Island, snow accumulation is expected to range from 4 to 10 inches. High winds are expected throughout the region, topping out in Monmouth County, with gusts up to 45 mph possible.

By daybreak Saturday, flurries will taper off, giving way to a sunny, blustery day, the latest forecast said.

9:12 P.M. With forecasters still predicting a major winter storm to hit New Jersey, many schools throughout the state are preemptively canceling or delaying classes Friday.

8:45 P.M. In advance of the storm, NJ Transit has announced it will be offering full systemwide cross-honoring all day Friday and all day Saturday, enabling customers to use their ticket or pass on an alternate travel mode — rail, bus or light rail.

5 P.M. The signatures of thunder-snow (which is just what it sounds like — thunder and lightning during heavy snow) are showing up on several models, according to NY NJ PA Weather meteorologist Steven DiMartino.

This indicates the potential for extremely heavy snow to fall in eastern New Jersey tomorrow night, and adds to the unpredictability of totals.

“Where you get some of this convective snow, when it comes down, it’s going to come down very, very hard,” he said. “It’s difficult to pinpoint just where these bands are going to occur. You could end up with a situation where one town has 18 inches of snow and the next town over has three.”

DiMartino stressed the volatility that remains in the forecast, and urged state residents to pay close attention to changing conditions. Many of the details of what ultimately will happen in local areas will not be determined until the storm begins to come together tomorrow.

He said the potential for these heavier snow bands to develop may be why some forecast models (like the NAM, above), are predicting much heavier snowfall totals than the National Weather Service.


The North American Model (NAM), released this afternoon, showed well over a foot of snow falling over many areas in New Jersey.

4:13 P.M. The National Weather Service has issued a blizzard warning for parts of northeastern New Jersey, including Newark and Jersey City, and the five boroughs of New York, where upwards of 14 inches of snow are expected along with howling winds and severely reduced visibility.

The blizzard warnings are in effect from 6 a.m. Friday until 1 p.m. Saturday and warn of 10 to 14 inches of snow, with locally higher amounts and white-out conditions with wind gusts of up to 45 miles per hour. Blizzard conditions are expected in coastal northeastern New Jersey, in southern Bergen and Passaic Counties and Eastern Hudson, Essex and Union counties.

Further north and west, 10 to 14 inches of snow are also expected, but winds are not expected to reach blizzard criteria. Winter storm warnings are in effect there.

3:24 P.M. The National Weather Service at Mount Holly has issued winter storm warnings for several counties in northern and central New Jersey and extended them further south than the areas covered by the previously issued watches.

The winter storm warnings have been issued for Sussex, Warren, Morris, Hunterdon, Middlesex, Monmouth, Ocean and northwest Burlington counties. In Sussex, Warren and Morris counties, the National Weather Service is expecting between 10 and 16 inches of snow to fall, while other counties in the warning area could receive six to 10 inches. The warnings are in effect from 6 a.m. Friday to 6 a.m. Saturday.

Expect the National Weather Service’s Upton, N.Y. office, which covers northeastern N.J., to follow suit shortly.

Further south, winter weather advisories have been issued for the rest of the state, where between two and five inches of snow is anticipated.

3:07 P.M. The private and public sectors in New Jersey are now bracing for major storm impacts.

More than 350 United Airlines flights, many based out of Newark Liberty International Airport, have already been canceled, according to flight tracking website FlightAware. NJ Transit announced it will cross-honor tickets across its entire system. Utilities like Jersey Central Power & Light and PSE&G say they will have extra crews on hand to deal with potential power issues caused by heavy snow and wind.

Additionally, several events are being postponed across the state, such as two sectional high school track championships. The state Office of Emergency Management has not yet opened its operations center in Trenton, but it remains a possibility. Mary Goepfert, a spokeswoman for OEM, said the state is monitoring the storm closely and has been in contact with local emergency managers in preparation.

2:07 P.M. The European model is in and it looks snowy, much like many of the other models that ran earlier. Were this to verify, a six to 12-inch plus snowfall is definitely in the cards for north and central New Jersey, particularly north of Interstate-195.

Freehold-based meteorologist and owner of NY NJ PA Weather Steven DiMartino said he likes the European solution best, so far, and agrees with totals.

What does the NAM look like, you ask? Well the snowfall printout is posted below, but Eric Holthaus tweeted a picture of the simulated radar produced by the NAM model for tomorrow night. An absolute monster.

1:50 P.M. The most-affected regions of Hurricane Sandy along the New Jersey coast are about to take another hit. With defenses already weakened, coastal communities could see major impacts from coastal flooding, with the worst coming Saturday morning, according to the National Weather Service.

“I’m really worried about the areas worst hit by Sandy,” said NWS meteorologist Gary Szatkowski. “Time is starting to work against us… We could see substantial beach erosion. I know people have been working hard, but there’s less to erode. We could easily see waves and water coming into areas you typically wouldn’t.”

Szatkowski said he is concerned about the Raritan Bay shore in particular, where a three-foot storm surge is possible at high tide Saturday morning, with five- to seven-foot waves breaking over top of it.

1:22 P.M. Tomorrow night’s commute could be awful in northern New Jersey. By 7 p.m., there is a threat that snowfall rates could reach two inches per hour across large swaths of northern and central New Jersey. Snowfall rates of this magnitude could reduce visibility substantially, wreak havoc on roads and make travel dangerous, if not nearly impossible.

Gary Szatkowski, meteorologist in charge at the National Weather Service’s Mount Holly office, said he is growing “very worried” about deteriorating conditions in the afternoon, and posted a map on Twitter showing where the threat of intense snowfall will be at 7 p.m.

12:34 P.M. An important thing to remember about this storm is that the volatility in the forecast remains high, even though models have been trending snowier. State Climatologist David Robinson said the bust potential for this forecast is “tremendous” and the slightest shift in the forecast track could mean the difference between a major snowstorm and a primarily rain event for much of the state.

Eric Holthaus, of the Wall Street Journal, points out that how much warm air enters the region prior to the storm will be crucial.

12:04 P.M. The National Weather Service offices at Mount Holly and Upton, N.Y. both issued briefing packages on the coming storm this morning. Each warned that blizzard conditions may occur Friday night in northern New Jersey. Mount Holly suggested blizzard warnings may be necessary as the storm unfolds.

Blizzard warnings are issued during very specific situations by the National Weather Service. Anticipated winds of at least 35 miles per hour and visibility reduced below a quarter of a mile for a period of three hours are necessary before the agency pulls the trigger on such a warning. Travel would become all but impossible.

11:53 A.M. David Robinson, the state climatologist at Rutgers University, said he does not envy forecasters today, calling this type of storm “the most difficult forecast a New Jersey meteorologist will have to make.” The forecast is complicated for a number of reasons, from New Jersey’s geography to the thermal profile of the atmosphere. More on why New Jersey winter storms are so hard to pin down later.

11:35 A.M. Forecast model guidance on the storm continues to vary but appears to be converging on a snowier solution for northern and central New Jersey. Overnight, several reliable models (the European, GFS and NAM) showed very different solutions, ranging from a minor event to a major winter storm that would have serious impacts on travel in northern sections of the state.

This morning, the GFS and NAM both showed the bulk of New Jersey north of I-195 receiving several inches of snow, perhaps exceeding a foot in some areas. The latest run of the European model, considered one of the most reliable, will be released at approximately 1:30 p.m.


The North American Model (NAM) shows an even snowier solution for New Jersey, with parts of the state easily exceeding a foot of snow.

Keep in mind, each model run is just one of scores of pieces of data the National Weather Service uses to make forecasts and no single model should be viewed as a complete representation of what will happen.

11:30 A.M. A winter storm watch remains in effect for the vast majority of northern and central New Jersey. Current forecasts call for six to 12 inches of snow, with higher amounts possible in the northernmost sections of New Jersey.

Because the storm is highly complex and much remains uncertain, particularly where the rain/snow line will fall, the National Weather Service is holding off on issuing any warnings until this afternoon.

_The Associated Press contributed to this report._



Question: What does the article say about the storm’s effects on daily life?

Summary 1: Daily events are expected to be heavily disrupted during the storm as the conditions for a blizzard are ripe. Travel is expected to be heavily affected, and NJ Transit announced that commuters will be able to use their tickets for any transit option like buses, ferries, etc. will be able to use it for another method. Visibility is expected to be extremely reduced as well. Furthermore, due to the coastal areas having already been affected by the hurricane Sandy, this storm may push waves and water inland to a degree much further than previously seen. The storm can also end up destroying the roads, making travel impossible. Finally, many schools preemptively canceled schools in addition to several events.

Summary 2: The storm could paralyze parts of New England, and dump a foot of snow on New Jersey. Travel will be impacted. Many flights have already been canceled. Schools are preemptively canceling or delaying classes. Events are being postponed. There is a threat that snowfall rates could reach two inches per hour across large swaths of northern and central New Jersey. Snowfall rates of this magnitude could reduce visibility substantially, wreak havoc on roads and make travel dangerous, if not nearly impossible.

Critiques:

• The response inaccurately mentions that the ferry is alternative transit option during the storm.
• The last sentence doesn’t make sense
• The answer should make some mention of the expected snow
• This should point out that the storm will cause major travel delays as rain turns into snow.

Even though summarization isn’t actually a difficult task for humans and our models aren’t more capable than humans, they already provide meaningful assistance: when asked to evaluate model-written summaries, the assisted group finds 50% more flaws than the control group. For deliberately misleading summaries, assistance increases how often humans spot the intended flaw from 27% to 45%.

Scaling properties of critiques

Assistance on model-written summaries only works if the models are able to critique their own outputs. We ask humans to rate the helpfulness of model-written self-critiques, and find larger models are better at self-critiquing.

Larger models are better at self-critiquing in our topic-based summarization domain: Even though larger models have answers that are more difficult to critique, they generate more helpful critiques of their own outputs. In this plot, model scale is measured in log loss (nats) after fine-tuning. Helpfulness is determined by a human judging whether the model-generated critique of the model-generated answer is valid and useful for understanding summary quality. We filter for summaries that humans found a critique for.

We also find that large models are able to directly improve their outputs, using their self-critiques, which small models are unable to do. Using better critiques helps models make better improvements than they do with worse critiques, or with no critiques.
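
As an illustration of this critique-then-refine loop, the sketch below strings together three model calls. The `generate` function is a hypothetical placeholder for sampling from a language model, and the prompt formats are assumptions rather than the exact setup used in the paper.

```python
# Hedged sketch of the critique-then-refine loop described above.
# `generate` is a hypothetical stand-in for sampling from a language model;
# the prompt formats are illustrative assumptions, not the paper's setup.

def generate(prompt: str) -> str:
    """Placeholder for sampling a completion from a language model."""
    raise NotImplementedError

def refine_with_self_critique(text: str, question: str) -> str:
    summary = generate(f"{text}\n\nQuestion: {question}\n\nSummary:")
    critique = generate(
        f"{text}\n\nQuestion: {question}\n\nSummary: {summary}\n\nCritique:"
    )
    refined = generate(
        f"{text}\n\nQuestion: {question}\n\nSummary: {summary}\n\n"
        f"Critique: {critique}\n\nImproved summary:"
    )
    return refined
```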

Do models tell us everything they know?

To provide the best evaluation assistance on difficult tasks, we would like models to communicate all problems that they “know about.” Whenever a model correctly predicts that an answer is flawed, can the model also produce a concrete critique that humans understand?

This is particularly important for supervising models that could attempt to mislead human supervisors or hide information. We would like to train equally smart assistance models to point out what humans don’t notice.

Unfortunately, we found that models are better at discriminating than at critiquing their own answers, indicating they know about some problems that they can't or don't articulate. Furthermore, the gap between discrimination and critique ability did not appear to decrease for larger models. Reducing this gap is an important priority for our alignment research.

Next steps

An important limitation of this work is that topic-based summarization is not actually a difficult task: humans understand it quite well and it takes them only about 10 minutes to evaluate a summary. To understand the limits of AI-assisted evaluation better, we need to work with tasks that are much more difficult for humans to evaluate.

Nevertheless, these results make us optimistic that we can train models to provide humans with meaningful feedback assistance. This is an important pillar of our alignment strategy, starting with the work on debate and recursive reward modeling. In the long run, we want to build assistants that can be trusted to take on all of the cognitive labor needed for evaluation, so humans can focus on communicating their preferences.

If you’re interested in this line of research, we’re hiring! Apply for roles under “All teams” and mention your interest in scalable alignment!


source https://openai.com/blog/critiques/

Techniques for Training Large Neural Networks


Large neural networks are at the core of many recent advances in AI, but training them is a difficult engineering and research challenge which requires orchestrating a cluster of GPUs to perform a single synchronized calculation. As cluster and model sizes have grown, machine learning practitioners have developed an increasing variety of techniques to parallelize model training over many GPUs. At first glance, understanding these parallelism techniques may seem daunting, but with only a few assumptions about the structure of the computation these techniques become much more clear—at that point, you're just shuttling around opaque bits from A to B like a network switch shuttles around packets.

[Figure: illustrations of Data Parallelism, Pipeline Parallelism, Tensor Parallelism, and Expert Parallelism.]

An illustration of various parallelism strategies on a three-layer model. Each color refers to one layer and dashed lines separate different GPUs.

No Parallelism

Training a neural network is an iterative process. In every iteration, we do a pass forward through a model's layers to compute an output for each training example in a batch of data. Then another pass proceeds backward through the layers, propagating how much each parameter affects the final output by computing a gradient with respect to each parameter. The average gradient for the batch, the parameters, and some per-parameter optimization state is passed to an optimization algorithm, such as Adam, which computes the next iteration's parameters (which should have slightly better performance on your data) and new per-parameter optimization state. As the training iterates over batches of data, the model evolves to produce increasingly accurate outputs.
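
As a concrete reference point, here is a minimal single-device training iteration in PyTorch that follows the loop described above (forward pass, backward pass, Adam update). The model and data are toy placeholders chosen for illustration.

```python
# Minimal single-device training iteration: forward pass, backward pass,
# and an Adam update, matching the description above.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def train_step(batch_inputs: torch.Tensor, batch_labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    outputs = model(batch_inputs)          # forward pass through all layers
    loss = loss_fn(outputs, batch_labels)  # scalar loss for the batch
    loss.backward()                        # backward pass: gradient w.r.t. every parameter
    optimizer.step()                       # Adam consumes gradients plus its per-parameter state
    return loss.item()

# Example usage with random data:
x = torch.randn(32, 1024)
y = torch.randint(0, 10, (32,))
print(train_step(x, y))
```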

Various parallelism techniques slice this training process across different dimensions, including:

  • Data parallelism—run different subsets of the batch on different GPUs;
  • Pipeline parallelism—run different layers of the model on different GPUs;
  • Tensor parallelism—break up the math for a single operation such as a matrix multiplication to be split across GPUs;
  • Mixture-of-Experts—process each example by only a fraction of each layer.

(In this post, we'll assume that you are using GPUs to train your neural networks, but the same ideas apply to those using any other neural network accelerator.)

Data Parallelism

Data Parallel training means copying the same parameters to multiple GPUs (often called "workers") and assigning different examples to each to be processed simultaneously. Data parallelism alone still requires that your model fits into a single GPU’s memory, but lets you utilize the compute of many GPUs at the cost of storing many duplicate copies of your parameters. That being said, there are strategies to increase the effective RAM available to your GPU, such as temporarily offloading parameters to CPU memory between usages.

As each data parallel worker updates its copy of the parameters, they need to coordinate to ensure that each worker continues to have similar parameters. The simplest approach is to introduce blocking communication between workers: (1) independently compute the gradient on each worker; (2) average the gradients across workers; and (3) independently compute the same new parameters on each worker. Step (2) is a blocking average which requires transferring quite a lot of data (proportional to the number of workers times the size of your parameters), which can hurt your training throughput. There are various asynchronous synchronization schemes to remove this overhead, but they hurt learning efficiency; in practice, people generally stick with the synchronous approach.
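
The synchronous scheme above can be sketched roughly as follows in PyTorch. This assumes torch.distributed has already been initialized (for example via torchrun) with one process per GPU; in practice one would usually reach for torch.nn.parallel.DistributedDataParallel, which implements the same pattern with bucketed, overlapped communication.

```python
# Hedged sketch of synchronous data-parallel training with a blocking
# gradient average, assuming torch.distributed is already initialized
# and each process owns one GPU.

import torch
import torch.distributed as dist
import torch.nn as nn

def data_parallel_step(model: nn.Module, optimizer, inputs, labels, loss_fn):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)   # (1) forward/backward on this worker's slice of the batch
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():            # (2) blocking average of gradients across workers
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    optimizer.step()                        # (3) identical update on every worker
    return loss.item()
```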

Pipeline Parallelism

With Pipeline Parallel training, we partition sequential chunks of the model across GPUs. Each GPU holds only a fraction of parameters, and thus the same model consumes proportionally less memory per GPU.

It’s straightforward to split a large model into chunks of consecutive layers. However, there’s a sequential dependency between inputs and outputs of layers, so a naive implementation can lead to a large amount of idle time while a worker waits for outputs from the previous machine to be used as its inputs. These waiting time chunks are known as “bubbles,” wasting the computation that could be done by the idling machines.

[Figure: naive pipeline parallelism schedule. Legend: Forward, Backward, Gradient update, Idle.]

Illustration of a naive pipeline parallelism setup where the model is vertically split into 4 partitions by layer. Worker 1 hosts model parameters of the first layer of the network (closest to the input), while worker 4 hosts layer 4 (which is closest to the output). “F”, “B”, and “U” represent forward, backward and update operations, respectively. The subscripts indicate on which worker an operation runs. Data is processed by one worker at a time due to the sequential dependency, leading to large “bubbles” of idle time.

We can reuse the ideas from data parallelism to reduce the cost of the bubble by having each worker only process a subset of data elements at one time, allowing us to cleverly overlap new computation with wait time. The core idea is to split one batch into multiple microbatches; each microbatch should be proportionally faster to process and each worker begins working on the next microbatch as soon as it’s available, thus expediting the pipeline execution. With enough microbatches the workers can be utilized most of the time with a minimal bubble at the beginning and end of the step. Gradients are averaged across microbatches, and updates to the parameters happen only once all microbatches have been completed.
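
A rough sketch of microbatching on a two-stage pipeline is shown below. For clarity the microbatches are processed in a plain loop, so it illustrates the batch splitting and gradient accumulation but not the overlapped schedule a real pipeline engine such as GPipe would use; the two-GPU layout and layer sizes are assumptions.

```python
# Hedged sketch of pipeline parallelism with microbatching on two GPUs.
# Microbatches run in a simple loop, so this shows splitting and gradient
# accumulation rather than an overlapped pipeline schedule.

import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")  # first chunk of layers
stage1 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")               # second chunk of layers
params = list(stage0.parameters()) + list(stage1.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def pipeline_step(inputs: torch.Tensor, labels: torch.Tensor, n_microbatches: int = 4):
    optimizer.zero_grad()
    total_loss = 0.0
    for x, y in zip(inputs.chunk(n_microbatches), labels.chunk(n_microbatches)):
        h = stage0(x.to("cuda:0"))                  # forward on worker 0
        out = stage1(h.to("cuda:1"))                # activations sent to worker 1
        loss = loss_fn(out, y.to("cuda:1")) / n_microbatches
        loss.backward()                             # gradients accumulate across microbatches
        total_loss += loss.item()
    optimizer.step()                                # single update once all microbatches finish
    return total_loss
```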

The number of workers that the model is split over is commonly known as pipeline depth.

During the forward pass, each worker only needs to send the output (called activations) of its chunk of layers to the next worker; during the backward pass, it only sends the gradients on those activations to the previous worker. There’s a big design space of how to schedule these passes and how to aggregate the gradients across microbatches. GPipe has each worker process forward and backward passes consecutively and then aggregates gradients from multiple microbatches synchronously at the end. PipeDream instead schedules each worker to alternately process forward and backward passes.

[Figure: GPipe and PipeDream pipelining schedules. Legend: Forward, Backward, Update, Idle.]

Comparison of GPipe and PipeDream pipelining schemes, using 4 microbatches per batch. Microbatches 1-8 correspond to two consecutive data batches. In the image, “(number)” indicates on which microbatch an operation is performed and the subscript marks the worker ID. Note that PipeDream gets more efficiency by performing some computations with stale parameters.

Tensor Parallelism

Pipeline parallelism splits a model “vertically” by layer. It's also possible to “horizontally” split certain operations within a layer, which is usually called Tensor Parallel training. For many modern models (such as the Transformer), the computation bottleneck is multiplying an activation batch matrix with a large weight matrix. Matrix multiplication can be thought of as dot products between pairs of rows and columns; it's possible to compute independent dot products on different GPUs, or to compute parts of each dot product on different GPUs and sum up the results. With either strategy, we can slice the weight matrix into even-sized “shards”, host each shard on a different GPU, and use that shard to compute the relevant part of the overall matrix product before later communicating to combine the results.
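
The sketch below simulates this column-wise sharding on a single device: the weight matrix is split into shards, each shard produces its slice of the output, and the slices are concatenated. It is a toy illustration of the math rather than a multi-GPU implementation; a row-wise split would instead sum partial results, and real implementations such as Megatron-LM do the same computation with one shard per device plus collective communication.

```python
# Minimal single-process simulation of tensor (intra-layer) parallelism:
# split the weight matrix into column shards, compute each slice of the
# output independently, then concatenate (the "all-gather" step).

import torch

batch, d_in, d_out, n_shards = 8, 1024, 4096, 4

x = torch.randn(batch, d_in)
w = torch.randn(d_in, d_out)

shards = w.chunk(n_shards, dim=1)                  # each shard holds d_out / n_shards columns
partial_outputs = [x @ shard for shard in shards]  # independent per-"GPU" matmuls
y_parallel = torch.cat(partial_outputs, dim=1)     # combine slices along the output dimension

assert torch.allclose(y_parallel, x @ w, atol=1e-5)
```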

One example is Megatron-LM, which parallelizes matrix multiplications within the Transformer's self-attention and MLP layers. PTD-P uses tensor, data, and pipeline parallelism; its pipeline schedule assigns multiple non-consecutive layers to each device, reducing bubble overhead at the cost of more network communication.

Sometimes the input to the network can be parallelized across a dimension with a high degree of parallel computation relative to cross-communication. Sequence parallelism is one such idea, where an input sequence is split across time into multiple sub-examples, proportionally decreasing peak memory consumption by allowing the computation to proceed with more granularly-sized examples.

Mixture-of-Experts (MoE)

With the Mixture-of-Experts (MoE) approach, only a fraction of the network is used to compute the output for any one input. One example approach is to have many sets of weights and the network can choose which set to use via a gating mechanism at inference time. This enables many more parameters without increased computation cost. Each set of weights is referred to as “experts,” in the hope that the network will learn to assign specialized computation and skills to each expert. Different experts can be hosted on different GPUs, providing a clear way to scale up the number of GPUs used for a model.


Illustration of a mixture-of-experts (MoE) layer. Only 2 out of the n experts are selected by the gating network. (Image adapted from: Shazeer et al., 2017)
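
Below is a hedged sketch of an MoE layer with top-2 gating in the spirit of the figure above. The dimensions and the simple per-expert loop are illustrative assumptions; production implementations batch tokens per expert and place experts on separate devices.

```python
# Hedged sketch of a mixture-of-experts layer with top-2 gating.
# Sizes and the per-expert loop are illustrative, not a production design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). The gating network picks k experts per token.
        scores = self.gate(x)                           # (tokens, n_experts)
        weights, indices = scores.topk(self.k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for i in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, i] == e               # tokens routed to expert e in slot i
                if mask.any():
                    out[mask] += weights[mask, i : i + 1] * expert(x[mask])
        return out

layer = MoELayer(d_model=64, d_hidden=256, n_experts=8)
print(layer(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```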

GShard scales an MoE Transformer up to 600 billion parameters with a scheme where only the MoE layers are split across multiple TPU devices and other layers are fully duplicated. Switch Transformer scales model size to trillions of parameters with even higher sparsity by routing one input to a single expert.

Other Memory Saving Designs

There are many other computational strategies to make training increasingly large neural networks more tractable. For example:

  • To compute the gradient, you need to have saved the original activations, which can consume a lot of device RAM. Checkpointing (also known as activation recomputation) stores any subset of activations and recomputes the intermediate ones just-in-time during the backward pass. This saves a lot of memory at the computational cost of at most one additional full forward pass. One can also continually trade off between compute and memory cost by selective activation recomputation, which is checkpointing subsets of the activations that are relatively more expensive to store but cheaper to compute (see the sketch after this list).

  • Mixed Precision Training trains models using lower-precision numbers (most commonly FP16). Modern accelerators can reach much higher FLOP counts with lower-precision numbers, and you also save on device RAM. With proper care, the resulting model can lose almost no accuracy (also shown in the sketch after this list).

  • Offloading temporarily moves unused data to the CPU or among different devices, reading it back later when needed. Naive implementations will slow down training a lot, but sophisticated implementations will pre-fetch data so that the device never needs to wait on it. One implementation of this idea is ZeRO, which splits the parameters, gradients, and optimizer states across all available hardware and materializes them as needed.

  • Memory Efficient Optimizers have been proposed to reduce the memory footprint of the running state maintained by the optimizer, such as Adafactor.

  • Compression can also be used for storing intermediate results in the network. For example, Gist compresses activations that are saved for the backward pass; DALL·E compresses the gradients before synchronizing them.
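
As referenced in the first two bullets above, here is a minimal PyTorch sketch that combines activation checkpointing with mixed precision training. The toy model, layer sizes, and single-GPU setup are assumptions for illustration.

```python
# Hedged sketch combining activation checkpointing and mixed precision.
# Activations inside each block are recomputed during the backward pass
# instead of being stored, and the forward pass runs in lower precision.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

blocks = nn.ModuleList(
    nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)
).cuda()
head = nn.Linear(1024, 10).cuda()
params = list(blocks.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)
scaler = torch.cuda.amp.GradScaler()   # keeps FP16 gradients from underflowing
loss_fn = nn.CrossEntropyLoss()

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # run the forward pass in lower precision
        h = x
        for block in blocks:
            h = checkpoint(block, h, use_reentrant=False)  # recompute this block's activations in backward
        loss = loss_fn(head(h), y)
    scaler.scale(loss).backward()              # scaled backward pass
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

x = torch.randn(32, 1024, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")
print(train_step(x, y))
```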


At OpenAI, we are training and improving large models, from the underlying infrastructure all the way to deploying them for real-world problems. If you’d like to put the ideas from this post into practice—especially relevant for our Scaling and Applied Research teams—we’re hiring!


Acknowledgments
Thanks to Nikolas Tezak, Sam Altman, Daniel Gackle, Ilya Sutskever, and Steve Dowling for feedback on drafts. Thanks to Justin Jay Wang, Bianca Martin, and Steve Dowling for communications and design.

source https://openai.com/blog/techniques-for-training-large-neural-networks/