by Chenyan Xiong from MSR
-
Data
Pretraining data quality is really important!
Cleaning and filtering are necessary (even simply adding a fastText quality classifier can improve quality; see the sketch after this list).
- BERT-size pretraining data: Wikipedia + BooksCorpus, ~16GB of text
- XLNet- and RoBERTa-size pretraining data: Wikipedia + additional web and news corpora, ~100-200GB of text
- T5-size pretraining data: C4, ~745GB of text
- Larger high-quality pretraining dataset: ClueWeb22 (Bing’s 34TB high-quality web corpus, made available to the research community)
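As a concrete example of the kind of cheap quality filter mentioned above, here is a minimal sketch of a fastText-based classifier; the label names, training file, and threshold are hypothetical choices for illustration, not the exact setup behind any of the corpora above.

```python
# Sketch of a fastText-based quality filter (hypothetical labels, files, and threshold).
# Train a classifier on "high-quality" vs. "low-quality" text, then keep only documents
# the classifier scores above a confidence threshold.
import fasttext

# quality_train.txt: one document per line, prefixed with __label__hq or __label__lq
model = fasttext.train_supervised(input="quality_train.txt", lr=0.1, epoch=5, wordNgrams=2)

def keep(document: str, threshold: float = 0.9) -> bool:
    """Return True if the classifier is confident the document is high quality."""
    labels, probs = model.predict(document.replace("\n", " "))
    return labels[0] == "__label__hq" and probs[0] >= threshold

corpus = ["Some well-written article text ...", "buy cheap $$$ click here !!!"]
filtered = [doc for doc in corpus if keep(doc)]
```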
- Pretraining Task
-
Pretraining Task Basics: an MLM-style task is simply the most effective pretraining task
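For reference, a minimal sketch of the standard BERT-style MLM corruption (mask 15% of tokens; of those, 80% become [MASK], 10% a random token, 10% stay unchanged). Special-token handling and whole-word masking are omitted.

```python
# Minimal BERT-style MLM masking sketch (15% of tokens; of those, 80% -> [MASK],
# 10% -> random token, 10% -> unchanged). Special-token handling is omitted.
import torch

def mlm_mask(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
             mask_prob: float = 0.15):
    labels = input_ids.clone()
    # Choose which positions take part in the MLM objective.
    selected = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    labels[~selected] = -100                 # ignored by the cross-entropy loss

    corrupted = input_ids.clone()
    rand = torch.rand_like(input_ids, dtype=torch.float)
    corrupted[selected & (rand < 0.8)] = mask_token_id            # 80% -> [MASK]
    random_pos = selected & (rand >= 0.8) & (rand < 0.9)          # 10% -> random token
    corrupted[random_pos] = torch.randint(vocab_size, (int(random_pos.sum()),))
    # The remaining 10% of selected positions keep their original token.
    return corrupted, labels
```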
-
Dynamic Training
Heuristic curriculum learning can be used to improve pretraining (a toy schedule is sketched below)
MLM training tends to make slower progress later in the run, which curriculum-style schedules try to counteract
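One simple heuristic curriculum, purely as an illustration (not necessarily the scheme referred to above): ramp the masking ratio up over training so the MLM task gets harder as the easy cases are learned. The start/end ratios are placeholder values.

```python
# Illustrative curriculum: linearly increase the MLM masking ratio over training.
# start/end ratios are placeholder assumptions, not values from the talk.
def mask_ratio_schedule(step: int, total_steps: int,
                        start: float = 0.10, end: float = 0.25) -> float:
    progress = min(step / total_steps, 1.0)
    return start + (end - start) * progress
```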
-
Application Specific
Some downstream tasks require certain task-specific information/knowledge/semantics/signals
In question answering, focus more on named entities: use Salient Span Masking during pretraining
In dense retrieval, focus more on the full-sequence embedding: use sequence contrastive learning (a sketch follows below)
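A minimal sketch of sequence contrastive learning with in-batch negatives (an InfoNCE-style loss). How the two views of each sequence are produced (e.g., cropping, dropout, neighboring spans) varies by method and is left out here as an assumption.

```python
# Sequence contrastive learning with in-batch negatives (InfoNCE-style loss).
# emb_a and emb_b are embeddings of two views of the same batch of sequences.
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float = 0.05):
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature          # [batch, batch] similarity matrix
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    return F.cross_entropy(logits, targets)           # positives sit on the diagonal
```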
- Model Architecture
-
Transformer Basics: easier to train than RNNs, powerful and robust
Decoder-only models are much easier to train and easy to scale up
Bidirectional encoder models are very hard to scale up, even to ~10B parameters
Enc-Dec: its training and scaling behavior is very close to that of encoder models
Decoders can do the encoders’ work, but the performance is much worse due to the pretraining task and the single-direction (causal) attention
-
Notable Upgrades
Making the Transformer architecture even a little more complicated can make pretraining less robust and wipe out the gains from the architecture modification
Two notable changes that bring robust benefits:
- Positional embeddings: relative position embeddings have more expressive power (a sketch follows after this list)
- Sparse attention patterns: efficiency gains by design; MLM is a local task anyway, and the context length is fairly fixed
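For illustration, a simplified relative-position attention bias in the spirit of Shaw et al./T5 (T5's actual bucketing scheme, and rotary-style embeddings, differ in detail): a learned bias indexed by the clipped relative distance is added to the attention logits.

```python
# Simplified relative-position bias: one learned scalar per head and per clipped
# relative distance, added to the attention logits instead of absolute embeddings.
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    def __init__(self, num_heads: int, max_distance: int = 128):
        super().__init__()
        self.max_distance = max_distance
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]                         # [seq, seq] of i - j
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        return self.bias(rel).permute(2, 0, 1)                    # [heads, seq, seq]

# attention_logits = q @ k.transpose(-2, -1) / d**0.5 + RelativePositionBias(h)(seq_len)
```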
-
Thoughts
Benefits from Transformer architecture changes often diminish, especially when pretraining with tons of data on large models (“the bitter lesson”)
Architecture design no longer seems like the most efficient way to inject inductive bias
- In pretraining, inductive bias may not be as important as efficiency or optimization-friendliness
- After pretraining, the model, not the designer, decides the exploration space (prompt inputs might be more effective)
- Data and training signals carry more information; the model just consumes them
- Optimization
-
Optimization Basics
SGD has various limitations; momentum carries information from past batches and is more stable
Adam is still one of the best choices 7 years later (a minimal update sketch follows this list):
- Needs a burn-in period for the momentum estimates
- Sometimes the betas need slight tuning
- Simple, elegant, and often works the best
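A minimal sketch of the Adam update for reference; the bias-correction terms are what make the early burn-in period matter while the moment estimates warm up.

```python
# Minimal Adam update (Kingma & Ba). m and v are the running 1st and 2nd moments,
# t is the 1-indexed step count used for bias correction.
import torch

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m.mul_(beta1).add_(grad, alpha=1 - beta1)             # 1st moment (momentum)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)   # 2nd moment
    m_hat = m / (1 - beta1 ** t)                          # bias correction
    v_hat = v / (1 - beta2 ** t)
    param.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)
    return param, m, v
```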
-
Reducing the Memory Footprint of Optimizer
The main cost of Adam: GPU memory usage of the optimizer states
Total memory cost: parameters + gradients + 1st-moment estimates + 2nd-moment estimates (a worked example follows below)
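A back-of-the-envelope version of that accounting, following the byte counts used in the ZeRO paper (fp16 weights and gradients plus fp32 master weights and both Adam moments):

```python
# Mixed-precision training with Adam: 2 bytes fp16 params + 2 bytes fp16 grads
# + 12 bytes fp32 optimizer state (master params, 1st moment, 2nd moment)
# = 16 bytes per parameter, before activations.
def training_memory_gb(num_params: float) -> float:
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    return num_params * bytes_per_param / 1e9

print(training_memory_gb(1.5e9))   # a 1.5B-parameter model: ~24 GB before activations
```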
-
Stable Optimization
A critical component of pretraining
Tons of divergences. Very painfully, many divergence points only appear in large-scale models
Things that help stabilize training (see the sketch after this list):
- Tuning the learning rate and scheduler (balancing stability and efficiency)
- Gradient norm clipping (trims outliers in the stochastic gradients)
- Dynamic layer-wise scaling (smaller starting weights for deeper layers)
Some initialization choices still help even after a month of training
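A minimal sketch of the first two stabilizers: a linear-warmup + cosine-decay learning-rate schedule and gradient-norm clipping before each optimizer step. The warmup length, peak learning rate, and clipping threshold are placeholder values, not recommendations from the talk.

```python
# Warmup + cosine-decay LR schedule and gradient-norm clipping (placeholder values).
import math
import torch

def lr_schedule(step, warmup_steps=2000, total_steps=100000, peak_lr=1e-4, min_lr=1e-5):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps              # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Inside the training loop:
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step()
```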
- Scaling
-
Parallel Training Basics
Communication between GPUs is no longer the bottleneck
Previously (without InfiniBand), the communication overhead was noticeable
Communication frequency can be reduced with gradient accumulation (a sketch follows below)
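A sketch of gradient accumulation; `model`, `dataloader`, and `optimizer` are assumed to exist, a HuggingFace-style `.loss` output is assumed, and the number of accumulation steps is a placeholder.

```python
# Gradient accumulation: run several micro-batches per optimizer step, so gradients are
# all-reduced (synchronized across GPUs) less often. With DDP, wrap the non-final
# backward passes in model.no_sync() so only the last micro-batch triggers communication.
accumulation_steps = 8   # placeholder; choose to reach the target global batch size

for step, batch in enumerate(dataloader):
    loss = model(**batch).loss / accumulation_steps   # scale so the sum matches one large batch
    loss.backward()                                    # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```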
-
ZeRO Optimizer
Addresses GPU memory consumption, which has become a big bottleneck
The optimizer costs the most memory (2 bytes per parameter for fp16 weights + 2 for fp16 gradients + K for optimizer states)
ZeRO partitions the parameters, gradients, and optimizer states across GPUs (see the sketch below)
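Per-GPU memory per parameter under the three ZeRO stages, using the same 2 + 2 + K accounting (K ≈ 12 for Adam in mixed precision) and N data-parallel GPUs, following the ZeRO paper:

```python
# Per-GPU bytes per parameter under the three ZeRO stages (N data-parallel GPUs).
def zero_bytes_per_param(stage: int, n_gpus: int, k: int = 12) -> float:
    if stage == 1:   # partition optimizer states only
        return 2 + 2 + k / n_gpus
    if stage == 2:   # also partition gradients
        return 2 + 2 / n_gpus + k / n_gpus
    if stage == 3:   # also partition parameters
        return (2 + 2 + k) / n_gpus
    raise ValueError("stage must be 1, 2, or 3")

print(zero_bytes_per_param(3, n_gpus=64))  # ~0.25 bytes/param: a 1.5B model fits easily
```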
-
Challenges
No clear mapping of optimal design choices from the base scale to XXXL scale
Hard to conduct thorough research
Many problems only emerge at large scale
Once there is a successful run it is easy to speed it up, but the first run is the most challenging one
-
Downstream Usage
Familiar patterns include: enriching the LM (retrieval augmentation) / composition (chaining transformers) / zero- and few-shot learning / finetuning
A New Regime
Differences from the pre-BERT era:
- AI systems now respond to our interventions quite differently
- Different bottlenecks in the full ecosystem; many times the challenge is not on the modeling side
- Different model capacity, abilities, and behaviors
New Directions
-
Improving the efficiency of ML models with a holistic view of the full ML stack
Even changing the model to relative position embeddings requires modifying the CUDA/Apex code
-
Data-Driven AI
Training data as a new way to convey inductive biases
The model has the capacity and ability to pick up whatever is in the data, whether it is correct or wrong
-
Understanding and Theory
The generalization power of pretraining is mostly observed in NLP rather than CV
In NLP, even when finetuning with tons of supervised labels, the benefits of pretraining still exist
Tips for Continual Pretraining Task Design
- Do not twist the pretrained checkpoint too much; too much twisting during continual pretraining can damage the original pretrained checkpoint
- When doing continual pretraining for knowledge-intensive tasks, design tasks that enhance one specific ability of the model; otherwise much of what continual pretraining learns may duplicate what the original pretraining already covered