---
title: "Scaling Model Training with More Compute, How Do They Do It?"

format:
  revealjs:
    theme: moon
    fig-format: png
---

## Who am I?

- Zachary Mueller
- Technical Lead for the 🤗 Accelerate project
- API design geek

## Understanding GPU Memory Usage

- We can somewhat estimate the memory usage in vanilla full fine-tuning of models
- Requires certain assumptions (that I'll be covering):
  - Adam optimizer
  - Batch size of 1

## Understanding GPU Memory Usage

General estimate (`bert-base-cased`, 108M params):

- Each parameter is 4 bytes
- Backward pass ~= 2x the model size
- The optimizer step ~= 4x the model size (1x model, 1x gradients, 2x optimizer states; sketched in code below):

::: {style="font-size: 50%;"}
| dtype   | Model      | Gradients | Backward pass | Optimizer step | Highest   |
|---------|:-----------|:---------:|:-------------:|:--------------:|:---------:|
| float32 | 413.18 MB  | 413.18 MB | 826.36 MB     | 1.61 GB        | 1.61 GB   |
| float16 | 413.18 MB* | 619.77 MB | 826.36 MB     | 826.36 MB      | 826.36 MB |

*All estimations were based on the [Model Estimator Tool](https://huggingface.co/spaces/hf-accelerate/model-memory-usage)
:::
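
A minimal sketch of where those multipliers come from (Adam, batch size 1, fp32); the estimator tool above is the source of truth, so treat these as approximations:

```python
# Back-of-the-envelope version of the table above.
def estimate_memory(num_params: int, bytes_per_param: int = 4) -> dict:
    model = num_params * bytes_per_param  # each fp32 parameter is 4 bytes
    gradients = model                     # one gradient per parameter
    backward = 2 * model                  # backward pass ~= 2x the model size
    optimizer = 4 * model                 # 1x model + 1x grads + 2x Adam states
    return {
        name: f"{size / 2**20:,.2f} MB"
        for name, size in [
            ("model", model), ("gradients", gradients),
            ("backward", backward), ("optimizer_step", optimizer),
        ]
    }

print(estimate_memory(108_000_000))
# ~412 MB model, ~824 MB backward, ~1,648 MB (~1.61 GB) optimizer step
```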


## Scaling Up

This works fine for small models, since we have cards with anywhere from 12-24 GB of GPU memory (on the GPU-poor side).

But what happens as we scale?

Here's `llama-3-8B` (8.03B parameters):

::: {style="font-size: 50%;"}
| dtype   | Model     | Gradients | Backward pass | Optimizer step | Highest   |
|---------|:----------|:---------:|:-------------:|:--------------:|:---------:|
| float32 | 28.21 GB  | 28.21 GB  | 56.43 GB      | 112.84 GB      | 112.84 GB |
| float16 | 28.21 GB* | 42.32 GB  | 56.43 GB      | 56.43 GB       | 56.43 GB  |
:::

Well, *I* don't have 56 GB of GPU memory in a single card, let alone 112 GB.

What can we do?


# Distributed Training


## Kinds of Training

* Single GPU:
  * No distributed techniques at play
* Distributed Data Parallelism (DDP):
  * A full copy of the model exists on each device, but data is chunked between each GPU
* Fully Sharded Data Parallelism (FSDP) & DeepSpeed (DS):
  * Split chunks of the model and optimizer states across GPUs, allowing for training bigger models on smaller (multiple) GPUs
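
As a rough sketch, here's how those strategies map to raw PyTorch wrappers (🤗 Accelerate applies the right one for you from your config; this assumes `torch.distributed` is already initialized with one process per GPU):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_model(model: nn.Module, strategy: str) -> nn.Module:
    if strategy == "ddp":
        return DDP(model)   # full replica per GPU; gradients all-reduced
    if strategy == "fsdp":
        return FSDP(model)  # params, grads, and optimizer states sharded
    return model            # single GPU: no wrapper needed
```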


# Fully Sharded Data Parallelism


*(Diagram: the model's layers split into shards, one per GPU)*

:::{.notes}
* Take the model and split it across `n` GPUs
* During the backward pass, each GPU computes the gradients for its shard
* All gradients are then synchronized and the final full model gradient is calculated
* The optimizer step can then be performed on each shard
:::


## Important FSDP Parameters


* Different parameters can dictate how much memory is needed when training across multiple GPUs
* These include how the model weights are sharded, how the gradients are sharded, and more
* I'll cover the important ones I needed when doing a full fine-tune of Llama-3-8B *without PEFT* on 2x 4090s


## `fsdp_sharding_strategy`

* Dictates how resources get divided up across GPUs (see the sketch below)
  * `FULL_SHARD`: Includes optimizer states, gradients, and parameters
  * `SHARD_GRAD_OP`: Includes optimizer states and gradients
  * `NO_SHARD`: Normal DDP
  * `HYBRID_SHARD`: Includes optimizer states, gradients, and parameters, but each node has the full model
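
In raw PyTorch these correspond to the `ShardingStrategy` enum (a sketch; with 🤗 Accelerate you set `fsdp_sharding_strategy` in your config instead, and a distributed process group must already be initialized):

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

model = nn.Linear(8, 8)  # stand-in for your real model

# FULL_SHARD: shard optimizer states, gradients, and parameters
sharded = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
# Alternatives: SHARD_GRAD_OP, NO_SHARD (DDP-like), HYBRID_SHARD
```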

:::{.notes}
FULL_SHARD:
Parameters, Gradients, Optimizer States: All are sharded.
Parameters Handling: Unshard before the forward pass, reshard after the forward pass, unshard before the backward pass, reshard after the backward pass.
Gradients Handling: Synchronize and shard after the backward pass.
Optimizer States: Updated locally per rank.

SHARD_GRAD_OP:
Gradients and Optimizer States: Sharded during computation.
Parameters: Unshard before the forward pass and remain unsharded until after the backward pass, then reshard.
Inside no_sync(): Parameters are not resharded after backward computation.
Optimizer States: Updated locally per rank.

NO_SHARD:
Parameters, Gradients, Optimizer States: Not sharded; replicated across ranks.
Gradients Handling: Synchronized via all-reduce after the backward pass.
Optimizer States: Updated locally per rank.

HYBRID_SHARD:
Parameters, Gradients, Optimizer States: FULL_SHARD within a node, with parameters replicated across nodes.
Communication: Expensive operations like all-gathers and reduce-scatters are limited to within a node, enhancing performance for medium-sized models.
:::


## `fsdp_auto_wrap_policy`

* Dictates how the model should be split into shardable units
* Can be either `TRANSFORMER_BASED_WRAP` or `SIZE_BASED_WRAP`
  * `TRANSFORMER`/`fsdp_transformer_layer_cls_to_wrap`:
    * Need to declare the layer class to wrap
    * Generally `transformers` has good defaults
  * `SIZE`/`fsdp_min_num_params`:
    * Minimum number of total parameters a layer needs before it's wrapped into its own shard
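
For example, in an `accelerate` config file (`LlamaDecoderLayer` here is just an illustrative choice for Llama models):

```yaml
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  # the transformer block class to treat as one unit;
  # usually inferred automatically for transformers models
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```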


## `fsdp_offload_params`

* Offloads the parameters and gradients to the CPU if they can't fit into GPU memory
* Allows you to train much larger models locally, but it will be much slower

> Case: a full fine-tune (FFT) of Llama-3-8B with `fsdp_offload_params` on 2x 4090 GPUs took ~72 hours, vs. an hour or two when using 1x H100
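
In the config this is a single flag under `fsdp_config`:

```yaml
fsdp_config:
  # trade training speed for the ability to fit the model at all
  fsdp_offload_params: true
```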


## `fsdp_cpu_ram_efficient_loading`

* Uses the idea behind big model inference/the `meta` device to load the model for training in a low-CPU-RAM scenario
* Rather than needing `model_size` * `n_gpus` worth of RAM, we can load the model on a single process and then send the weights directly to each shard when the time is right via `sync_module_states`
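
Both flags appear in the full config shown later; together they look like:

```yaml
fsdp_config:
  # only one process loads the weights; the rest start on the `meta` device
  fsdp_cpu_ram_efficient_loading: true
  # broadcast the loaded weights from rank 0 out to every shard
  fsdp_sync_module_states: true
```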


# 🤗 Accelerate



* So far we've covered the theory, but how do we put it into practice?
* By using a library that's at the heart of the entire open-source ecosystem

::: {style="font-size: 60%;padding-left:10%;padding-top:0%;"}
* Nearly all of 🤗
* `axolotl`
* `fastai`
* `FastChat`
* `lucidrains`
* `kornia`
:::

Are you using it without even knowing it?


## What is 🤗 Accelerate?

```{mermaid}
%%| fig-height: 6
graph LR
A(("🤗 Accelerate#32;"))
A --> B["CLI Interface#32;"]
A --> C["Training Library#32;"]
A --> D["Big Model<br>Inference#32;"]
```

## A CLI Interface

* `accelerate config`
  * Configure the environment
* `accelerate estimate-memory`
  * How to guess vRAM requirements (example below)
* `accelerate launch`
  * How to run your script
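
A sketch of reproducing the estimates from earlier (flag names from my recollection of the CLI; check `accelerate estimate-memory --help`):

```bash
# estimate vRAM for training bert-base-cased in fp32 and fp16
accelerate estimate-memory bert-base-cased --library_name transformers \
    --dtypes float32 float16
```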

## Launching distributed training is hard

- ```bash
  python script.py
  ```

- ```bash
  torchrun --nnodes=1 --nproc_per_node=2 script.py
  ```

- ```bash
  deepspeed --num_gpus=2 script.py
  ```

How can we make this better?

## `accelerate launch`

```bash
accelerate launch script.py
```
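
It also composes with a saved config file, like the ones on the next slide (`--batch_size` here is a hypothetical argument belonging to your own script):

```bash
# anything after script.py is passed through to your script untouched
accelerate launch --config_file fsdp_config.yaml script.py --batch_size 8
```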

## `accelerate config`

* Relies on `config.yaml` files
* Either run `accelerate config` to generate one or write your own:

:::: {.columns style="font-size: 50%;padding-left:10%;"}
::: {.column width="40%"}
```{.yaml filename=ddp_config.yaml}
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
:::

::: {.column width="40%"}
```{.yaml filename=fsdp_config.yaml}
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
:::
::::

# A Training Library

## A Training Library: The Code

:::: {.columns style="font-size: 50%;"}
::: {.column}
<br><br><br>
```{.python code-line-numbers="5-6,9"}
# For alignment purposes
for batch in dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
```
:::
::: {.column}
```{.python code-line-numbers="1-7,12-13,16"}
from accelerate import Accelerator
accelerator = Accelerator()
dataloader, model, optimizer, scheduler = (
    accelerator.prepare(
        dataloader, model, optimizer, scheduler
    )
)

for batch in dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    # inputs = inputs.to(device)
    # targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss)  # loss.backward()
    optimizer.step()
    scheduler.step()
```
:::

::::

## A Training Library: How Scaling Works

* Accelerate's DataLoaders and schedulers work off of a sharding mindset
* Rather than repeating the same data across `n` nodes, we instead split it
  * Speeds up training roughly linearly with the number of GPUs
* Given a batch size of 16 on a single GPU, to recreate this across 8 GPUs you would use a batch size of 2 (see below)
* This also means the scheduler is stepped `n` times (once per GPU) per "global step"
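
A quick sanity check of that arithmetic (a sketch of the bookkeeping, not Accelerate's internals):

```python
# Recreate a single-GPU batch size of 16 across 8 GPUs
num_processes = 8
observed_batch_size = 16                       # what training "sees" per step
per_device_batch_size = observed_batch_size // num_processes      # -> 2
samples_per_global_step = per_device_batch_size * num_processes   # -> 16
scheduler_steps_per_global_step = num_processes                   # once per GPU
```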

## A Training Library: Mixed Precision

* This may be a bit different from your "normal" idea of mixed precision
* We do **not** convert the model weights to BF16/FP16
* Instead we **wrap the forward pass** with `autocast`, which converts to half precision automatically during the forward pass
* This preserves the original precision of the weights, which leads to stable training and better fine-tuning later on
* **If you use `.bf16()` weights, you are STUCK in bf16 permanently**
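
A sketch of that difference in raw PyTorch (`model`, `inputs`, etc. as in the earlier loop; with 🤗 Accelerate you'd simply pass `mixed_precision="bf16"` to `Accelerator`):

```python
import torch

# DON'T: this permanently casts the weights; you're stuck in bf16
# model = model.to(torch.bfloat16)

# DO: weights stay fp32; only the forward computation runs in bf16
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
loss.backward()   # gradients accumulate into the fp32 weights
optimizer.step()
```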

## A Training Library: Mixed Precision

* Let's tie this back to the memory estimator, using tools like NVIDIA's TransformerEngine and MS-AMP

::: {style="font-size: 60%;"}
| Optimization Level | Computation (GEMM) | Comm | Weight | Master Weight | Weight Gradient | Optimizer States |
| -- | -- | -- | -- | -- | -- | -- |
| FP16 AMP | FP16 | FP32 | FP32 | N/A | FP32 | FP32+FP32 |
| Nvidia TE | FP8 | FP32 | FP32 | N/A | FP32 | FP32+FP32 |
| MS-AMP O1 | FP8 | FP8 | FP16 | N/A | FP8 | FP32+FP32 |
| MS-AMP O2 | FP8 | FP8 | FP16 | N/A | FP8 | FP8+FP16 |
| MS-AMP O3 | FP8 | FP8 | FP8 | FP16 | FP8 | FP8+FP16 |
:::
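
In 🤗 Accelerate, opting into FP8 is one argument (a sketch; it assumes FP8-capable hardware such as an H100, plus TransformerEngine or MS-AMP installed):

```python
from accelerate import Accelerator

# compatible layers run in FP8; the rest of the forward pass stays in BF16
accelerator = Accelerator(mixed_precision="fp8")
```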

:::{.notes}
What is actually happening:

* Linear layers and certain other compatible layers are wrapped in a special version that allows for FP8 computation
* The general forward pass is wrapped in BF16 autocast
* This means the most memory is saved on the gradients of the model, *not* the model itself
* With tools like `MS-AMP` we can convert more chunks to lower precision, but as before, training stays stable when the model's weights are in full precision and backprop happens in full precision too
:::


## FSDP vs. DeepSpeed

* Extremely similar, but they use different naming conventions for the same ideas and differ slightly in implementation

::: {style="font-size: 50%;"}
Framework | Model Loading (`torch_dtype`) | Mixed Precision | Preparation (Local) | Training | Optimizer (Local)
--|--|--|--|--|--
FSDP | bf16 | default (none) | bf16 | bf16 | bf16
FSDP | bf16 | bf16 | fp32 | bf16 | fp32
DeepSpeed | bf16 | bf16 | fp32 | bf16 | fp32
:::

To learn more, check out the [documentation](https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed) or join my office hours

## Key Takeaways

* You can scale out training with `accelerate`, FSDP, and DeepSpeed across multiple GPUs to train bigger models
* Techniques like `FP8` can help speed up training and reduce computational overhead
  * This comes at a cost of end precision, and can lock the model weights for further fine-tunes if you're not careful

## Some Handy Resources

- [🤗 Accelerate documentation](https://hf.co/docs/accelerate)
- [Launching distributed code](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
- [Distributed code and Jupyter Notebooks](https://huggingface.co/docs/accelerate/basic_tutorials/notebook)
- [Migrating to 🤗 Accelerate easily](https://huggingface.co/docs/accelerate/basic_tutorials/migration)
- [Big Model Inference tutorial](https://huggingface.co/docs/accelerate/usage_guides/big_modeling)
- [DeepSpeed and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/deepspeed)
- [Fully Sharded Data Parallelism and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/fsdp)
- [FSDP vs DeepSpeed In-Depth](https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed)