update readme
parent 6dd4907629 · commit 0f8acb8882

@@ -27,10 +27,10 @@ Open-Llama is an open-source project that provides a complete set of tools for building large language models

Since training large language models is costly, high performance is also very important when building them. To achieve high-performance training, we use the following techniques:

-- **Fused CUDA kernel**: Using the fused CUDA kernels provided by xformers fuses multiple operations together, reducing data transfer between the GPU and CPU and improving training efficiency.
+- **Fused CUDA kernel**: Using the fused CUDA kernels provided by [xformers](https://github.com/facebookresearch/xformers) fuses multiple operations together, reducing data transfer between the GPU and CPU and improving training efficiency.
-- **Parallel training**: We use the Accelerate library to support parallel training on multiple GPUs to speed up training.
+- **Parallel training**: We use the [Accelerate](https://huggingface.co/docs/accelerate/index) library to support parallel training on multiple GPUs to speed up training.
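
To make the fused-kernel point concrete, here is a minimal sketch of swapping a naive PyTorch attention computation for xformers' fused memory-efficient attention. It is an illustrative example only, with made-up shapes and a toy reference implementation, not the attention code used in this repository.

```python
# Minimal sketch (not this repository's model code): replacing naive PyTorch
# attention with the fused, memory-efficient kernel from xformers.
import torch
import xformers.ops as xops

B, S, H, D = 2, 512, 8, 64          # batch, sequence length, heads, head dim
q = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)

def naive_attention(q, k, v):
    # Several separate kernels and a full S x S attention matrix in memory.
    qh, kh, vh = (t.transpose(1, 2) for t in (q, k, v))   # (B, H, S, D)
    scores = qh @ kh.transpose(-2, -1) / D ** 0.5
    return (scores.softmax(dim=-1) @ vh).transpose(1, 2)

ref = naive_attention(q, k, v)                      # same result up to numerical precision
out = xops.memory_efficient_attention(q, k, v)      # one fused call, no S x S matrix in memory
```

A causal mask can be passed through the `attn_bias` argument of the same call when needed.
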
-For the 7B model, the training speed with the PyTorch-native Llama implementation in Transformers is 1378 tokens/s/GPU, while this codebase reaches 3290 tokens/s/GPU, essentially matching the 3370 tokens/s/GPU reported in the Llama paper.
+For the 7B model, the training speed with the PyTorch-native Llama implementation in Transformers is 1378 tokens/s/GPU, while this codebase reaches 3290 tokens/s/GPU, essentially matching the 3370 tokens/s/GPU reported in the [Llama paper](https://arxiv.org/pdf/2302.13971.pdf).
Pretraining on 500B tokens takes about 43,000 GPU hours. At Google Cloud's A100-80G Spot price of $12.6 per hour for 8 GPUs, the total cost is $67,725.
Training with the unaccelerated version would cost $158,744, so the acceleration saves roughly $90,000 in training cost.
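
As a back-of-the-envelope check on these figures, the arithmetic can be reproduced as follows (illustrative only; the speeds and the $12.6/hour price are the numbers quoted above, everything else is an assumption):

```python
# Rough reconstruction of the cost arithmetic quoted above (illustration only).
tokens = 500e9                      # 500B-token pretraining budget
price_per_8gpu_hour = 12.6          # Google Cloud A100-80G Spot, 8 GPUs, USD/hour

def cost(tokens_per_s_per_gpu: float) -> float:
    gpu_hours = tokens / tokens_per_s_per_gpu / 3600
    return gpu_hours / 8 * price_per_8gpu_hour

print(cost(3290))   # accelerated:   ~$66,500 (about $67,725 with GPU hours rounded up to 43,000)
print(cost(1378))   # unaccelerated: ~$158,700
```
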

### Versatility

@@ -108,7 +108,7 @@ the Self Attention computation, which gives a clear performance improvement of about 30%.
```bash
accelerate launch --config_file configs/default_config.yaml pretrain_llama.py
```
-We use Wandb to visualize training; you need to set the environment variable WANDB_API_KEY yourself.
+We use [Wandb](https://wandb.ai/) to visualize training; you need to set the environment variable WANDB_API_KEY yourself.
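
For reference, a minimal sketch of how that key ends up being used by Wandb at run time (placeholder key and a hypothetical project name; not this repository's actual logging code):

```python
# Minimal sketch of Wandb logging driven by the WANDB_API_KEY environment
# variable mentioned above (placeholder values, not the project's real code).
import os
import wandb

os.environ.setdefault("WANDB_API_KEY", "<your-api-key>")   # or export it in the shell

run = wandb.init(project="open-llama-pretrain")             # hypothetical project name
run.log({"train/loss": 2.71, "tokens_per_s_per_gpu": 3290})
run.finish()
```
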
We use DeepSpeed stage 1 here to reduce GPU memory usage. The Accelerate-related configuration can be found in configs/default_config.yaml.

@@ -24,11 +24,11 @@ We believe that ease of use is one of the most important features when building
### High Performance

Since training large language models is costly, high performance is also crucial when building large-scale language models. To achieve high-performance training, we employ the following techniques:

-- **Fused CUDA kernel**: Using fused CUDA kernels provided by xformers can fuse multiple operations together, reducing data transfer between the GPU and CPU and improving training efficiency.
+- **Fused CUDA kernel**: Using fused CUDA kernels provided by [xformers](https://github.com/facebookresearch/xformers) can fuse multiple operations together, reducing data transfer between the GPU and CPU and improving training efficiency.
-- **Parallel training**: We use the Accelerate library to support parallel training on multiple GPUs, accelerating the training process.
+- **Parallel training**: We use the [Accelerate](https://huggingface.co/docs/accelerate/index) library to support parallel training on multiple GPUs, accelerating the training process.
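
As an illustration of what the Accelerate library does for multi-GPU training, here is a minimal generic training-loop sketch; it is not the actual pretrain_llama.py code, and the tiny model and random data are placeholders.

```python
# Generic multi-GPU training loop with Accelerate (illustrative sketch only,
# not the actual pretrain_llama.py training loop).
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()                        # picks up the launch configuration
model = torch.nn.Linear(128, 1)                    # placeholder for the Llama model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=32)

# prepare() moves everything to the right device(s) and wraps the model and
# dataloader for distributed execution.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)
    accelerator.backward(loss)                     # handles gradient synchronization
    optimizer.step()
```

Launched with `accelerate launch`, the same script runs unchanged on a single GPU or across several.
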
-For the 7B model, the training speed of the PyTorch-native Llama implementation in the Transformers library is 1378 tokens/s/GPU. With our code, the training speed reaches 3290 tokens/s/GPU, which is close to the 3370 tokens/s/GPU reported in the Llama paper.
+For the 7B model, the training speed of the PyTorch-native Llama implementation in the Transformers library is 1378 tokens/s/GPU. With our code, the training speed reaches 3290 tokens/s/GPU, which is close to the 3370 tokens/s/GPU reported in the [Llama paper](https://arxiv.org/pdf/2302.13971.pdf).
If we pretrain with 500 billion tokens, it will take 43,000 GPU hours. Assuming the price of an A100-80G Spot instance on Google Cloud is $12.6 per hour for 8 GPUs, the total cost will be $67,725.
Without acceleration, the cost would be $158,744, so our method reduces the training cost by $91,019 in total.

@@ -99,7 +99,7 @@ We use the Accelerate library for multi-GPU parallel training. Launch training with
```bash
accelerate launch --config_file configs/default_config.yaml pretrain_llama.py
```
-We use Wandb for training visualization, and you need to set the environment variable WANDB_API_KEY yourself.
+We use [Wandb](https://wandb.ai/) for training visualization, and you need to set the environment variable WANDB_API_KEY yourself.

We use DeepSpeed stage 1 to reduce GPU memory usage. Accelerate-related configurations can be found in configs/default_config.yaml.
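
The ZeRO stage is normally specified in that YAML config; equivalently, it can be requested programmatically through Accelerate's DeepSpeed plugin. A minimal sketch under that assumption (not the project's actual configuration):

```python
# Illustrative sketch: asking Accelerate for DeepSpeed ZeRO stage 1 in code
# instead of via configs/default_config.yaml (not the project's actual setup).
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(zero_stage=1)   # stage 1 shards only the optimizer states
accelerator = Accelerator(deepspeed_plugin=ds_plugin, mixed_precision="bf16")
```
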