update readme

LiangSong 2023-03-27 13:52:00 +08:00
parent 6dd4907629
commit 0f8acb8882
2 changed files with 8 additions and 8 deletions


@@ -27,10 +27,10 @@ Open-Llama is an open-source project that provides a complete training pipeline for building large language models
Since training large language models is costly, high performance is also crucial when building them. To achieve high-performance training, we use the following techniques:
- **Fused CUDA kernel**: the fused CUDA kernels provided by [xformers](https://github.com/facebookresearch/xformers) fuse multiple operations together, reducing data transfer between the GPU and CPU and improving training efficiency (a short sketch follows this section).
- **Parallel training**: we use the [Accelerate](https://huggingface.co/docs/accelerate/index) library to parallelize training across multiple GPUs and speed up training.
For the 7B model, training with the native PyTorch Llama implementation in Transformers runs at 1378 tokens/s/GPU, while this codebase reaches 3290 tokens/s/GPU, essentially matching the 3370 tokens/s/GPU reported in the [Llama paper](https://arxiv.org/pdf/2302.13971.pdf).
Pretraining on 500B tokens takes 43,000 GPU-hours. At the Google Cloud A100-80G Spot price of $12.6 per hour for 8 GPUs, the total cost comes to $67,725.
Training with the unaccelerated version would cost $158,744, so the acceleration ultimately saves roughly $90,000 in training cost.
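To make the fused-kernel point above concrete, here is a minimal sketch, assuming xformers is installed and a CUDA device is available, of calling xformers' `memory_efficient_attention` in place of a naive PyTorch attention. It is not code from this repository, and the tensor shapes (Llama-7B-like: 32 heads of dimension 128) are illustrative.
```python
# Minimal sketch: fused attention via xformers vs. a naive PyTorch reference.
import torch
import xformers.ops as xops

batch, seq_len, heads, head_dim = 2, 2048, 32, 128  # illustrative, Llama-7B-like sizes
q = torch.randn(batch, seq_len, heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# One fused CUDA kernel computes softmax(QK^T / sqrt(d)) V without
# materializing the full (seq_len x seq_len) attention matrix.
out = xops.memory_efficient_attention(q, k, v)

# Naive reference for comparison (what the unfused PyTorch path does).
q_, k_, v_ = (t.transpose(1, 2) for t in (q, k, v))   # -> (batch, heads, seq, dim)
attn = torch.softmax(q_ @ k_.transpose(-2, -1) / head_dim**0.5, dim=-1)
ref = (attn @ v_).transpose(1, 2)                      # -> (batch, seq, heads, dim)
```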
### Versatility
@@ -108,7 +108,7 @@ the computation of Self Attention, which gives a clear performance improvement of roughly 30%.
```bash
accelerate launch --config_file configs/default_config.yaml pretrain_llama.py
```
We use [Wandb](https://wandb.ai/) to visualize training; you need to set the WANDB_API_KEY environment variable yourself.
We use DeepSpeed stage 1 to reduce GPU memory usage. The Accelerate-related configuration can be found in configs/default_config.yaml.
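For orientation, below is a stripped-down skeleton of how an Accelerate-driven training loop is usually structured; it is not the repository's pretrain_llama.py, and the model, optimizer, and data are placeholders. When started with `accelerate launch --config_file configs/default_config.yaml`, one such process runs per GPU and the DeepSpeed stage 1 settings from that config are picked up automatically.
```python
# Hypothetical skeleton of an Accelerate training loop; the real
# pretrain_llama.py differs in its details (model, tokenized dataset, LR schedule, logging).
import torch
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(4096, 4096)                        # stand-in for the Llama model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(64, 4096), batch_size=8)

# prepare() wraps model/optimizer/dataloader for multi-GPU (and DeepSpeed) execution.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()                      # dummy loss
    accelerator.backward(loss)                             # replaces loss.backward()
    optimizer.step()
```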


@@ -24,11 +24,11 @@ We believe that ease of use is one of the most important features when building
### High Performance
Since training large language models is costly, high performance is also crucial when building large-scale language models. To achieve high-performance training, we employ the following techniques:
- **Fused CUDA kernel**: The fused CUDA kernels provided by [xformers](https://github.com/facebookresearch/xformers) combine multiple operations into a single kernel, reducing data transfer between the GPU and CPU and improving training efficiency.
- **Parallel training**: We use the [Accelerate](https://huggingface.co/docs/accelerate/index) library to support parallel training on multiple GPUs, accelerating the training process.
For the 7B model, the training speed of the PyTorch-native Llama implementation in the Transformers library is 1378 tokens/s/GPU. With our code, the training speed reaches 3290 tokens/s/GPU, which is close to the 3370 tokens/s/GPU reported in the [Llama paper](https://arxiv.org/pdf/2302.13971.pdf).
If we pretrain on 500 billion tokens, it will take 43,000 GPU-hours. Assuming the A100-80G Spot price on Google Cloud is $12.6 per hour for 8 GPUs, the total cost will be $67,725.
Without acceleration, the cost would be $158,744, so our method reduces the training cost by $91,019 in total.
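The cost figures above follow from simple arithmetic on the quoted numbers; the snippet below just spells it out (Spot prices fluctuate, so treat these as rough estimates).
```python
# Back-of-the-envelope check of the cost figures quoted above.
gpu_hours = 43_000           # GPU-hours to pretrain on 500B tokens
gpus_per_node = 8
node_price_per_hour = 12.6   # USD per hour for an 8x A100-80G Spot node on Google Cloud

accelerated_cost = gpu_hours / gpus_per_node * node_price_per_hour
print(accelerated_cost)                       # 67725.0 USD

unaccelerated_cost = 158_744                  # quoted above for the unaccelerated code path
print(unaccelerated_cost - accelerated_cost)  # 91019.0 USD saved
```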
@@ -99,7 +99,7 @@ We use the Accelerate library for multi-GPU parallel training. Launch training w
```bash
accelerate launch --config_file configs/default_config.yaml pretrain_llama.py
```
We use [Wandb](https://wandb.ai/) for training visualization; you need to set the WANDB_API_KEY environment variable yourself.
We use DeepSpeed stage 1 to reduce GPU memory usage. Accelerate-related configuration can be found in configs/default_config.yaml.
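As a small illustration of the WANDB_API_KEY note above, here is one hedged way to wire Wandb into a script; the project name and logged metric are made up, and in practice the key is usually just exported in the shell before running `accelerate launch`.
```python
# Illustrative only: providing WANDB_API_KEY and logging a metric with Wandb.
import os
import wandb

os.environ.setdefault("WANDB_API_KEY", "<your-api-key>")  # placeholder, not a real key

run = wandb.init(project="open-llama-pretrain")  # hypothetical project name
run.log({"train/loss": 2.5})                     # example metric
run.finish()
```
In a multi-GPU run one would normally guard this so that only the main process logs (for example by checking Accelerator.is_main_process).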