update readme
parent 6dd4907629 · commit 0f8acb8882

@@ -27,10 +27,10 @@ Open-Llama is an open-source project that provides a complete set of tools for building large language models

Since training large language models is costly, high performance is also very important when building them. To achieve high-performance training, we use the following techniques:

-- **Fused CUDA kernel**: Using the fused CUDA kernels provided by xformers fuses multiple operations together, reducing data transfer between the GPU and CPU and improving training efficiency.
+- **Fused CUDA kernel**: Using the fused CUDA kernels provided by [xformers](https://github.com/facebookresearch/xformers) fuses multiple operations together, reducing data transfer between the GPU and CPU and improving training efficiency.
-- **Parallel training**: We use the Accelerate library to support parallel training on multiple GPUs to speed up training.
+- **Parallel training**: We use the [Accelerate](https://huggingface.co/docs/accelerate/index) library to support parallel training on multiple GPUs to speed up training.
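
To make the fused-kernel point concrete, here is a minimal sketch of swapping a naive PyTorch attention computation for xformers' fused memory-efficient attention. It is an illustrative example only, with made-up shapes and a toy reference implementation, not the attention code used in this repository.

```python
# Minimal sketch (not this repository's model code): replacing naive PyTorch
# attention with the fused, memory-efficient kernel from xformers.
import torch
import xformers.ops as xops

B, S, H, D = 2, 512, 8, 64          # batch, sequence length, heads, head dim
q = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)

def naive_attention(q, k, v):
    # Several separate kernels and a full S x S attention matrix in memory.
    qh, kh, vh = (t.transpose(1, 2) for t in (q, k, v))   # (B, H, S, D)
    scores = qh @ kh.transpose(-2, -1) / D ** 0.5
    return (scores.softmax(dim=-1) @ vh).transpose(1, 2)

ref = naive_attention(q, k, v)                      # same result up to numerical precision
out = xops.memory_efficient_attention(q, k, v)      # one fused call, no S x S matrix in memory
```

A causal mask can be passed through the `attn_bias` argument of the same call when needed.
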
-For the 7B model, the training speed with the PyTorch-native Llama implementation in Transformers is 1378 tokens/s/GPU, while this codebase reaches 3290 tokens/s/GPU, essentially matching the 3370 tokens/s/GPU reported in the Llama paper.
+For the 7B model, the training speed with the PyTorch-native Llama implementation in Transformers is 1378 tokens/s/GPU, while this codebase reaches 3290 tokens/s/GPU, essentially matching the 3370 tokens/s/GPU reported in the [Llama paper](https://arxiv.org/pdf/2302.13971.pdf).
Pretraining on 500B tokens takes about 43,000 GPU hours. At Google Cloud's A100-80G Spot price of $12.6 per hour for 8 GPUs, the total cost is $67,725.
Training with the unaccelerated version would cost $158,744, so the acceleration saves roughly $90,000 in training cost.
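
As a back-of-the-envelope check on these figures, the arithmetic can be reproduced as follows (illustrative only; the speeds and the $12.6/hour price are the numbers quoted above, everything else is an assumption):

```python
# Rough reconstruction of the cost arithmetic quoted above (illustration only).
tokens = 500e9                      # 500B-token pretraining budget
price_per_8gpu_hour = 12.6          # Google Cloud A100-80G Spot, 8 GPUs, USD/hour

def cost(tokens_per_s_per_gpu: float) -> float:
    gpu_hours = tokens / tokens_per_s_per_gpu / 3600
    return gpu_hours / 8 * price_per_8gpu_hour

print(cost(3290))   # accelerated:   ~$66,500 (about $67,725 with GPU hours rounded up to 43,000)
print(cost(1378))   # unaccelerated: ~$158,700
```
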

### Versatility

@@ -108,7 +108,7 @@ the Self Attention computation, which gives a clear performance improvement of about 30%.
```bash
accelerate launch --config_file configs/default_config.yaml pretrain_llama.py
```
-We use Wandb to visualize training; you need to set the environment variable WANDB_API_KEY yourself.
+We use [Wandb](https://wandb.ai/) to visualize training; you need to set the environment variable WANDB_API_KEY yourself.
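
For reference, a minimal sketch of how that key ends up being used by Wandb at run time (placeholder key and a hypothetical project name; not this repository's actual logging code):

```python
# Minimal sketch of Wandb logging driven by the WANDB_API_KEY environment
# variable mentioned above (placeholder values, not the project's real code).
import os
import wandb

os.environ.setdefault("WANDB_API_KEY", "<your-api-key>")   # or export it in the shell

run = wandb.init(project="open-llama-pretrain")             # hypothetical project name
run.log({"train/loss": 2.71, "tokens_per_s_per_gpu": 3290})
run.finish()
```
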
We use DeepSpeed stage 1 here to reduce GPU memory usage. The Accelerate-related configuration can be found in configs/default_config.yaml.

@@ -24,11 +24,11 @@ We believe that ease of use is one of the most important features when building
### High Performance

Since training large language models is costly, high performance is also crucial when building large-scale language models. To achieve high-performance training, we employ the following techniques:

-- **Fused CUDA kernel**: Using fused CUDA kernels provided by xformers can fuse multiple operations together, reducing data transfer between the GPU and CPU and improving training efficiency.
+- **Fused CUDA kernel**: Using fused CUDA kernels provided by [xformers](https://github.com/facebookresearch/xformers) can fuse multiple operations together, reducing data transfer between the GPU and CPU and improving training efficiency.
-- **Parallel training**: We use the Accelerate library to support parallel training on multiple GPUs, accelerating the training process.
+- **Parallel training**: We use the [Accelerate](https://huggingface.co/docs/accelerate/index) library to support parallel training on multiple GPUs, accelerating the training process.
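
As an illustration of what the Accelerate library does for multi-GPU training, here is a minimal generic training-loop sketch; it is not the actual pretrain_llama.py code, and the tiny model and random data are placeholders.

```python
# Generic multi-GPU training loop with Accelerate (illustrative sketch only,
# not the actual pretrain_llama.py training loop).
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()                        # picks up the launch configuration
model = torch.nn.Linear(128, 1)                    # placeholder for the Llama model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=32)

# prepare() moves everything to the right device(s) and wraps the model and
# dataloader for distributed execution.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)
    accelerator.backward(loss)                     # handles gradient synchronization
    optimizer.step()
```

Launched with `accelerate launch`, the same script runs unchanged on a single GPU or across several.
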
-For the 7B model, the training speed of the PyTorch-native Llama implementation in the Transformers library is 1378 tokens/s/GPU. With our code, the training speed reaches 3290 tokens/s/GPU, which is close to the 3370 tokens/s/GPU reported in the Llama paper.
+For the 7B model, the training speed of the PyTorch-native Llama implementation in the Transformers library is 1378 tokens/s/GPU. With our code, the training speed reaches 3290 tokens/s/GPU, which is close to the 3370 tokens/s/GPU reported in the [Llama paper](https://arxiv.org/pdf/2302.13971.pdf).
If we pretrain with 500 billion tokens, it will take 43,000 GPU hours. Assuming the price of an A100-80G Spot instance on Google Cloud is $12.6 per hour for 8 GPUs, the total cost will be $67,725.
Without acceleration, the cost would be $158,744, so our method reduces the training cost by $91,019 in total.

@@ -99,7 +99,7 @@ We use the Accelerate library for multi-GPU parallel training. Launch training with
```bash
accelerate launch --config_file configs/default_config.yaml pretrain_llama.py
```
-We use Wandb for training visualization, and you need to set the environment variable WANDB_API_KEY yourself.
+We use [Wandb](https://wandb.ai/) for training visualization, and you need to set the environment variable WANDB_API_KEY yourself.

We use DeepSpeed stage 1 to reduce GPU memory usage. Accelerate-related configurations can be found in configs/default_config.yaml.
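
The ZeRO stage is normally specified in that YAML config; equivalently, it can be requested programmatically through Accelerate's DeepSpeed plugin. A minimal sketch under that assumption (not the project's actual configuration):

```python
# Illustrative sketch: asking Accelerate for DeepSpeed ZeRO stage 1 in code
# instead of via configs/default_config.yaml (not the project's actual setup).
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(zero_stage=1)   # stage 1 shards only the optimizer states
accelerator = Accelerator(deepspeed_plugin=ds_plugin, mixed_precision="bf16")
```
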