update readme
commit 2df3e622e9 (parent ec2b4d6ee7)

README.md
@@ -59,9 +59,24 @@ Below is a display of the model's multi-turn dialogue ability regarding code:
## **Updates**
**[2023.5.8] Release v2.1**

This update adds support for training larger models. Using DeepSpeed stage 3 + offload + activation checkpointing, you can train a 65B model on a **single machine with 8 A100-80G GPUs**.
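As a rough illustration of that setup (not the repository's actual training script; the checkpoint path, batch size, and learning rate below are placeholders), a ZeRO stage 3 configuration with CPU offload plus activation checkpointing might be wired up like this:

```python
# Illustrative DeepSpeed ZeRO stage 3 + CPU offload + activation checkpointing setup.
# Values are placeholders, not the exact configuration behind the numbers below.
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                                                # shard params, grads, optimizer states
        "offload_param": {"device": "cpu", "pin_memory": True},    # push parameters to CPU memory
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},        # placeholder learning rate
    "train_micro_batch_size_per_gpu": 12,
    "gradient_accumulation_steps": 1,
}

model = AutoModelForCausalLM.from_pretrained("path/to/llama-65b")  # placeholder checkpoint
model.gradient_checkpointing_enable()  # activation checkpointing: recompute activations in the backward pass

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```

Moving parameters and optimizer states to CPU is what trades GPU memory for host memory, which is consistent with the much larger CPU memory figure (440G) reported for the 65B run in the table below.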
The following table compares the training speed of Open-Llama with the original Llama; the Llama performance figures are quoted from the original Llama paper.

| Model          | DeepSpeed Stage | Offload | Activation Checkpoint | Total Tokens | GPU hours | Speed (tokens/s/GPU) | Batch Size | CPU Memory |
|----------------|-----------------|---------|-----------------------|--------------|-----------|----------------------|------------|------------|
| Open-Llama 7B  | 1               | False   | False                 | 173.7B       | 13412     | 3587                 | 2          | 94G        |
| Open-Llama 13B | 3               | False   | True                  | -            | -         | 1616                 | 24         | 100G       |
| Open-Llama 33B | 3               | False   | True                  | -            | -         | 708                  | 12         | 100G       |
| Open-Llama 65B | 3               | True    | True                  | -            | -         | 369                  | 12         | 440G       |
| Llama 7B       | -               | -       | -                     | 1T           | 82432     | 3370                 | -          | -          |
| Llama 13B      | -               | -       | -                     | 1T           | 135168    | 2055                 | -          | -          |
| Llama 33B      | -               | -       | -                     | 1.4T         | 530432    | 733                  | -          | -          |
| Llama 65B      | -               | -       | -                     | 1.4T         | 1022362   | 380                  | -          | -          |
**[2023.4.28] Release v2.0**

- This update mainly includes the following aspects, increasing the effective training speed by **50%** compared to the v1 version, reducing padding from **30%** to **5%**, and improving training speed from **3200 tokens/s** to **3600 tokens/s**. 0.95 * 3600 / (0.7 * 3200) = 1.527
+ This update mainly includes the following aspects, increasing the effective training speed by **50%** compared to the v1 version, reducing padding from **30%** to **5%**, and improving training speed from **3200 tokens/s** to **3587 tokens/s**. 0.95 * 3587 / (0.7 * 3200) = 1.521
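Spelling out that effective-throughput calculation (0.95 and 0.7 are the non-padding fractions for v2 and v1):

```python
# Effective throughput = (1 - padding fraction) * raw throughput in tokens/s/GPU.
v1_effective = (1 - 0.30) * 3200    # v1: 30% padding at 3200 tokens/s -> 2240
v2_effective = (1 - 0.05) * 3587    # v2:  5% padding at 3587 tokens/s -> ~3408
print(v2_effective / v1_effective)  # ~1.52, i.e. roughly the quoted 50% speedup
```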
1. Use HuggingFace's datasets library for data reading, with the process as follows (see the sketch after this list):
   1. Use the transform function to unify data formats from different datasets to {'text': 'xxx'}
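A minimal sketch of step 1.1, assuming the datasets library's `load_dataset` and `map`; the data path and the source field name are placeholders rather than the repository's actual preprocessing code:

```python
# Sketch: unify records from heterogeneous datasets into the {'text': 'xxx'} format.
from datasets import load_dataset

def transform(example):
    # Placeholder: pick whichever field holds the raw text in the source dataset.
    return {"text": example["content"]}

raw = load_dataset("json", data_files="data/*.jsonl", split="train")  # placeholder path
unified = raw.map(transform, remove_columns=raw.column_names)         # keep only the 'text' column
print(unified[0])  # {'text': '...'}
```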
README_zh.md
@@ -60,9 +60,24 @@ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
## **Updates**

**[2023.5.8] Release v2.1**

This update adds support for training larger models. Using DeepSpeed stage 3 + offload + activation checkpointing, you can **train a 65B model on a single machine with 8 A100-80G GPUs**.
The following table compares the training speed of Open-Llama with the original Llama; the Llama performance figures are quoted from the original Llama paper.

| Model          | DeepSpeed Stage | Offload | Activation Checkpoint | Total Tokens | GPU hours | Speed (tokens/s/GPU) | Batch Size | CPU Memory |
|----------------|-----------------|---------|-----------------------|--------------|-----------|----------------------|------------|------------|
| Open-Llama 7B  | 1               | False   | False                 | 173.7B       | 13412     | 3587                 | 2          | 94G        |
| Open-Llama 13B | 3               | False   | True                  | -            | -         | 1616                 | 24         | 100G       |
| Open-Llama 33B | 3               | False   | True                  | -            | -         | 708                  | 12         | 100G       |
| Open-Llama 65B | 3               | True    | True                  | -            | -         | 369                  | 12         | 440G       |
| Llama 7B       | -               | -       | -                     | 1T           | 82432     | 3370                 | -          | -          |
| Llama 13B      | -               | -       | -                     | 1T           | 135168    | 2055                 | -          | -          |
| Llama 33B      | -               | -       | -                     | 1.4T         | 530432    | 733                  | -          | -          |
| Llama 65B      | -               | -       | -                     | 1.4T         | 1022362   | 380                  | -          | -          |
**[2023.4.28] Release v2.0**

- This update mainly includes the following aspects, increasing the effective training speed by **50%** compared to the v1 version, reducing padding from **30%** to **5%**, and improving training speed from **3200 tokens/s** to **3600 tokens/s**. 0.95 * 3600 / (0.7 * 3200) = 1.527
+ This update mainly includes the following aspects, increasing the effective training speed by **50%** compared to the v1 version, reducing padding from **30%** to **5%**, and improving training speed from **3200 tokens/s** to **3587 tokens/s**. 0.95 * 3587 / (0.7 * 3200) = 1.521
1. Use HuggingFace's datasets library for data reading, with the process as follows:
   1. Use the transform function to unify data formats from different datasets to {'text': 'xxx'}
   2. Use the Tokenizer for tokenization (see the sketch after this list)
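A sketch of step 2 under the same assumptions, batching `datasets.map` over a Hugging Face tokenizer; the tokenizer path, the stand-in record, and the 2048-token length are placeholders:

```python
# Sketch: batch-tokenize the unified {'text': ...} records with a Hugging Face tokenizer.
from datasets import Dataset
from transformers import AutoTokenizer

unified = Dataset.from_list([{"text": "def add(a, b):\n    return a + b"}])  # stand-in for the unified dataset
tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer")               # placeholder tokenizer path

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)        # assumed context length

tokenized = unified.map(tokenize, batched=True, remove_columns=["text"])     # yields input_ids / attention_mask
```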