diff --git a/README.md b/README.md
index c549624..482357e 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
  * @Author: LiangSong(sl12160010@gmail.com)
  * @Date: 2023-03-10 21:18:35
  * @LastEditors: LiangSong(sl12160010@gmail.com)
- * @LastEditTime: 2023-04-28 19:49:29
+ * @LastEditTime: 2023-04-28 19:52:27
  * @FilePath: /Open-Llama/README.md
  * @Description:
  *
@@ -37,14 +37,15 @@ pip install git+https://github.com/s-JoL/transformers.git@dev
 ![image4](assets/multiturn_chat.jpeg)

 ## **更新**
-[2023.4.28] Release v2.0
-本次更新主要包含以下几个方面,相对于v1版本提升有效训练速度50%,其中pad从30%减少至5%,训练速度从3200token/s提升至3600token/s。0.95 * 3600/(0.7 * 3200)=1.527
+**[2023.4.28] Release v2.0**
+
+本次更新主要包含以下几个方面,相对于v1版本提升有效训练速度**50%**,其中pad从**30%**减少至**5%**,训练速度从**3200token/s**提升至**3600token/s**。0.95 * 3600/(0.7 * 3200)=1.527
 1. 使用HuggingFace的datasets库进行数据读取,具体流程如下
     1. 使用transform函数将不同数据集的数据统一格式为{'text': 'xxx'}
     2. 使用Tokenizer进行分词
     3. 对长序列进行采样,目前提供三种模式,分别是:截断/采样(参考[Gopher论文](https://arxiv.org/abs/2112.11446))/切分
-    4. 可选:对来自不同doc的文本进行拼接。减少了数据中的pad,加速训练;在v1版本中pad占比为30%,使用拼接后pad占比降低为5%。
+    4. 可选:对来自不同doc的文本进行拼接。减少了数据中的pad,加速训练;在v1版本中pad占比为**30%**,使用拼接后pad占比降低为**5%**。
 2. 加入Trainer,对于预训练和指令微调都可以复用,见solver/trainer.py
 3. 统一预训练和指令微调训练入口为train_lm.py
 4. 提供更方便的配置,可见configs/pretrain_config.yaml
diff --git a/README_en.md b/README_en.md
index 10da777..6dc31b1 100644
--- a/README_en.md
+++ b/README_en.md
@@ -2,7 +2,7 @@
  * @Author: LiangSong(sl12160010@gmail.com)
  * @Date: 2023-03-10 21:18:35
  * @LastEditors: LiangSong(sl12160010@gmail.com)
- * @LastEditTime: 2023-04-28 19:49:24
+ * @LastEditTime: 2023-04-28 19:53:01
  * @FilePath: /Open-Llama/README_en.md
  * @Description:
  *
@@ -38,15 +38,15 @@ Below is a display of the model's multi-turn dialogue ability regarding code:

 ## **Updates**
-[2023.4.28] Release v2.0
+**[2023.4.28] Release v2.0**

-This update mainly includes the following aspects, increasing the effective training speed by 50% compared to the v1 version, reducing padding from 30% to 5%, and improving training speed from 3200 tokens/s to 3600 tokens/s. 0.95 * 3600 / (0.7 * 3200) = 1.527
+This update mainly includes the following aspects, increasing the effective training speed by **50%** compared to the v1 version, reducing padding from **30%** to **5%**, and improving training speed from **3200 tokens/s** to **3600 tokens/s**. 0.95 * 3600 / (0.7 * 3200) = 1.527
 1. Use HuggingFace's datasets library for data reading, with the process as follows:
     1. Use the transform function to unify data formats from different datasets to {'text': 'xxx'}
     2. Tokenize using Tokenizer
     3. Sample long sequences; currently, three modes are provided: truncation, sampling (refer to the [Gopher paper](https://arxiv.org/abs/2112.11446)), and splitting
-    4. Optional: concatenate texts from different docs, reducing padding in the data and accelerating training. In the v1 version, padding accounted for 30%; after concatenation, padding is reduced to 5%.
+    4. Optional: concatenate texts from different docs, reducing padding in the data and accelerating training. In the v1 version, padding accounted for **30%**; after concatenation, padding is reduced to **5%**.
 2. Add Trainer, which can be reused for both pre-training and instruction fine-tuning, see solver/trainer.py
 3. Unify the pre-training and instruction fine-tuning training entry to train_lm.py
 4. Provide more convenient configuration, see configs/pretrain_config.yaml
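The effective-speed figure quoted in both hunks is easier to follow if the two constants are read as the non-pad fractions of a batch (1 − 30% = 0.70 for v1, 1 − 5% = 0.95 for v2), so the ratio is effective tokens per second in v2 over v1:

$$
\frac{0.95 \times 3600}{0.70 \times 3200} = \frac{3420}{2240} \approx 1.53
$$

which matches the roughly 50% effective-speed improvement claimed.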
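A minimal sketch of the data flow described in item 1 of the update notes: unify every source dataset to {'text': 'xxx'}, tokenize, then optionally concatenate token ids from different docs into fixed-length blocks so batches need almost no padding. This is not the repository's actual code; the dataset name, tokenizer, and sequence length below are placeholders chosen only for illustration.

```python
# Illustrative only: "wikitext", "gpt2", and seq_len are placeholders,
# not the data or settings used by Open-Llama.
from datasets import load_dataset
from transformers import AutoTokenizer

seq_len = 2048
tokenizer = AutoTokenizer.from_pretrained("gpt2")

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def transform(example):
    # Step 1.1: map whatever fields a source dataset has into {'text': 'xxx'}.
    return {"text": example["text"]}

def tokenize(batch):
    # Step 1.2: tokenize the unified text field.
    return {"input_ids": tokenizer(batch["text"])["input_ids"]}

def pack(batch):
    # Step 1.4 (optional): concatenate token ids from different docs, separated
    # by EOS, then slice the stream into seq_len blocks so little padding remains.
    stream = []
    for ids in batch["input_ids"]:
        stream.extend(ids + [tokenizer.eos_token_id])
    usable = len(stream) // seq_len * seq_len
    return {"input_ids": [stream[i : i + seq_len] for i in range(0, usable, seq_len)]}

packed = (
    raw.map(transform, remove_columns=raw.column_names)
    .map(tokenize, batched=True, remove_columns=["text"])
    .map(pack, batched=True)
)
```

Packing is what moves the pad share from roughly 30% to 5%: every emitted block is exactly seq_len tokens, so padding is only ever needed for the final partial block.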
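The three long-sequence modes named in step 1.3 (truncation, sampling, splitting) can be summarized with a small helper like the hypothetical one below; the "sample" branch only approximates the random-subsequence idea referenced from the Gopher paper.

```python
import random

def fit_to_context(ids: list[int], seq_len: int, mode: str = "truncate") -> list[list[int]]:
    """Reduce one tokenized document to sequences of at most seq_len tokens."""
    if len(ids) <= seq_len:
        return [ids]
    if mode == "truncate":
        # Keep only the first seq_len tokens and drop the rest.
        return [ids[:seq_len]]
    if mode == "sample":
        # Keep a single randomly positioned window of seq_len tokens.
        start = random.randint(0, len(ids) - seq_len)
        return [ids[start : start + seq_len]]
    if mode == "split":
        # Keep everything by cutting the document into consecutive seq_len chunks.
        return [ids[i : i + seq_len] for i in range(0, len(ids), seq_len)]
    raise ValueError(f"unknown mode: {mode!r}")
```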