fix typo
This commit is contained in: parent 0fdca8b949 → commit 2fd13ff075
@@ -38,11 +38,11 @@ pip install git+https://github.com/s-JoL/transformers.git@dev
## **Update**
- This update mainly includes the following aspects, raising effective training speed by 50% over v1: padding drops from 30% to 5%, and training speed rises from 3200 token/s to 3600 token/s. 0.95*3600/(0.7*3200)=1.527
+ This update mainly includes the following aspects, raising effective training speed by 50% over v1: padding drops from 30% to 5%, and training speed rises from 3200 token/s to 3600 token/s. 0.95 * 3600/(0.7 * 3200)=1.527
1. Use HuggingFace's datasets library for data loading; the process is as follows (a pipeline sketch follows this hunk):
1. Use a transform function to unify data from different datasets into the format {'text': 'xxx'}
2. Tokenize with the Tokenizer
- 3. Sample long sequences; three modes are currently provided: truncation / sampling (see the Gopher paper) / splitting
+ 3. Sample long sequences; three modes are currently provided: truncation / sampling (see the [Gopher paper](https://arxiv.org/abs/2112.11446)) / splitting
4. Optional: concatenate texts from different docs. This reduces padding and speeds up training; padding accounted for 30% of tokens in v1 and drops to 5% after concatenation.
2. Add a Trainer that can be reused for both pre-training and instruction fine-tuning; see solver/trainer.py
3. Unify the training entry point for pre-training and instruction fine-tuning as train_lm.py
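The 1.527 figure in the hunk above follows from treating effective throughput as (1 − pad fraction) × raw token/s: v1 delivers 0.7 × 3200 = 2240 useful token/s, while this version delivers 0.95 × 3600 = 3420, a ratio of about 1.53, i.e. the quoted ~50% gain. A quick check:

```python
# Effective throughput = (1 - pad fraction) * raw tokens per second.
v1 = (1 - 0.30) * 3200   # 2240 useful token/s in v1
v2 = (1 - 0.05) * 3600   # 3420 useful token/s after this update
print(v2 / v1)           # 1.5267857... ~= 1.527, i.e. ~50% faster
```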
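A minimal sketch of the pipeline in step 1, assuming Hugging Face `datasets` and `transformers`. The data files, source field name, tokenizer, and `seq_len` below are illustrative assumptions, not the repo's actual configuration:

```python
import random

from datasets import load_dataset
from transformers import AutoTokenizer

seq_len = 2048
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

# 1.1 Unify every source dataset into {'text': 'xxx'}.
def transform(example):
    return {"text": example["content"]}  # source field name varies per dataset

raw = load_dataset("json", data_files="data/*.jsonl", split="train")
raw = raw.map(transform, remove_columns=raw.column_names)

# 1.2 Tokenize.
tokenized = raw.map(
    lambda ex: {"ids": tokenizer(ex["text"])["input_ids"]},
    remove_columns=["text"],
)

# 1.3 Handle over-long sequences: truncate / sample a window / split.
def shorten(ids, mode="sample"):
    if len(ids) <= seq_len:
        return [ids]
    if mode == "truncate":
        return [ids[:seq_len]]
    if mode == "sample":  # one random window, in the spirit of Gopher
        start = random.randint(0, len(ids) - seq_len)
        return [ids[start:start + seq_len]]
    return [ids[i:i + seq_len] for i in range(0, len(ids), seq_len)]  # split

# 1.4 Optional: pack sequences from different docs to cut padding.
def pack(sequences, eos_id):
    buf, out = [], []
    for ids in sequences:
        buf.extend(ids + [eos_id])          # eos separates packed docs
        while len(buf) >= seq_len:
            out.append(buf[:seq_len])       # full blocks need no padding
            buf = buf[seq_len:]
    return out
```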
@@ -43,7 +43,7 @@ This update mainly includes the following aspects, increasing the effective trai
1. Use HuggingFace's datasets library for data reading, with the process as follows:
1. Use the transform function to unify data formats from different datasets to {'text': 'xxx'}
2. Tokenize using the Tokenizer
- 3. Sample long sequences; currently, three modes are provided: truncation, sampling (refer to the Gopher paper), and splitting
+ 3. Sample long sequences; currently, three modes are provided: truncation, sampling (refer to the [Gopher paper](https://arxiv.org/abs/2112.11446)), and splitting
4. Optional: concatenate texts from different docs, reducing padding in the data and accelerating training. In the v1 version, padding accounted for 30%; after concatenation, padding is reduced to 5%.
2. Add a Trainer, which can be reused for both pre-training and instruction fine-tuning; see solver/trainer.py (a minimal sketch follows below)
3. Unify the pre-training and instruction fine-tuning training entry point as train_lm.py (a minimal sketch follows below)
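For reference, a minimal sketch of the kind of reusable trainer described in step 2, in the spirit of solver/trainer.py; the class name, constructor signature, and batch keys are assumptions, not the repo's actual API:

```python
import torch

class Trainer:
    """One training loop shared by pre-training and instruction fine-tuning;
    only the dataset/collator that feeds `train_loader` differs."""

    def __init__(self, model, optimizer, train_loader,
                 device="cuda" if torch.cuda.is_available() else "cpu"):
        self.model = model.to(device)
        self.optimizer = optimizer
        self.train_loader = train_loader
        self.device = device

    def train(self, epochs=1):
        self.model.train()
        for _ in range(epochs):
            for batch in self.train_loader:
                input_ids = batch["input_ids"].to(self.device)
                labels = batch["labels"].to(self.device)
                # HF causal-LM models return the loss when labels are passed.
                loss = self.model(input_ids=input_ids, labels=labels).loss
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()
```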
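And a hypothetical shape for the unified train_lm.py entry point from step 3; the flag names and config handling below are assumptions, not the script's actual interface:

```python
import argparse

def main():
    parser = argparse.ArgumentParser(description="unified LM training entry")
    parser.add_argument("--mode", choices=["pretrain", "instruct"],
                        default="pretrain",
                        help="pre-training or instruction fine-tuning")
    parser.add_argument("--config", required=True,
                        help="path to the training config file")
    args = parser.parse_args()
    # Both modes would build the same Trainer; only the data pipeline differs.
    print(f"mode={args.mode}, config={args.config}")

if __name__ == "__main__":
    main()
```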