update readme

This commit is contained in:
LiangSong 2023-04-28 15:01:01 +08:00
parent 49118aad42
commit 0fdca8b949
10 changed files with 330 additions and 224 deletions

README.md

@ -10,30 +10,45 @@
-->
# Open-Llama
[English README](https://github.com/Bayes-Song/Open-Llama/blob/main/README_en.md)
Open-Llama is an open-source project that provides a complete set of training pipelines for building large language models, from dataset preparation to tokenization, pre-training, instruction tuning, and the reinforcement learning technique RLHF.
## Progress
**You can try the model directly from the [Demo](http://home.ustc.edu.cn/~sl9292/).**
## **Performance**
**Evaluated with the same methodology as the FastChat project and compared against GPT-3.5, Open-Llama reaches 84% of GPT-3.5's performance on Chinese questions.**
**Training speed reaches 3620 tokens/s, faster than the 3370 tokens/s reported in the original Llama paper, which is state of the art at present.**
The checkpoint after instruction tuning has been open-sourced at [s-JoL/Open-Llama-V1](https://huggingface.co/s-JoL/Open-Llama-V1). Using this checkpoint requires first installing the latest Transformers from the branch below:
```bash
pip install git+https://github.com/s-JoL/transformers.git@dev
```
The pre-training-only checkpoint has also been uploaded to [s-JoL/Open-Llama-V1-pretrain](https://huggingface.co/s-JoL/Open-Llama-V1-pretrain).
A [PR](https://github.com/huggingface/transformers/pull/22795) has been submitted to merge the model into the Transformers main branch.
We completed pre-training on 300B tokens, 80K steps in total, with a global batch size of 4M, the same as Llama.
The instruction-tuning data is built from 7 datasets; the resulting model shows some programming, math, and multi-turn dialogue ability. See the Instruction-Tuning section for details.
Following some of the published tests of Baidu's Wenxin Yiyan, we also ran a simple test of our model; the original report is [百度“文心一言”测试:国内生成式 AI 什么水平?](https://www.8btc.com/article/6809666) ("Wenxin Yiyan" test: how good is domestic generative AI?).
The results are shown in the figures below; more results are still being tested. Due to network issues in mainland China, requests to the [Demo](http://home.ustc.edu.cn/~sl9292/) above may be dropped; if there is no response for a long time, refresh and retry.
![image1](assets/image1.png)![image2](assets/image2.png)![image3](assets/image3.png)
Below is a demonstration of the model's multi-turn dialogue ability on code:
![image4](assets/multiturn_chat.jpeg)
We roughly estimate the cost of reaching the results above: pre-training ran for 40K steps over 150 million samples (about 110B tokens) and took 76 hours, roughly 19,152 USD at Google Cloud's A100 list price; instruction tuning ran for 12K steps over 1.6 million samples and took 3.4 hours, about 342 USD. Training such a model from scratch therefore costs less than 20,000 USD in total.
The model is still clearly weak at math and code. This is partly due to the training data and, in my view, partly due to the model size; since logical reasoning is essential for a usable model, future updates will focus on improving these abilities.
## **Updates**
This update mainly covers the following points; compared with v1 it improves effective training speed by 50%: the share of pad tokens drops from 30% to 5%, and raw throughput rises from 3200 tokens/s to 3600 tokens/s (0.95 * 3600 / (0.7 * 3200) = 1.527).
1. Data is read with HuggingFace's datasets library; the pipeline is as follows:
    1. Use a transform function to unify the different datasets into the format {'text': 'xxx'}
    2. Tokenize with the Tokenizer
    3. Sample long sequences; three modes are provided: truncation / sampling (following the Gopher paper) / splitting
    4. Optionally concatenate texts from different documents, which reduces padding and speeds up training; padding drops from 30% of tokens in v1 to 5%
2. Added a Trainer that is shared by pre-training and instruction fine-tuning; see solver/trainer.py
3. Unified the entry point for pre-training and instruction fine-tuning as train_lm.py
4. Provided more convenient configuration; see configs/pretrain_config.yaml
5. Added support for extending the vocabulary from another pre-trained model and continuing pre-training from it
## **Features**
### Ease of Use
@ -50,9 +65,11 @@
- **Fused CUDA kernels**: the fused CUDA kernels provided by [xformers](https://github.com/facebookresearch/xformers) combine multiple operations, reducing data transfer between GPU and CPU and improving training efficiency.
- **Parallelized training**: we use the [Accelerate](https://huggingface.co/docs/accelerate/index) library to parallelize training across multiple GPUs and speed up training.
For the 7B model, training with the native PyTorch Llama implementation in Transformers runs at **1378 tokens/s/GPU**; with this codebase it reaches **3626 tokens/s/GPU**, exceeding the **3370 tokens/s/GPU** reported in the [original Llama paper](https://arxiv.org/pdf/2302.13971.pdf).
Pre-training on 500B tokens would therefore take about 38,300 GPU-hours. At Google Cloud's A100-80G Spot price of 12.6 USD per hour for an 8-GPU machine, the total comes to about 60,300 USD.
With the unaccelerated version the cost would be 158,744 USD, so the optimizations cut the training cost by roughly 98,000 USD.
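As a sanity check, these figures follow directly from the quoted throughput and GPU price; the short sketch below simply redoes that arithmetic (the inputs are the numbers quoted above, not independent measurements).
```python
# Reproduce the cost estimate above from throughput and the quoted
# Google Cloud A100-80G Spot price (12.6 USD per hour for an 8-GPU machine).
TOKENS = 500e9
PRICE_PER_8GPU_HOUR = 12.6  # USD

def pretrain_cost(tokens_per_second_per_gpu):
    gpu_hours = TOKENS / tokens_per_second_per_gpu / 3600
    usd = gpu_hours / 8 * PRICE_PER_8GPU_HOUR
    return gpu_hours, usd

for name, speed in [("accelerated", 3626), ("baseline", 1378)]:
    gpu_hours, usd = pretrain_cost(speed)
    print(f"{name}: {gpu_hours:,.0f} GPU-hours, ~{usd:,.0f} USD")
# accelerated: roughly 38,300 GPU-hours and 60,300 USD
# baseline: roughly 100,800 GPU-hours and 158,700 USD
```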
More benchmarks can be found in the [comparison with other open-source models](https://github.com/Bayes-Song/Open-Llama#%E5%92%8C%E5%85%B6%E4%BB%96%E5%BC%80%E6%BA%90%E6%A8%A1%E5%9E%8B%E6%80%A7%E8%83%BD%E5%AF%B9%E6%AF%94).
### Versatility
@ -60,6 +77,7 @@
- **Multi-language support**: we support corpora in multiple languages, including English, Chinese, Japanese, and others, so users can choose according to their needs.
- **Domain versatility**: we want the model to help not only with everyday questions but also in professional domains such as science and law.
- **Interaction with the world**: by adding RL we hope to give the model the ability to interact with the world.
## **Requirements**
@ -67,7 +85,7 @@
- PyTorch 1.13
- A customized version of the [Transformers library](https://github.com/Bayes-Song/transformers)
- The [Accelerate library](https://huggingface.co/docs/accelerate/index)
- CUDA 11.6 or higher (for GPU acceleration)
## **Getting Started**
### Installation
@ -83,6 +101,8 @@ pip install -r requirements.txt
We currently provide the WuDao dataset open-sourced by BAAI and the Pile dataset open-sourced by EleutherAI. The download and processing scripts are in the data directory.
Because downloading the WuDao dataset requires agreeing to some terms, you may need to modify the download link in download_wudao; see [WuDao](https://data.baai.ac.cn/details/WuDaoCorporaText).
**Note that downloads may fail. It is recommended to split the download and processing steps in the script into two runs; rerunning the download several times will resume automatically from breakpoints.**
Run the following commands to download the data and split it into shards:
```bash
bash data/download_the_pile.sh
@ -102,23 +122,45 @@ The Pile
```
Data integrity can be verified in this [issue](https://github.com/s-JoL/Open-Llama/issues/5).
### Related Tools
The utils directory provides code for training a tokenizer, extending an existing tokenizer model, and converting checkpoints.
To train a tokenizer with SentencePiece, refer to the following command:
```bash
python3 utils/train_tokenizer.py
```
A tokenizer model with a 40k vocabulary trained only on the WuDao dataset, 4w_cn_vocab_wudao15.model, is provided in configs.
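For reference, a minimal sketch of what this kind of SentencePiece training looks like; the exact options used in utils/train_tokenizer.py may differ, and the corpus path here is an assumption.
```python
import sentencepiece as spm

# Train a 40k BPE vocabulary on plain-text lines extracted from WuDao.
# "wudao_corpus.txt" and the options below are illustrative assumptions;
# see utils/train_tokenizer.py for the options actually used.
spm.SentencePieceTrainer.train(
    input="wudao_corpus.txt",          # one document per line
    model_prefix="4w_cn_vocab_wudao15",
    vocab_size=40000,
    model_type="bpe",
    character_coverage=0.9995,         # keep most CJK characters
)

# Quick check of the resulting model.
sp = spm.SentencePieceProcessor(model_file="4w_cn_vocab_wudao15.model")
print(sp.encode("你好,世界", out_type=str))
```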
To extend the vocabulary based on an existing tokenizer model, refer to:
```bash
python3 utils/merge_tokenizer.py
```
Merging META's official tokenizer model with the 40k Chinese tokenizer above yields the bilingual Chinese-English tokenizer model llama_tokenizer_extended.model.
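As an illustration of how such a merge is commonly done with the SentencePiece protobuf API; the actual logic lives in utils/merge_tokenizer.py and may differ, and the file paths below are assumptions.
```python
from sentencepiece import sentencepiece_model_pb2 as sp_model

# Load META's Llama tokenizer model and the 40k Chinese model (paths are assumptions).
llama = sp_model.ModelProto()
llama.ParseFromString(open("data/llama_raw_ckpt/tokenizer.model", "rb").read())
chinese = sp_model.ModelProto()
chinese.ParseFromString(open("configs/4w_cn_vocab_wudao15.model", "rb").read())

# Append Chinese pieces that the Llama vocabulary does not already contain.
existing = {p.piece for p in llama.pieces}
for p in chinese.pieces:
    if p.piece not in existing:
        new_piece = sp_model.ModelProto.SentencePiece()
        new_piece.piece, new_piece.score = p.piece, 0.0
        llama.pieces.append(new_piece)

with open("llama_tokenizer_extended.model", "wb") as f:
    f.write(llama.SerializeToString())
print("merged vocab size:", len(llama.pieces))  # e.g. 68762 in the config above
```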
To convert an existing Llama model checkpoint, refer to:
```bash
python3 utils/convert_ckpt.py
```
### Data Loading
The data loading code is in dataset/dataset.py and covers both pre-training and instruction fine-tuning data; to add another dataset, only the transform function needs to be modified.
The data loading pipeline is as follows (a minimal sketch is shown after the list):
1. Use a transform function to unify the different datasets into the format {'text': 'xxx'}
2. Tokenize with the Tokenizer
3. Sample long sequences; three modes are provided: truncation / sampling (following the Gopher paper) / splitting
4. Optionally concatenate texts from different documents, which reduces padding and speeds up training; padding drops from 30% of tokens in v1 to 5%
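A condensed sketch of this pipeline built on the datasets library; it mirrors the four steps above but is not the exact code in dataset/dataset.py (the tokenizer path, sequence length, and file pattern are assumptions).
```python
from datasets import load_dataset
from transformers import LlamaTokenizer

SEQ_LEN = 2048  # assumption; see configs/pretrain_config.yaml for the real value
tokenizer = LlamaTokenizer("configs/llama_tokenizer_extended.model")  # assumed path

def pretrain_transform(batch):
    # Step 1: unify every dataset to {'text': ...} (WuDao stores text in 'content').
    if "content" in batch:
        batch["text"] = batch["content"]
    return batch

dataset = load_dataset(
    "json",
    data_files="data/pretrain_data/part-*.jsonl.zst",
    split="train",
    streaming=True,
)
dataset = dataset.map(pretrain_transform, batched=True, batch_size=1)
dataset = dataset.select_columns("text")

def tokenize_and_concat(batch):
    # Steps 2-4: tokenize, then pack documents back to back so that almost no
    # padding is needed (truncation / sampling / splitting are the other modes).
    ids = []
    for text in batch["text"]:
        ids.extend(tokenizer(text).input_ids + [tokenizer.eos_token_id])
    chunks = [ids[i : i + SEQ_LEN] for i in range(0, len(ids) - SEQ_LEN + 1, SEQ_LEN)]
    return {"input_ids": chunks}

dataset = dataset.map(
    tokenize_and_concat, batched=True, batch_size=16, remove_columns=["text"]
)
```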
Use the following command to inspect the DataLoader output and check that the tokenization is correct:
```bash
python3 dataset/dataset.py
```
### Model Structure
We modified the [Llama](https://github.com/facebookresearch/llama) implementation in the Transformers library following section 2.4, Efficient implementation, of the original paper, and introduced additional optimizations from other papers. Specifically, we use the memory_efficient_attention operation from META's open-source [xformers library](https://github.com/facebookresearch/xformers) for the self-attention computation, which yields a clear performance gain of about 30%.
See [modeling_llama.py](https://github.com/s-JoL/transformers/blob/dev/src/transformers/models/open_llama/modeling_open_llama.py#L230) for details.
Following [Bloom](https://huggingface.co/bigscience/bloom), we also apply a Stable Embedding to the token embedding to stabilize training.
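To illustrate the attention optimization (this is not the exact code in the modified modeling file), swapping a vanilla attention for xformers' memory-efficient kernel looks roughly like the sketch below; tensors follow xformers' [batch, seq, heads, head_dim] layout and the sizes are 7B-style assumptions.
```python
import torch
import xformers.ops as xops

B, S, H, D = 1, 2048, 32, 128  # batch, sequence length, heads, head dim
q = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)

# Fused, memory-efficient causal self-attention: replaces the explicit
# softmax(QK^T / sqrt(D)) V computation without materializing the S x S matrix.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
print(out.shape)  # torch.Size([1, 2048, 32, 128])
```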
@ -127,7 +169,7 @@
### Pre-training
We run multi-GPU parallel training with the Accelerate library; the launch command is:
```bash
accelerate launch --config_file configs/default_config.yaml train_lm.py --config configs/pretrain_config.yaml
```
In some cases you may need to specify the following parameters:
```
@ -141,36 +183,38 @@ accelerate launch --config_file configs/default_config.yaml pretrain_llama.py
We use DeepSpeed stage 1 to reduce GPU memory usage; the Accelerate-related configuration is in configs/default_config.yaml.
Training hyperparameters are in configs/pretrain_config.yaml.
By default the tokenizer is LlamaTokenizer extended with a 40k Chinese vocabulary and the model size is 7B; the configuration is as follows:
| max_length | batch_size | learning_rate | weight_decay | params | dimension | n heads | n layer | vocab_size |
|------------|------------------|---------------|--------------|--------|-----------|---------|---------|------------|
| 2048 | 2 | 2e-4 | 1e-1 | 7.03B | 4096 | 32 | 32 | 68762 |
```
==============================================================================================================
Layer (type:depth-idx) Output Shape Param #
==============================================================================================================
OpenLlamaForCausalLM [1, 32, 64, 128] --
├─OpenLlamaModel: 1-1 [1, 32, 64, 128] --
│ └─Embedding: 2-1 [1, 64, 4096] 281,649,152
│ └─ModuleList: 2-2 -- --
│ └─OpenLlamaDecoderLayer: 3x32 [1, 64, 4096] 202,383,360
└─OpenLlamaRMSNorm: 2-3 [1, 64, 4096] 4,096
├─Linear: 1-2 [1, 64, 68762] 281,649,152
==============================================================================================================
Total params: 7,039,569,920
Trainable params: 7,039,569,920
Non-trainable params: 0
Total mult-adds (G): 7.04
```
The loss for pre-training from scratch is shown below:
![](assets/pretrain_loss.png)
### Instruction-Tuning
We use 7 currently available open-source datasets for instruction tuning; more tasks and datasets of our own will be added later.
- [yizhongw/self_instruct](https://huggingface.co/datasets/yizhongw/self_instruct)
- [BelleGroup/train_0.5M_CN](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN)
- [BelleGroup/train_1M_CN](https://huggingface.co/datasets/BelleGroup/train_1M_CN)
@ -185,14 +229,9 @@ Total mult-adds (G): 6.89
user: {prompt}\nsystem: {completion}</s>
```
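For clarity, a tiny helper that produces exactly this template for a single turn (the real preprocessing lives in the dataset transform functions and also handles multi-turn data):
```python
EOS = "</s>"

def format_example(prompt, completion):
    # Matches the template above: "user: {prompt}\nsystem: {completion}</s>"
    return f"user: {prompt}\nsystem: {completion}{EOS}"

print(format_example("用Python实现快速排序", "def quick_sort(arr): ..."))
```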
The launch command is essentially the same as for pre-training:
```bash
accelerate launch --config_file configs/default_config.yaml train_lm.py --config configs/instruct_config.yaml
```
In some cases you may need to specify the following parameters:
```
@ -203,21 +242,23 @@ accelerate launch --config_file configs/default_config.yaml instruction_tuning.p
--machine_rank
```
The loss during training is shown below; 3 epochs were used in total:
![loss](assets/instruct_loss.png)
### RLHF
Not available yet.
### Server
For multi-turn dialogue, use chat_server.py.
It is built with Gradio.
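chat_server.py itself is not shown in this commit; as a rough illustration of the Gradio setup, a minimal multi-turn chat UI could look like the sketch below, where generate_reply is a placeholder for calling the fine-tuned model.
```python
import gradio as gr

def generate_reply(history):
    # Placeholder: in chat_server.py this would format the history with the
    # "user: ...\nsystem: ..." template and decode the model's completion.
    return "(model output goes here)"

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    box = gr.Textbox(placeholder="Type a message and press Enter")

    def respond(message, chat_history):
        reply = generate_reply(chat_history + [(message, None)])
        return "", chat_history + [(message, reply)]

    box.submit(respond, [box, chatbot], [box, chatbot])

demo.launch()
```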
## Performance Comparison
### Training Framework
For the training framework we tested HuggingFace's open-source Accelerate library, pytorch-lightning, and HPC-AI's open-source ColossalAI; with the GPUs fully utilized the performance differences were small, so we chose the comparatively simple Accelerate library as the training framework.
The test code is in utils/speed_test.py.
The model configuration used for testing is:
| Model | n gpu | n layer | n heads | hidden size | vocab size | seq length |
|-------|-------|---------|---------|-------------|------------|------------|
| GPT2 | 2 | 6 | heads | 4096 | 250100 | 1024 |
@ -235,36 +276,34 @@ accelerate launch --config_file configs/default_config.yaml instruction_tuning.p
The Llama paper reports that the 6.7B model was trained on 1T tokens for a total of 82,432 GPU-hours, which works out to roughly 3370 tokens/s/GPU.
With the optimizations below, our speed is essentially on par with the paper (tested on 20x8 A100-80G); we expect further gains from adding more fused operators.
| | V1 | V2 |
|---------------------|--------------|------------------------------------|
| Dataset | self implemented | datasets |
| Model | Transformers | Transformers+xformers |
| Optimizer | Pytorch Adam | Fused Adam |
| DeepSpeed | stage2 | stage1 |
| Grad Accumulation | 4 | 12 |
| Return Padding Mask | yes | no |
| Speed token/s/gpu | 1378 | 3637 |
### Performance Comparison with Other Open-source Models
The table below summarizes the performance of current open-source models, all measured on A100 GPUs. Because model sizes and architectures differ, exact comparisons are difficult; as a rough estimate one can assume that speed is roughly inversely proportional to parameter count, which the different Llama sizes corroborate. Under that rough estimate, the performance of this project is clearly better than that of the other projects.
| Model | Open-Llama | LLAMA | LLAMA | LLAMA | OPT | Bloom | GLM | GPT-NEOX | CPM-ANT | CodeGeeX |
|---------------------|------------|----------|---------|-----------|---------|--------------------|-------|----------|---------|-----------|
| Model size | 7.0B | 6.7B | 13B | 65B | 175B | 175B | 130B | 20B | 10B | 13B |
| Token | | 1T | 1T | 1.4T | 180B | 366B | 400B | 402B | 200B | 13.9B |
| GPU Hour | | 82,432 | 135,168 | 1,022,362 | 809,472 | 1,082,990 | 43776 | 175680 | 47040 | 3072 |
| speed token/s/gpu | 3637 | 3370 | 2055 | 380 | 61.8 | 93.9 | 105.7 | 635.6 | 1181 | 1257 |
| Dependencies | xformers | xformers | | | metaseq | Megatron-DeepSpeed | | | BMtrain | MindSpore |
| speed token/s/gpu/B | 25728 | 22579 | 26715 | 24700 | 10815 | 16432 | 13741 | 12712 | 11810 | 16341 |
## Future Plans
1. Add RLHF code
2. Use [Triton](https://github.com/openai/triton) to add more high-performance operators and further improve performance
3. Add code for building pre-training datasets from Common Crawl and open-source the resulting datasets
4. Add multimodal training code
## Citation

README_en.md

@ -10,80 +10,111 @@
-->
# Open-Llama
Translated by ChatGPT.
[English README](https://github.com/Bayes-Song/Open-Llama/blob/main/README_en.md)
Open-Llama is an open-source project that offers a complete training pipeline for building large language models, ranging from dataset preparation to tokenization, pre-training, instruction tuning, and the reinforcement learning technique RLHF.
## Progress
**You can try this model directly from the [Demo](http://home.ustc.edu.cn/~sl9292/).**
## **Performance**
**By adopting the same evaluation method as the FastChat project, we compared Open-Llama with GPT-3.5; on Chinese questions it reaches 84% of GPT-3.5's performance.**
**The training speed reaches 3620 tokens/s, faster than the 3370 tokens/s reported in the original Llama paper, reaching the current state-of-the-art level.**
We also tested our model with some of the questions used to evaluate Baidu's Wenxin Yiyan; the original report is [百度“文心一言”测试:国内生成式 AI 什么水平?](https://www.8btc.com/article/6809666) ("Wenxin Yiyan" test: what is the level of domestic generative AI?).
The results are shown in the figures below; more results are yet to be tested. Due to network issues in mainland China, requests to the Demo above may be dropped; if there is no response for a long time, please refresh and try again.
![image1](assets/eng1.png)![image2](assets/eng2.png)![image3](assets/eng3.png)
The checkpoint after instruction tuning is open-sourced at [s-JoL/Open-Llama-V1](https://huggingface.co/s-JoL/Open-Llama-V1). To use the checkpoint, first install the latest version of Transformers with the following command:
```bash
pip install git+https://github.com/s-JoL/transformers.git@dev
```
The CheckPoint after pre-training only is also uploaded to [s-JoL/Open-Llama-V1-pretrain](https://huggingface.co/s-JoL/Open-Llama-V1-pretrain).
The model [PR](https://github.com/huggingface/transformers/pull/22795) has been submitted for merging into the Transformers main branch.
We have completed pre-training on 300B tokens, 80K steps in total, with a global batch size of 4M, consistent with Llama.
The instruction-tuning data is built from 7 sources; the resulting model shows some programming, mathematical, and multi-turn dialogue ability. Details are in the Instruction-Tuning section.
Below is a display of the model's multi-turn dialogue ability regarding code:
![image4](assets/multiturn_chat_en.jpeg)
We roughly estimate the cost of achieving the above results: pre-training ran for 40K steps over 150 million samples (about 110B tokens) and took 76 hours, roughly $19,152 at Google Cloud's A100 list price; instruction tuning ran for 12K steps over 1.6 million samples and took 3.4 hours, about $342. The total cost of training such a model from scratch is therefore under $20,000.
## **Updates**
This update mainly includes the following aspects, increasing the effective training speed by 50% compared to the v1 version: padding is reduced from 30% to 5% of tokens, and raw throughput improves from 3200 tokens/s to 3600 tokens/s (0.95 * 3600 / (0.7 * 3200) = 1.527).
1. Use HuggingFace's datasets library for data reading, with the process as follows:
1. Use the transform function to unify data formats from different datasets to {'text': 'xxx'}
2. Tokenize using Tokenizer
3. Sample long sequences; currently, three modes are provided: truncation, sampling (refer to the Gopher paper), and splitting
4. Optional: concatenate texts from different docs, reducing padding in the data and accelerating training. In the v1 version, padding accounted for 30%; after concatenation, padding is reduced to 5%.
2. Add Trainer, which can be reused for both pre-training and instruction fine-tuning, see solver/trainer.py
3. Unify the pre-training and instruction fine-tuning training entry to train_lm.py
4. Provide more convenient configuration, see configs/pretrain_config.yaml
5. Provide functionality to continue pre-training based on other pre-trained models and supplementing vocabulary
Currently, the model's performance in both mathematical and code-related tasks is noticeably poor. This is partially due to the training data used, but I also believe it is due to the size of the model. However, the ability to perform logical reasoning is essential for any usable model. Therefore, future updates will focus on improving this aspect of the model's capabilities.
## **Features**
### Ease of Use
We believe that ease of use is one of the most important features when building large language models. To make Open-Llama more accessible, we have focused on the following aspects:
- **Minimal implementation**: We have adopted the simplest implementation methods, lowering the entry threshold and allowing beginners to get started with ease.
- **Complete pipeline**: We have published the complete code from dataset construction to training, making every step in the process of building a large language model clear and visible.
### High Performance
Due to the high cost of training large language models, high performance is also crucial when building them. To achieve high-performance training, we have employed the following techniques:
- **Fused CUDA kernel**: Using the fused CUDA kernels provided in [xformers](https://github.com/facebookresearch/xformers) fuses multiple operations, reducing data transfer between the GPU and CPU and thereby improving training efficiency.
- **Parallelized training**: We employ the [Accelerate](https://huggingface.co/docs/accelerate/index) library to support parallelized training on multiple GPUs and speed up the training process.
For a 7B model, the training speed with the native PyTorch Llama model in Transformers is **1378 tokens/s/GPU**. Using this codebase, the training speed reaches **3626 tokens/s/GPU**, exceeding the **3370 tokens/s/GPU** reported in the [original Llama paper](https://arxiv.org/pdf/2302.13971.pdf).
Pre-training on 500B tokens would therefore take about 38,300 GPU-hours. At the Google Cloud price of 12.6 US dollars per hour for 8 A100-80G Spot GPUs, the total cost is about 60,300 US dollars.
With the unaccelerated version, the cost would be 158,744 US dollars; the optimizations reduce the training cost by about 98,000 US dollars.
For more tests, see the [performance comparison with other open-source models](https://github.com/Bayes-Song/Open-Llama/blob/main/README_en.md#performance-comparison-with-other-open-source-models).
### Versatility
When training language models, our goal is to build a versatile model that can handle different languages and domains. To achieve this, we have employed the following strategies:
- **Multi-language support**: We support a variety of language corpora, including English, Chinese, Japanese, and other languages, allowing users to choose according to their needs.
- **Domain versatility**: We hope that the model can not only help with everyday problems but also assist in professional fields such as science and law.
- **Interaction with the world**: By incorporating reinforcement learning (RL), we hope to give the model the ability to interact with the world.
## **Requirements**
- Python 3.7 or higher
- PyTorch 1.13
- A customized version of the [Transformers library](https://github.com/Bayes-Song/transformers)
- [Accelerate library](https://huggingface.co/docs/accelerate/index)
- CUDA 11.6 or higher (for GPU acceleration)
## **Getting Started**
### Installation
Use the following command to install the required dependencies:
```bash
pip install -r requirements.txt
```
### Dataset Preparation
Currently we provide the WuDao dataset open-sourced by BAAI and the Pile dataset open-sourced by EleutherAI. The dataset download and processing scripts are located in the data directory.
Because downloading the WuDao dataset requires agreeing to some terms, you may need to modify the download link in download_wudao; see [WuDao](https://data.baai.ac.cn/details/WuDaoCorporaText).
**Note that data downloads may fail. It is recommended to split the download and processing steps in the script into two runs; rerunning the download several times will resume automatically from breakpoints.**
Run the following commands to download the data and perform partitioning:
```bash
bash data/download_the_pile.sh
bash data/download_wudao.sh
```
The data will be stored as small files, with a maximum of 16384 lines per file, for easy reading during multi-process training. The storage format is jsonl.zst, compressed using zstd, with a final data size of 519.5 GB, consisting of 16,466 files in total.
The Pile dataset contains 210,607,728 JSON lines, while the Wudao dataset contains 59,132,213 JSON lines.
The specific data format is as follows:
```
WuDao
{'id': 1, 'dataType': '百科', 'title': 'some title', 'content': 'some content'}
@ -91,36 +122,68 @@ WuDao
The Pile
{'text': 'some text', 'meta': {'pile_set_name': 'Github'}}
```
Data integrity can be verified in this [issue](https://github.com/s-JoL/Open-Llama/issues/5).
### Related Tools
The utils directory provides code for training a tokenizer, extending an existing tokenizer model, and converting checkpoints.
Use SentencePiece to train a tokenizer with the following command:
```bash
python3 utils/train_tokenizer.py
```
In configs, a tokenizer model with a 40k vocabulary, trained only using the Wudao dataset (4w_cn_vocab_wudao15.model), is provided.
To supplement the vocabulary based on an existing tokenizer model, refer to:
```bash
python3 utils/merge_tokenizer.py
```
A bilingual English and Chinese tokenizer model (llama_tokenizer_extended.model) is created by merging the META official tokenizer model with the 40k Chinese tokenizer mentioned above.
To convert existing Llama model checkpoints, refer to:
```bash
python3 utils/convert_ckpt.py
```
### Data Loading
Data loading-related code can be found in dataset/dataset.py, which includes pre-training and instruction fine-tuning data processing. To add other datasets, only the transform function needs to be modified.
The data loading process is as follows:
1. Use the transform function to unify data formats from different datasets to {'text': 'xxx'}
2. Tokenize using Tokenizer
3. Sample long sequences; currently, three modes are provided: truncation, sampling (refer to the Gopher paper), and splitting
4. Optional: concatenate texts from different docs, reducing padding in the data and accelerating training. In the v1 version, padding accounted for 30%; after concatenation, padding is reduced to 5%.
Use the following command to view the output of DataLoader and check the correctness of tokenization:
```bash
python3 dataset/dataset.py
```
### Model Structure
We modified the [Llama](https://github.com/facebookresearch/llama) implementation in the Transformers library following section 2.4, Efficient implementation, of the original paper, and introduced additional optimizations from other papers. Specifically, we use the memory_efficient_attention operation from META's open-source [xformers library](https://github.com/facebookresearch/xformers) for the self-attention computation, which yields a significant performance improvement of approximately 30%. Further details can be found in [modeling_llama.py](https://github.com/s-JoL/transformers/blob/dev/src/transformers/models/open_llama/modeling_open_llama.py#L230).
Additionally, we referred to [Bloom](https://huggingface.co/bigscience/bloom) and introduced Stable Embedding for Token Embedding to better stabilize training.
Finally, we referenced [PALM](https://arxiv.org/abs/2204.02311) and employed Shared Input-Output Embeddings.
### Pre-training
We use multi-GPU parallel training based on the Accelerate library, with the following start command:
```bash
accelerate launch --config_file configs/default_config.yaml train_lm.py --config configs/pretrain_config.yaml
```
In some cases, you may need to specify the following parameters:
```
--main_process_ip
--main_process_port
@ -128,63 +191,68 @@ In some cases, it may be necessary to specify the following parameters.
--num_machines
--machine_rank
```
We use [Wandb](https://wandb.ai/) to visualize training; you need to set the WANDB_API_KEY environment variable yourself.
We use DeepSpeed stage 1 to reduce GPU memory usage. For Accelerate-related configuration, see configs/default_config.yaml.
Training related hyperparameters can be found in configs/pretrain_config.yaml.
The default parameters use LlamaTokenizer with a supplemented 40k Chinese vocabulary tokenizer model, and the model size is 7B. The specific configuration is as follows:
| max_length | batch_size | learning_rate | weight_decay | params | dimension | n heads | n layer | vocab_size |
|------------|------------------|---------------|--------------|--------|-----------|---------|---------|------------|
| 2048 | 2 | 2e-4 | 1e-1 | 7.03B | 4096 | 32 | 32 | 68762 |
```
==============================================================================================================
Layer (type:depth-idx) Output Shape Param #
==============================================================================================================
OpenLlamaForCausalLM [1, 32, 64, 128] --
├─OpenLlamaModel: 1-1 [1, 32, 64, 128] --
│ └─Embedding: 2-1 [1, 64, 4096] 281,649,152
│ └─ModuleList: 2-2 -- --
│ └─OpenLlamaDecoderLayer: 3x32 [1, 64, 4096] 202,383,360
└─OpenLlamaRMSNorm: 2-3 [1, 64, 4096] 4,096
├─Linear: 1-2 [1, 64, 68762] 281,649,152
==============================================================================================================
Total params: 7,039,569,920
Trainable params: 7,039,569,920
Non-trainable params: 0
Total mult-adds (G): 7.04
```
Pre-training loss from scratch is shown below:
![loss](assets/pretrain_loss.png)
### Instruction-Tuning
We currently use seven open-source datasets for instruction tuning; more tasks and datasets of our own will be added later.
- [yizhongw/self_instruct](https://huggingface.co/datasets/yizhongw/self_instruct)
- [BelleGroup/train_0.5M_CN](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN)
- [BelleGroup/train_1M_CN](https://huggingface.co/datasets/BelleGroup/train_1M_CN)
- [BelleGroup/multiturn_chat_0.8M](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M)
- [BelleGroup/school_math_0.25M](https://huggingface.co/datasets/BelleGroup/school_math_0.25M)
- [RyokoAI/ShareGPT52K](https://huggingface.co/datasets/RyokoAI/ShareGPT52K)
- [Graverman/Instruct-to-Code](https://huggingface.co/datasets/Graverman/Instruct-to-Code)
There were some issues with the datasets processing of the ShareGPT52K dataset, so we downloaded the original data again and reprocessed it.
We performed some preprocessing on the original data; the format is as follows:
```
user: {prompt}\nsystem: {completion}</s>
```
The startup command is basically the same as pre-training:
```bash
accelerate launch --config_file configs/default_config.yaml train_lm.py --config configs/instruct_config.yaml
```
In some cases, you may need to specify the following parameters:
```
--main_process_ip
--main_process_port
@ -193,70 +261,78 @@ In some cases, the following parameters may need to be specified:
--machine_rank
```
The loss during the process is shown below, with a total of 3 epochs:
![loss](assets/instruct_loss.png)
### RLHF
Not available yet.
### Server
For multi-turn dialogue, use chat_server.py.
Developed based on Gradio.
## Performance Comparison
### Training Framework
In terms of training frameworks, we tested HuggingFace's open-source Accelerate library, PyTorch Lightning, and HPC-AI's open-source ColossalAI. We found that their performance differences are relatively small when fully utilizing GPUs. Therefore, we chose the relatively simple-to-implement Accelerate library as the training framework.
The test code can be found in utils/speed_test.py.
The model structure used during the testing process is:
| Model | n gpu | n layer | n heads | hidden size | vocab size | seq length |
|-------|-------|---------|---------|-------------|------------|------------|
| GPT2 | 2 | 6 | heads | 4096 | 250100 | 1024 |
The test results are shown below, indicating that when the GPUs are fully utilized, the differences in speed and memory consumption are not significant.
| | HuggingFace | HuggingFace | ColossalAI | ColossalAI | ColossalAI |
|-----------------|-----------------------------------|------------------------------------|--------------------------------------------------------|--------------------------------------------------------|------------------------------------|
| config | without activation ckpt, bs2 | without activation ckpt, max_bs=12 | with activation ckpt, bs2 | without activation ckpt, bs2 | without activation ckpt, max_bs=10 |
| second pre step | 0.336, fw=0.033, bw=0.3, opt=5e-6 | 1.25 | 0.347 | 0.308, fw=0.067, bw=0.152, opt=0.088 | 1.055 |
| gpu memory | nvidia-smi 45445 | | fw+bw+opt=21053.63+22064.12+17987.52, nvidia-smi 40961 | fw+bw+opt=24684.74+21087.13+17987.52, nvidia-smi 46821 | oom after 10 steps |
### Performance Optimization
In the earliest version, we used the native Llama implementation from DeepSpeed stage2 + Transformers for training. However, the speed was significantly different from what was claimed in the paper. Therefore, we carried out a series of optimizations afterwards, and we list each step of the performance improvement below for reference.
The paper mentioned that for the 6.7B model, 1T token was used for training and the final GPU time was 82432, from which the training speed was roughly calculated as 3370 token/s/gpu. After using the following optimizations, the speed is now basically consistent with what was claimed in the paper when tested on 20x8 A100-80G. It is expected that more fusion operators will be added in the future to achieve better performance.
| | V1 | V2 |
|---------------------|--------------|------------------------------------|
| Dataset | self implemented | datasets |
| Model | Transformers | Transformers+xformers |
| Optimizer | Pytorch Adam | Fused Adam |
| DeepSpeed | stage2 | stage1 |
| Grad Accumulation | 4 | 12 |
| Return Padding Mask | yes | no |
| Speed token/s/gpu | 1378 | 3637 |
### Performance Comparison with Other Open-source Models
The following table summarizes the performance of currently available open-source models. In all cases, the GPU device used is A100. Due to differences in the size and structure of the models, it is difficult to make accurate performance comparisons. As a rough estimate, it can be assumed that the speed is generally inversely proportional to the size of the model parameters, which is confirmed by the performance of Llama with models of different sizes. Based on this rough estimate, it can be seen that the performance using our project is significantly better than that of other projects.
| Model | Open-Llama | LLAMA | LLAMA | LLAMA | OPT | Bloom | GLM | GPT-NEOX | CPM-ANT | CodeGeeX |
|---------------------|------------|----------|---------|-----------|---------|--------------------|-------|----------|---------|-----------|
| Model size | 7.0B | 6.7B | 13B | 65B | 175B | 175B | 130B | 20B | 10B | 13B |
| Token | | 1T | 1T | 1.4T | 180B | 366B | 400B | 402B | 200B | 13.9B |
| GPU Hour | | 82,432 | 135,168 | 1,022,362 | 809,472 | 1,082,990 | 43776 | 175680 | 47040 | 3072 |
| speed token/s/gpu | 3637 | 3370 | 2055 | 380 | 61.8 | 93.9 | 105.7 | 635.6 | 1181 | 1257 |
| Dependencies | xformers | xformers | | | metaseq | Megatron-DeepSpeed | | | BMtrain | MindSpore |
| speed token/s/gpu/B | 25728 | 22579 | 26715 | 24700 | 10815 | 16432 | 13741 | 12712 | 11810 | 16341 |
## Future Plans
1. Integrate RLHF code.
2. Use Triton to add more high-performance operators to further improve performance.
3. Add code for building pre-training datasets based on Common Crawl and open related datasets.
4. Add code for multimodal training.
## References
```
@misc{openllama,
title={Open-Llama},
@ -264,4 +340,4 @@ Build an online demo using Gradio.
year={2023},
howpublished={\url{https://github.com/Bayes-Song/Open-Llama}},
}
```

(Binary image files under assets/ were added, removed, or updated; previews are not shown.)

configs/pretrain_config.yaml

@ -1,7 +1,9 @@
data:
mode: "pretrain"
data:
mixed: "data/pretrain_data/part-*.jsonl.zst"
wudao: "data/pretrain_data/part-wudao*.jsonl.zst"
    # Since the Llama checkpoint is loaded, only a small amount of English data is used
the_pile: "data/pretrain_data/part-pile-1*.jsonl.zst"
pad_to_max: False
sequence_sample_mode: "none"
concat_multiple_sequence: True
@ -16,18 +18,19 @@ model:
shared_input_output_embedding: False
train:
train_batch_size: 2
num_training_steps: 500000
num_warmup_steps: 2000
initializer_range: 1.0e-2
lr: 2.0e-4
weight_decay: 1.0e-1
# Pre-trained weights to load; set to null to train from scratch
ckpt: "data/llama_raw_ckpt/7B/extended.pth"
train_num_workers: 16
gradient_accumulation_steps: 12
prefetch_factor: 100
# global step
log_interval: 5
eval_interval: 500
save_interval: 1000
work_dir: "data/saved_ckpt/7B"
project_name: "Llama Pretrain"
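For orientation, this is roughly how such a config would be consumed on the Python side; train_lm.py's actual loading code is not shown in this commit, so treat the snippet as an assumption.
```python
import yaml

# Load the training configuration and read a few of the fields shown above.
with open("configs/pretrain_config.yaml") as f:
    config = yaml.safe_load(f)

print(config["data"]["mode"])   # "pretrain"
print(config["train"]["ckpt"])  # path to pre-trained weights, or None to train from scratch
```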

dataset/dataset.py

@ -153,40 +153,27 @@ def get_labels_gen(pad_token_id):
def construct_dataset(dataset_config, tokenizer, return_raw_text=False):
datasets = []
probabilities = []
    # For now only one dataset is used; with multiple datasets, multi-process reading cannot be used, which makes loading slow
assert len(dataset_config["data"]) == 1
all_data_files = []
for name, pattern in dataset_config["data"].items():
data_files = glob(pattern)
assert len(data_files) > 0
dataset = load_dataset(
"json", data_files=data_files, split="train", streaming=True
)
# shuffle
dataset = dataset.shuffle()
        # Text preprocessing: convert every dataset to a unified format
if dataset_config["mode"] == "pretrain":
dataset = dataset.map(pretrain_transform, batched=True, batch_size=1)
elif dataset_config["mode"] == "instruct":
dataset = dataset.map(instruct_transform, batched=True, batch_size=1)
dataset = dataset.select_columns("text")
dataset = dataset.map(split_multiturn, batched=True, batch_size=1)
else:
raise Exception(
"Dataset mode: {} not found.".format(dataset_config["mode"])
)
datasets.append(dataset)
probabilities.append(dataset.n_shards)
probabilities_sum = sum(probabilities)
    # Sample from the different data parts according to these probabilities
probabilities = [p / probabilities_sum for p in probabilities]
if len(datasets) > 1:
full_dataset = interleave_datasets(
datasets, probabilities=probabilities, seed=42
)
all_data_files.extend(data_files)
dataset = load_dataset(
"json", data_files=all_data_files, split="train", streaming=True
)
# shuffle
dataset = dataset.shuffle()
    # Text preprocessing: convert every dataset to a unified format
if dataset_config["mode"] == "pretrain":
dataset = dataset.map(pretrain_transform, batched=True, batch_size=1)
elif dataset_config["mode"] == "instruct":
dataset = dataset.map(instruct_transform, batched=True, batch_size=1)
dataset = dataset.select_columns("text")
dataset = dataset.map(split_multiturn, batched=True, batch_size=1)
else:
full_dataset = datasets[0]
raise Exception("Dataset mode: {} not found.".format(dataset_config["mode"]))
full_dataset = dataset
# to visualize
if return_raw_text:

train_lm.py

@ -72,6 +72,7 @@ def main(argv):
if config["train"]["ckpt"] is not None:
ckpt = torch.load(config["train"]["ckpt"])
raw_model.load_state_dict(ckpt)
print('Loaded ckpt from: {}'.format(config["train"]["ckpt"]))
trainer = Trainer(config, raw_model, train_loader, tokenizer, accelerator)
trainer.train()