* @Author: LiangSong(sl12160010@gmail.com)
* @Date: 2023-03-10 21:18:35
* @LastEditors: LiangSong(sl12160010@gmail.com)
* @LastEditTime: 2023-05-04 20:23:09
* @FilePath: /Open-Llama/README.md
* @LastEditTime: 2023-05-04 20:23:14
* @FilePath: /Open-Llama/README_en.md
* @Description:
* Copyright (c) 2023 by LiangSong(sl12160010@gmail.com), All Rights Reserved.
# Open-Llama
<img alt="GitHub last commit" src="https://img.shields.io/github/last-commit/s-JoL/Open-Llama">
Open-Llama is an open-source project that offers a complete training pipeline for building large language models, from dataset preparation through tokenization, pre-training, and instruction tuning to reinforcement learning from human feedback (RLHF).
**You can try this model directly from the [Demo](http://home.ustc.edu.cn/~sl9292/).**
## **Main contents**
- **Support Transformers/HuggingFace.** The instruction-tuned checkpoint is open-sourced at [HuggingFace: s-JoL/Open-Llama-V1](https://huggingface.co/s-JoL/Open-Llama-V1).
- **Evaluated with the same method as the FastChat project, Open-Llama reaches 84% of GPT3.5's performance on Chinese questions.**
- **The training speed reaches 3620 tokens/s, faster than the 3370 tokens/s reported in the original Llama paper, reaching the current state-of-the-art level.**
To use the CheckPoint, first, install the latest version of Transformers with the following command:
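``` bash
pip install git+https://github.com/huggingface/transformers.git
```
``` python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the instruction-tuned checkpoint and move the model to the GPU
tokenizer = AutoTokenizer.from_pretrained("s-JoL/Open-Llama-V1", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("s-JoL/Open-Llama-V1").cuda()

# The prompt uses the "user:...\nsystem:" layout from instruction tuning
inputs = tokenizer('user:implement quick sort in python\nsystem:', return_tensors='pt', return_attention_mask=False)
for k, v in inputs.items():
    inputs[k] = v.cuda()
pred = model.generate(**inputs, max_new_tokens=512, do_sample=True)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```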
The CheckPoint after pre-training only is also uploaded to [s-JoL/Open-Llama-V1-pretrain](https://huggingface.co/s-JoL/Open-Llama-V1-pretrain).
The model [PR](https://github.com/huggingface/transformers/pull/22795) has been submitted for merging into the Transformers main branch.
We have completed pre-training on 300B tokens, for a total of 80K steps. The global batch size is 4M, consistent with Llama.
The instruction-tuning data is built from seven datasets and gives the model some programming, mathematical, and multi-turn dialogue ability. The specific datasets are listed in the Instruction-Tuning section.
Below is a display of the model's multi-turn dialogue ability regarding code:
## **Updates**
**[2023.4.28] Release v2.0**
This update increases the effective training speed by **50%** compared to v1: padding drops from **30%** to **5%**, and raw training speed rises from **3200 tokens/s** to **3600 tokens/s**, so 0.95 * 3600 / (0.7 * 3200) = 1.527. The main changes are:
1. Use HuggingFace's datasets library for data loading, with the following process:
   1. Use the transform function to unify data from different datasets into the format {'text': 'xxx'}
   2. Tokenize with the Tokenizer
   3. Sample long sequences; three modes are currently provided: truncation, sampling (see the [Gopher paper](https://arxiv.org/abs/2112.11446)), and splitting
   4. Optionally concatenate texts from different docs, which reduces padding and accelerates training. In v1, padding accounted for **30%** of the data; with concatenation it drops to **5%**.
2. Add a Trainer that is reused for both pre-training and instruction fine-tuning, see solver/trainer.py
3. Unify the pre-training and instruction fine-tuning entry point as train_lm.py
4. Provide more convenient configuration, see configs/pretrain_config.yaml
5. Support supplementing the vocabulary of another pre-trained model and continuing pre-training from it
6. Support resuming training from a checkpoint, including loading optimizer parameters and learning rate and skipping data that was already consumed
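The 1.527 figure quoted for this release is a ratio of effective throughput, that is, raw tokens/s multiplied by the fraction of tokens that are not padding. A minimal sketch of that arithmetic, using only the numbers stated above (the helper name `effective_speed` is illustrative and not part of the codebase):

```python
def effective_speed(tokens_per_sec: float, pad_fraction: float) -> float:
    """Throughput counted over non-padding tokens only."""
    return tokens_per_sec * (1.0 - pad_fraction)


v1 = effective_speed(3200, 0.30)  # v1: 30% of each batch is padding
v2 = effective_speed(3600, 0.05)  # v2: 5% padding after concatenation
speedup = v2 / v1
print(f"effective v1={v1:.0f} tokens/s, v2={v2:.0f} tokens/s, speedup={speedup:.3f}")
```

Most of the gain comes from removing padding rather than from the raw speed increase, which on its own is only 3600 / 3200 = 1.125.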
**[2023.4.16] Release v1.0**
Basic pre-training and instruction fine-tuning code is provided, with a training speed comparable to that of the original Llama. The pre-trained and fine-tuned models are already open-sourced on HuggingFace.
The v1 code is available at https://github.com/s-JoL/Open-Llama/tree/v1.0
## **Features**
### Easy to use
We believe that ease of use is one of the most important features when building large language models. To make Open-Llama more accessible, we have focused on the following aspects:
- **Minimal implementation**: We use the simplest implementation we can, lowering the entry threshold so that beginners can get started with ease.
- **Complete pipeline**: We publish the complete code from dataset construction to training, so that every step of building a large language model is visible.
### High performance
Because training large language models is expensive, high performance is also crucial. To achieve it, we employ the following techniques:
- **Fused CUDA kernel**: The fused CUDA kernels provided by [xformers](https://github.com/facebookresearch/xformers) combine multiple operations into a single kernel, which reduces intermediate GPU memory traffic and kernel-launch overhead and thereby improves training efficiency.
- **Parallelized training**: We use the [Accelerate](https://huggingface.co/docs/accelerate/index) library to support parallelized training on multiple GPUs and speed up training.
For a 7B model, the training speed with the native PyTorch Llama model in Transformers is **1378 tokens/s/GPU**. Using this codebase, the training speed reaches **3626 tokens/s/GPU**, exceeding the **3370 tokens/s/GPU** reported in the [original Llama paper](https://arxiv.org/pdf/2302.13971.pdf).
Pre-training on 500B tokens requires about 38,300 GPU hours. At the Google Cloud Spot price of 12.6 US dollars per hour for 8 A100-80G GPUs, the total cost is about 60,300 US dollars.
With the unaccelerated version, the cost would be 158,744 US dollars, so the optimizations reduce the final training cost by about 98,000 US dollars.
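The cost figures above follow directly from the throughput numbers. The sketch below reproduces them under the stated assumptions only (500B tokens, 12.6 US dollars per hour for an 8-GPU node); the function names are illustrative and not part of the codebase.

```python
def gpu_hours(total_tokens: float, tokens_per_sec_per_gpu: float) -> float:
    """GPU hours needed to process total_tokens at a given per-GPU speed."""
    return total_tokens / tokens_per_sec_per_gpu / 3600


def cost_usd(hours: float, price_per_8gpu_hour: float = 12.6) -> float:
    """Cost when GPUs are rented in nodes of 8 at an hourly node price."""
    return hours / 8 * price_per_8gpu_hour


TOKENS = 500e9
fast_hours = gpu_hours(TOKENS, 3626)  # this codebase
slow_hours = gpu_hours(TOKENS, 1378)  # native Transformers Llama
fast_cost = cost_usd(fast_hours)
slow_cost = cost_usd(slow_hours)
savings = slow_cost - fast_cost
print(f"{fast_hours:,.0f} GPU hours, ${fast_cost:,.0f} vs ${slow_cost:,.0f}, saving ${savings:,.0f}")
```

The small difference from the rounded 60,300 figure comes from rounding the GPU hours to 38,300 before multiplying.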
For more testing, see [performance comparison with other open-source models](https://github.com/Bayes-Song/Open-Llama#%E5%92%8C%E5%85%B6%E4%BB%96%E5%BC%80%E6%BA%90%E6%A8%A1%E5%9E%8B%E6%80%A7%E8%83%BD%E5%AF%B9%E6%AF%94).
### Versatility
When training language models, our goal is to build a versatile model that can handle different languages and domains. To achieve this, we have employed the following strategies:
- **Multi-language support**: We support corpora in multiple languages, including English, Chinese, and Japanese, so that users can choose according to their needs.
- **Domain versatility**: We hope that the model can help not only with everyday questions but also in professional domains such as science and law.
- **Interaction with the world**: By incorporating reinforcement learning (RL), we hope to give the model the ability to interact with the world.
## **Requirements**
- Python 3.7 or higher
- PyTorch 1.13
- Special version of the [Transformers library](https://github.com/Bayes-Song/transformers)
- [Accelerate library](https://huggingface.co/docs/accelerate/index)
- CUDA 11.6 or higher (for GPU acceleration)
- Hardware configuration: currently (64 CPU, 1000G Memory, 8xA100-80G) x N. Curiously, training runs slightly slower when more CPUs are used; we suspect this is related to the multi-processing of the dataloader.
## **Getting Started**
### Installation
Use the following command to install related dependencies:
pip install -r requirements.txt
### Dataset Preparation
Currently, the Wudao dataset open-sourced by Zhiyuan and the Pile dataset open-sourced by EleutherAI are provided. Dataset download and processing scripts are located in the data directory.
Because downloading the Wudao dataset requires accepting an agreement, you may need to modify the link in download_wudao. See [Wudao](https://data.baai.ac.cn/details/WuDaoCorporaText).
**Note that data download may fail. It is recommended to split the download and processing steps in the script into two parts so that each can be retried; downloads resume automatically from where they stopped.**
Run the following commands to download the data and perform partitioning:
bash data/download_the_pile.sh
bash data/download_wudao.sh
The data is stored as small files of at most 16,384 lines each, for easy reading during multi-process training. The storage format is jsonl.zst (JSON Lines compressed with zstd). The final data size is 519.5 GB across 16,466 files.
The Pile dataset contains 210,607,728 JSON lines, while the Wudao dataset contains 59,132,213 JSON lines.
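The partitioning into files of at most 16,384 lines can be pictured with the sketch below. It is an illustration only: the helper `shard` is hypothetical, it works on any in-memory iterable, and it leaves out the zstd compression and file naming that the real scripts in the data directory handle.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shard(lines: Iterable[T], max_lines: int = 16384) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most max_lines items."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, max_lines))
        if not chunk:
            return
        yield chunk


# Toy run with a tiny shard size so the result is easy to inspect
small_shards = list(shard(range(5), max_lines=2))
print(small_shards)
```

Each yielded chunk would correspond to one output file, so worker processes can each read whole files without coordinating on offsets.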
The specific data format is as follows:
WuDao
{'id': 1, 'dataType': '百科', 'title': 'some title', 'content': 'some content'}
The Pile
{'text': 'some text', 'meta': {'pile_set_name': 'Github'}}
To verify data integrity, see this [issue](https://github.com/s-JoL/Open-Llama/issues/5).
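Since the two datasets use different record layouts, they have to be mapped to a common {'text': ...} form before tokenization. The sketch below shows one way to do that for the two sample records above. How the title and content of a WuDao record are combined is an assumption made for illustration; the actual transform lives in dataset/dataset.py and may differ.

```python
from typing import Any, Dict


def transform(record: Dict[str, Any]) -> Dict[str, str]:
    """Map a WuDao-style or Pile-style record to {'text': ...}."""
    if "content" in record:  # WuDao-style record
        title = record.get("title", "")
        # Assumption: prepend the title on its own line when present
        text = f"{title}\n{record['content']}" if title else record["content"]
        return {"text": text}
    if "text" in record:  # Pile-style record
        return {"text": record["text"]}
    raise KeyError("record has neither 'content' nor 'text'")


wudao = {'id': 1, 'dataType': '百科', 'title': 'some title', 'content': 'some content'}
pile = {'text': 'some text', 'meta': {'pile_set_name': 'Github'}}
print(transform(wudao))
print(transform(pile))
```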
### Related Tools
The utils directory provides code for training a tokenizer, supplementing an existing tokenizer model, and converting checkpoints.
Use SentencePiece to train a tokenizer with the following command:
python3 utils/train_tokenizer.py
The configs directory provides a tokenizer model with a 40k vocabulary trained only on the Wudao dataset: 4w_cn_vocab_wudao15.model.
To supplement the vocabulary based on an existing tokenizer model, refer to:
python3 utils/merge_tokenizer.py
Merging the official Meta tokenizer model with the 40k Chinese tokenizer above yields a bilingual Chinese-English tokenizer model: llama_tokenizer_extended.model.
To convert existing Llama model checkpoints, refer to:
python3 utils/convert_ckpt.py
### Data Loading
Data loading-related code can be found in dataset/dataset.py, which includes pre-training and instruction fine-tuning data processing. To add other datasets, only the transform function needs to be modified.
The data loading process is as follows:
1. Use the transform function to unify data formats from different datasets to {'text': 'xxx'}
2. Tokenize using Tokenizer
3. Sample long sequences; currently, three modes are provided: truncation, sampling (refer to the Gopher paper), and splitting
4. Optional: concatenate texts from different docs, reducing padding in the data and accelerating training. In the v1 version, padding accounted for 30%; after concatenation, padding is reduced to 5%.
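Steps 3 and 4 of the list above can be sketched on plain lists of token ids as follows. This is a simplified illustration, not the code in dataset/dataset.py: the function names, the use of a random window for the sampling mode, the EOS separator between documents, and padding only the last row are all assumptions.

```python
import random
from typing import List, Optional


def shorten(ids: List[int], max_len: int, mode: str,
            rng: Optional[random.Random] = None) -> List[List[int]]:
    """Reduce one tokenized document to chunks of at most max_len tokens."""
    if len(ids) <= max_len:
        return [list(ids)]
    if mode == "truncate":  # keep only the head of the document
        return [list(ids[:max_len])]
    if mode == "sample":  # keep one randomly placed window
        rng = rng or random.Random()
        start = rng.randint(0, len(ids) - max_len)
        return [list(ids[start:start + max_len])]
    if mode == "split":  # keep everything as consecutive chunks
        return [list(ids[i:i + max_len]) for i in range(0, len(ids), max_len)]
    raise ValueError(f"unknown mode: {mode}")


def pack(docs: List[List[int]], max_len: int, eos_id: int, pad_id: int) -> List[List[int]]:
    """Concatenate documents separated by eos_id and cut into fixed-length rows."""
    stream: List[int] = []
    for ids in docs:
        stream.extend(ids)
        stream.append(eos_id)
    rows = [stream[i:i + max_len] for i in range(0, len(stream), max_len)]
    if rows:  # only the final row can be short, so only it is padded
        rows[-1] = rows[-1] + [pad_id] * (max_len - len(rows[-1]))
    return rows


def pad_fraction(rows: List[List[int]], pad_id: int) -> float:
    """Share of positions occupied by padding."""
    total = sum(len(r) for r in rows)
    return sum(r.count(pad_id) for r in rows) / total if total else 0.0


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
rows = pack(docs, max_len=4, eos_id=0, pad_id=-1)
print(rows, pad_fraction(rows, pad_id=-1))
```

Without concatenation each document would be padded to the full length separately, which is where the much higher padding share in v1 comes from.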
Use the following command to view the output of DataLoader and check the correctness of tokenization:
python3 dataset/dataset.py
### Model Structure
Starting from the [Llama](https://github.com/facebookresearch/llama) implementation in the Transformers library, we made modifications following Section 2.4, "Efficient implementation", of the original paper, and also drew on other papers for further optimizations. Specifically, we use the memory_efficient_attention operation from the [xformers library](https://github.com/facebookresearch/xformers) open-sourced by Meta for self-attention computation, which improves performance by approximately 30%. Further details can be found in [modeling_llama.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/open_llama/modeling_open_llama.py#L229).
Additionally, following [Bloom](https://huggingface.co/bigscience/bloom), we introduced Stable Embedding for the token embedding to better stabilize training.
Finally, following [PALM](https://arxiv.org/abs/2204.02311), we use shared input-output embeddings.
### Pre-training
We use multi-GPU parallel training based on the Accelerate library, with the following start command:
accelerate launch --config_file configs/default_config.yaml train_lm.py --config configs/pretrain_config.yaml
In some cases, you may need to specify the following parameters:
We use [Wandb](https://wandb.ai/) to visualize training. You need to set the WANDB_API_KEY environment variable yourself.
We use DeepSpeed stage1 to reduce GPU memory usage. For Accelerate-related configuration, see configs/default_config.yaml.
Training-related hyperparameters can be found in configs/pretrain_config.yaml.
The default parameters use LlamaTokenizer with a tokenizer model supplemented by the 40k Chinese vocabulary, and the model size is 7B. The specific configuration is as follows:
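| max_length | batch_size | learning_rate | weight_decay | params | dimension | n heads | n layer | vocab_size |
|------------|------------|---------------|--------------|--------|-----------|---------|---------|------------|
| 2048 | 2 | 2e-4 | 1e-1 | 7.03B | 4096 | 32 | 32 | 68762 |

    Layer (type:depth-idx) Output Shape Param #
    OpenLlamaForCausalLM [1, 32, 64, 128] --
    ├─OpenLlamaModel: 1-1 [1, 32, 64, 128] --
    │ └─Embedding: 2-1 [1, 64, 4096] 281,649,152
    │ └─ModuleList: 2-2 -- --
    │ │ └─OpenLlamaDecoderLayer: 3x32 [1, 64, 4096] 202,383,360
    │ └─OpenLlamaRMSNorm: 2-3 [1, 64, 4096] 4,096
    ├─Linear: 1-2 [1, 64, 68762] 281,649,152
    Total params: 7,039,569,920
    Trainable params: 7,039,569,920
    Non-trainable params: 0
    Total mult-adds (G): 7.04
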
Pre-training loss from scratch is shown below:
### Instruction-Tuning
We use the currently available seven datasets for Instruction-tuning, and more tasks and our own datasets will be added later.
- [yizhongw/self_instruct](https://huggingface.co/datasets/yizhongw/self_instruct)
- [BelleGroup/train_0.5M_CN](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN)
- [BelleGroup/train_1M_CN](https://huggingface.co/datasets/BelleGroup/train_1M_CN)
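- [BelleGroup/multiturn_chat_0.8M](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M)
- [BelleGroup/school_math_0.25M](https://huggingface.co/datasets/BelleGroup/school_math_0.25M)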
- [anon8231489123/ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered)
- [Graverman/Instruct-to-Code](https://huggingface.co/datasets/Graverman/Instruct-to-Code)
The ShareGPT_Vicuna_unfiltered dataset has some issues when processed with the datasets library, so we directly downloaded the original data and reprocessed it.
We preprocessed the original data into the following format:
user: {prompt}\nsystem: {completion}</s>
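As an illustration of this template, the sketch below renders a list of (prompt, completion) pairs into training text. The helper `format_dialogue` is hypothetical, and joining successive turns with a newline is an assumption; the preprocessing in the repository may concatenate turns differently.

```python
from typing import List, Tuple

EOS = "</s>"  # end-of-sequence marker from the template above


def format_dialogue(turns: List[Tuple[str, str]]) -> str:
    """Render (prompt, completion) pairs as 'user: ...\\nsystem: ...</s>'."""
    return "\n".join(f"user: {prompt}\nsystem: {completion}{EOS}"
                     for prompt, completion in turns)


single = format_dialogue([("hi", "hello")])
multi = format_dialogue([("hi", "hello"), ("1+1?", "2")])
print(multi)
```

At inference time the prompt stops right after the system marker, as in the quick-start example, so that the model generates the completion.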
The startup command is basically the same as pre-training:
accelerate launch --config_file configs/default_config.yaml train_lm.py --config configs/instruct_config.yaml
In some cases, you may need to specify the following parameters:
The loss during the process is shown below, with a total of 3 epochs:
### RLHF
Not available yet.
### Server
For multi-turn dialogue, use chat_server.py, which is built on Gradio.
## Performance Comparison
### Training Framework
In terms of training frameworks, we tested HuggingFace's open-source Accelerate library, PyTorch Lightning, and HPC-AI's open-source ColossalAI. We found that their performance differences are relatively small when fully utilizing GPUs. Therefore, we chose the relatively simple-to-implement Accelerate library as the training framework.
The test code can be found in utils/speed_test.py.
The model structure used during the testing process is:
| Model | n gpu | n layer | n heads | hidden size | vocab size | seq length |
|-------|-------|---------|---------|-------------|------------|------------|
| GPT2 | 2 | 6 | heads | 4096 | 250100 | 1024 |
The test results are shown below, indicating that when the GPUs are fully utilized, the differences in speed and memory consumption are not significant.
| | HuggingFace | HuggingFace | ColossalAI | ColossalAI | ColossalAI |
|---|---|---|---|---|---|
| config | without activation ckpt, bs2 | without activation ckpt, max_bs=12 | with activation ckpt, bs2 | without activation ckpt, bs2 | without activation ckpt, max_bs=10 |
| seconds per step | 0.336, fw=0.033, bw=0.3, opt=5e-6 | 1.25 | 0.347 | 0.308, fw=0.067, bw=0.152, opt=0.088 | 1.055 |
| gpu memory | nvidia-smi 45445 | | fw+bw+opt=21053.63+22064.12+17987.52, nvidia-smi 40961 | fw+bw+opt=24684.74+21087.13+17987.52, nvidia-smi 46821 | oom after 10 steps, suspected memory leak |
### Performance Optimization
In the earliest version, we trained with DeepSpeed stage2 and the native Llama implementation in Transformers. However, the speed was far below what the paper reports, so we carried out a series of optimizations; the performance gain from each step is listed below for reference.
The paper states that the 6.7B model was trained on 1T tokens using 82,432 GPU hours, which gives a training speed of roughly 3370 tokens/s/GPU. With the optimizations below, our speed is now basically consistent with the paper when tested on 20x8 A100-80G. Adding more fused operators is expected to improve performance further.
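The 3370 tokens/s/GPU reference value follows from the two numbers quoted from the paper. A short sketch of the calculation (the function name is illustrative; the inputs are 1T tokens and 82,432 GPU hours as stated above):

```python
def tokens_per_sec_per_gpu(total_tokens: float, gpu_hours: float) -> float:
    """Average per-GPU throughput implied by a total token count and GPU hours."""
    return total_tokens / (gpu_hours * 3600)


llama_speed = tokens_per_sec_per_gpu(1e12, 82432)
# Relative gain of the optimized code (3637) over the first version (1378)
gain = 3637 / 1378
print(f"Llama 6.7B: {llama_speed:.0f} tokens/s/GPU, V2 over V1: {gain:.2f}x")
```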
| | V1 | V2 |
|---|---|---|
| Return Padding Mask | yes | no |
| Speed token/s/gpu | 1378 | 3637 |
### Comparison with Other Open-source Models
The following table summarizes the training performance of currently available open-source models. All results were obtained on A100 GPUs. Because the models differ in size and structure, an exact comparison is difficult. As a rough estimate, speed can be assumed to be inversely proportional to the number of model parameters, which is consistent with the figures for Llama models of different sizes. Under this estimate, the performance of our project is significantly better than that of the other projects.
| Model | Open-Llama | LLAMA | LLAMA | LLAMA | OPT | Bloom | GLM | GPT-NEOX | CPM-ANT | CodeGeeX |
|---|---|---|---|---|---|---|---|---|---|---|
| Dependencies | xformers | xformers | | | metaseq | Megatron-DeepSpeed | | | BMtrain | MindSpore |
| speed token*params B/s/gpu | 25728 | 22579 | 26715 | 24700 | 10815 | 16432 | 13741 | 12712 | 11810 | 16341 |
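The last row normalizes throughput by model size, following the rough assumption that speed is inversely proportional to parameter count. The sketch below reproduces only the first LLAMA column, since its inputs (3370 tokens/s/GPU and 6.7B parameters) are the only ones stated in the text here; the function name is illustrative.

```python
def normalized_throughput(tokens_per_sec_per_gpu: float, params_billion: float) -> float:
    """Throughput scaled by model size, in token * B params / s / GPU."""
    return tokens_per_sec_per_gpu * params_billion


llama_6_7b = normalized_throughput(3370, 6.7)
# Ratio of the Open-Llama table entry to the LLAMA 6.7B entry
ratio = 25728 / llama_6_7b
print(f"LLAMA 6.7B: {llama_6_7b:.0f}, Open-Llama relative: {ratio:.2f}x")
```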
## Future Plans
1. Integrate RLHF code.
2. Use [Triton](https://github.com/openai/triton) to add more high-performance operators and further improve performance.
3. Add code for building pre-training datasets from Common Crawl and open-source the resulting datasets.
4. Add code for multimodal training.
## References
<!-- Some details we had not noticed before
1. [GPT3](https://arxiv.org/pdf/2005.14165.pdf), Details of Model Training
During training we always train on sequences of the full nctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency. Sequences with multiple documents are not masked in any special way but instead documents within a sequence are delimited with a special end of text token, giving the language model the information necessary to infer that context separated by the end of text token is unrelated. This allows for efficient training without need for any special sequence-specific masking.
Sequence length A sequence length of 2048 was used for all models. Input examples are concatenated together and then split into sequences of exactly 2048 tokens, so that there are no padding tokens, but examples may be split in the middle. Input examples are differentiated from one another with a special [eod] token.
2. GPT3, Common Crawl Filtering
High-quality text is used as positive examples and all other samples as negative examples. Documents are filtered by the predicted probability of being positive: a document is kept if np.random.pareto(α) > 1 - document_score.
The classifier is trained using logistic regression classifier with features from Spark's standard tokenizer and HashingTF.
3. GPT3, fuzzy deduplication
we fuzzily deduplicated documents (i.e. removed documents with high overlap with other documents) within each dataset using Spark's MinHashLSH implementation with 10 hashes
4. GPT3, Test Set Contamination
5. [The pile](https://arxiv.org/pdf/2101.00027.pdf), BPB(bits per UTF-8 encoded byte)/bits per character/perplexity
BPB = (L_T / L_B) \ell / \ln(2) \\
perplexity = P(w_1, w_2, \ldots, w_T)^{-\frac{1}{T}} \\
bpc = -\frac{1}{T}\sum_i \log_2 P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \\
2^{bpc} = (\prod_i P(w_i \mid w_1, w_2, \ldots, w_{i-1}))^{-\frac{1}{T}} = perplexity
6. The pile, diversity of the collected data
We hypothesize that this is due to the perplexity based filtering used in CC-100, where a language model is trained on Wikipedia and all data with a perplexity too high or too low is discarded. This effectively discards any data too similar to or too different from Wikipedia, which severely limits the diversity of the collected data.
7. The pile, bytes per token
Since the GPT-2 BPE tokenizer is trained on WebText, the mean bytes per token is also a very rough indicator of how syntactically different each Pile component is from WebText.
8. The pile, Deduplication
We used 10 hash functions for each Minhash and an approximate Jaccard similarity of 0.5.
9. GLM, Embedding Layer Gradient Shrink
Similar to stable embedding.
word-embedding = word-embedding * \alpha + word-embedding.detach() * (1 - \alpha)
10. PALM, Training Instability
Instead, we found that a simple strategy to effectively mitigate the issue: We re-started training from a checkpoint roughly 100 steps before the spike started, and skipped roughly 200-500 data batches, which cover the batches that were seen before and during the spike. With this mitigation, the loss did not spike again at the same point. We do not believe that the spikes were caused by “bad data” per se, because we ran several ablation experiments where we took the batches of data that were surrounding the spike, and then trained on those same data batches starting from a different, earlier checkpoint. In these cases, we did not see a spike. This implies that spikes only occur due to the combination of specific data batches with a particular model parameter state.
11. [Chinchilla](https://arxiv.org/pdf/2203.15556.pdf), Optimal model scaling
20 tokens per parameter, for example 10B model should use 200B tokens to pretrain
12. [Gopher](https://arxiv.org/pdf/2112.11446.pdf), Quality Filtering
Quality Filtering (MassiveWeb only) The vast majority of text found on the web is of insufficient
quality to be useful for language model training. For example, many web pages contain primarily
automatically generated content, or text that is not intended for human consumption (such as keywords
for search-engine optimisation). Much of the web also comprises social media content, which can
variously lack context, coherence, or substance. To remove low-quality data while minimising potential
for bias, we apply a number of simple, easily understood heuristic filters: we remove any document
that does not contain between 50 and 100,000 words, or whose mean word length is outside the
range of 3 to 10 characters; we remove any document with a symbol-to-word ratio greater than 0.1
for either the hash symbol or the ellipsis; and we remove any document with more than 90% of lines
starting with a bullet point, or more than 30% ending with an ellipsis. We also require that 80%
of words in a document contain at least one alphabetic character, and apply a "stop word" filter, to
remove documents that do not contain at least two of the following English words: the, be, to, of, and,
that, have, with; this adequately deals with ostensibly English documents that contain no coherent
English text.
13. Gopher, Constructing Token Sequences
-->

[**中文**](./README.md) | [**English**](./README_en.md)
# Open-Llama
<p align="center">
<img alt="GitHub" src="https://img.shields.io/github/license/s-JoL/Open-Llama.svg?color=blue&style=flat-square">
<img alt="GitHub release (latest by date)" src="https://img.shields.io/github/v/release/s-JoL/Open-Llama">
<img alt="GitHub top language" src="https://img.shields.io/github/languages/top/s-JoL/Open-Llama">
<img alt="GitHub last commit" src="https://img.shields.io/github/last-commit/s-JoL/Open-Llama">
Open-Llama是一个开源项目提供了一整套用于构建大型语言模型的训练流程从数据集准备到分词、预训练、指令调优以及强化学习技术 RLHF。
## **主要内容**
- **支持Transformers/HuggingFace直接调用。** 经过Instruct-tuning的CheckPoint已开源在[HuggingFace: s-JoL/Open-Llama-V1](https://huggingface.co/s-JoL/Open-Llama-V1)。
- **采用FastChat项目相同方法测评Open-Llama的效果和GPT3.5的效果对比经过测试在中文问题上可以达到GPT3.5 84%的水平。**
- **训练速度达到3620 token/s快于Llama原文中的3370 token/s达到目前sota的水平。**
``` python
pip install git+https://github.com/huggingface/transformers.git
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("s-JoL/Open-Llama-V1", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("s-JoL/Open-Llama-V1").cuda()
inputs = tokenizer('user:implement quick sort in python\nsystem:', return_tensors='pt', return_attention_mask=False)
for k, v in inputs.items():
inputs[k] = v.cuda()
pred = model.generate(**inputs, max_new_tokens=512, do_sample=True)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
模型已提交[PR](https://github.com/huggingface/transformers/pull/22795)合并至Transformers main分支。
我们完成了300B token的预训练总共训练80 K stepGlobal Batch Size和Llama中一致为4M。
## **更新**
**[2023.4.28] Release v2.0**
本次更新主要包含以下几个方面相对于v1版本提升有效训练速度**50%**其中pad从**30%**减少至**5%**,训练速度从**3200token/s**提升至**3600token/s**。0.95 * 3600/(0.7 * 3200)=1.527
1. 使用HuggingFace的datasets库进行数据读取具体流程如下
1. 使用transform函数将不同数据集的数据统一格式为{'text': 'xxx'}
2. 使用Tokenizer进行分词
3. 对长序列进行采样,目前提供三种模式,分别是:截断/采样(参考[Gopher论文](https://arxiv.org/abs/2112.11446)/切分
4. 可选对来自不同doc的文本进行拼接。减少了数据中的pad加速训练在v1版本中pad占比为**30%**使用拼接后pad占比降低为**5%**。
2. 加入Trainer对于预训练和指令微调都可以复用见solver/trainer.py
3. 统一预训练和指令微调训练入口为train_lm.py
4. 提供更方便的配置可见configs/pretrain_config.yaml
5. 提供基于其他预训练模型补充词表,继续预训练功能
6. 支持从中断点继续训练,包括加载优化器参数/学习率和跳过重复数据
[2023.4.16] Release v1.0
## **特性**
### 易用性
我们认为易用性是构建大型语言模型时最重要的特性之一。为了使 Open-LLAMA 更加易于使用,我们特别注重了以下几点:
- **最简实现**:我们采用了最简单的实现方式,降低了入门的门槛,让初学者也能轻松上手。
- **流程完整**:我们发布了从数据集构建到训练的完整代码,使得构建一个大语言模型的每一步流程都清晰可见。
### 高性能
- **Fused CUDA kernel**:使用[xformers](https://github.com/facebookresearch/xformers)中提供的 fused CUDA kernel 可以将多个操作融合在一起,减少了 GPU 和 CPU 之间的数据传输,从而提高了训练效率。
- **并行化训练**:我们使用[Accelerate](https://huggingface.co/docs/accelerate/index)库支持在多个 GPU 上进行并行化训练,以加快训练速度。
对于7B模型使用Transformers中Pytorch原生版本的Llama模型训练训练速度为**1378 token/s/gpu**,使用本代码库训练速度达到**3626 token/s/gpu**,超过[Llama原文](https://arxiv.org/pdf/2302.13971.pdf)中的**3370 token/s/gpu**。
如果使用500B token进行预训练需要训练38300 GPU时。按照Google Cloud上A100-80G Spot的价格计算8卡每小时价格为12.6美元则总价格为60300美元。
### 通用性
- **多语言支持**:我们支持多种语言的语料库,包括英语、中文、日语等多种语言,让用户可以根据自己的需求进行选择。
- **领域通用性**:我们希望模型不仅能在日常问题上能产生帮助,同时希望在专业领域如科学、法律等也能帮助人类。
- **和世界交互**希望通过加入RL使得模型具备和世界交互的能力
## **要求**
- Python 3.7 或更高版本
- PyTorch 1.13
- 特殊版本的[Transformers库](https://github.com/Bayes-Song/transformers)
- [Accelerate库](https://huggingface.co/docs/accelerate/index)
- CUDA 11.6 或更高版本(用于 GPU 加速)
- 硬件配置:目前使用(64 CPU, 1000G Memory, 8xA100-80G) x N有个比较神奇的现象当使用更多cpu时反而会慢一点猜测这和dataloader的多进程有一定关系。
## **入门指南**
### 安装
pip install -r requirements.txt
### 数据集准备
目前给出了智源开源的悟道数据集和EleutherAI开源的the pile数据集。数据集下载和处理代码在data目录下。
bash data/download_the_pile.sh
bash data/download_wudao.sh
其中the pile数据集包含210607728行json line悟道数据集包含59132213行json line。
{'id': 1, 'dataType': '百科', 'title': 'some title', 'content': 'some content'}
The Pile
{'text': 'some text', 'meta': {'pile_set_name': 'Github'}}
验证数据完整性可见 [issue](https://github.com/s-JoL/Open-Llama/issues/5)
### 相关工具
python3 utils/train_tokenizer.py
在configs中提供了只使用wudao数据集训练的4w词表的分词模型 4w_cn_vocab_wudao15.model
python3 utils/merge_tokenizer.py
根据META官方的分词模型和上面的4w中文合并为中英文双语的分词模型 llama_tokenizer_extended.model
python3 utils/convert_ckpt.py
### 数据读取
1. 使用transform函数将不同数据集的数据统一格式为{'text': 'xxx'}
2. 使用Tokenizer进行分词
3. 对长序列进行采样,目前提供三种模式,分别是:截断/采样参考Gopher论文/切分
4. 可选对来自不同doc的文本进行拼接。减少了数据中的pad加速训练在v1版本中pad占比为30%使用拼接后pad占比降低为5%。
python3 dataset/dataset.py
### 模型结构
我们基于Transformers库中的[Llama](https://github.com/facebookresearch/llama)参考论文原文中的2.4 Efficient implementation一节进行了修改
Self Attention的计算这对于性能有明显的提升提升大约30%。
同时我们还参考了[Bloom](https://huggingface.co/bigscience/bloom)对于Token Embedding引入了Stable Embedding以更好的稳定训练。
最后我们参考[PALM](https://arxiv.org/abs/2204.02311)使用了Shared Input-Output Embeddings。
### 预训练
accelerate launch --config_file configs/default_config.yaml train_lm.py --config configs/pretrain_config.yaml
我们使用[Wandb](https://wandb.ai/)进行训练的可视化,需要自行修改环境变量 WANDB_API_KEY 。
其中我们使用了DeepSpeed stage1以减少显存占用。accelerate相关配置可见configs/default_config.yaml。
| max_length | batch_size | learning_rate | weight_decay | params | dimension | n heads | n layer | vocab_size |
| 2048 | 2 | 2e-4 | 1e-1 | 7.03B | 4096 | 32 | 32 | 68762 |
Layer (type:depth-idx) Output Shape Param #
OpenLlamaForCausalLM [1, 32, 64, 128] --
├─OpenLlamaModel: 1-1 [1, 32, 64, 128] --
│ └─Embedding: 2-1 [1, 64, 4096] 281,649,152
│ └─ModuleList: 2-2 -- --
│ │ └─OpenLlamaDecoderLayer: 3x32 [1, 64, 4096] 202,383,360
│ └─OpenLlamaRMSNorm: 2-3 [1, 64, 4096] 4,096
├─Linear: 1-2 [1, 64, 68762] 281,649,152
Total params: 7,039,569,920
Trainable params: 7,039,569,920
Non-trainable params: 0
Total mult-adds (G): 7.04
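The parameter counts in the summary above can be reproduced with a few lines of arithmetic. The sketch below assumes the usual Llama layer layout: four bias-free attention projections, a three-matrix MLP, and two RMSNorm weights per layer. The MLP width of 11008 is an assumption taken from the 7B Llama configuration; it does not appear in the table above.

```python
vocab_size = 68762
hidden = 4096
n_layers = 32
intermediate = 11008  # assumption: MLP width of the 7B Llama configuration

embedding = vocab_size * hidden    # token embedding matrix
attention = 4 * hidden * hidden    # q, k, v and output projections, no bias
mlp = 3 * hidden * intermediate    # gate, up and down projections, no bias
norms = 2 * hidden                 # two RMSNorm weight vectors per layer
per_layer = attention + mlp + norms

final_norm = hidden
lm_head = vocab_size * hidden      # listed separately in the summary

total = embedding + n_layers * per_layer + final_norm + lm_head

print(f"embedding: {embedding:,}")
print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Under these assumptions the three printed values agree with the Embedding, decoder layer, and total rows of the summary.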
### Instruction-Tuning
- [yizhongw/self_instruct](https://huggingface.co/datasets/yizhongw/self_instruct)
- [BelleGroup/train_0.5M_CN](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN)
- [BelleGroup/train_1M_CN](https://huggingface.co/datasets/BelleGroup/train_1M_CN)
- [BelleGroup/multiturn_chat_0.8M](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M)
- [BelleGroup/school_math_0.25M](https://huggingface.co/datasets/BelleGroup/school_math_0.25M)
- [anon8231489123/ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered)
- [Graverman/Instruct-to-Code](https://huggingface.co/datasets/Graverman/Instruct-to-Code)
user: {prompt}\nsystem: {completion}</s>
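A small helper that renders one training example in the template above might look like this. The function name `format_example` is hypothetical and only illustrates the string layout; it is not taken from the project's code.

```python
def format_example(prompt: str, completion: str, eos: str = "</s>") -> str:
    # Render one (prompt, completion) pair in the instruction-tuning template.
    return f"user: {prompt}\nsystem: {completion}{eos}"


if __name__ == "__main__":
    print(format_example("What is 1 + 1?", "2"))
```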
accelerate launch --config_file configs/default_config.yaml train_lm.py --config configs/instruct_config.yaml
### RLHF
### Server
## Performance Comparison
### Training Frameworks
| Model | n gpu | n layer | n heads | hidden size | vocab size | seq length |
|-------|-------|---------|---------|-------------|------------|------------|
| GPT2 | 2 | 6 | heads | 4096 | 250100 | 1024 |

| | HuggingFace | HuggingFace | ColossalAI | ColossalAI | ColossalAI |
|---|---|---|---|---|---|
| config | without activation ckpt, bs2 | without activation ckpt, max_bs=12 | with activation ckpt, bs2 | without activation ckpt, bs2 | without activation ckpt, max_bs=10 |
| seconds per step | 0.336, fw=0.033, bw=0.3, opt=5e-6 | 1.25 | 0.347 | 0.308, fw=0.067, bw=0.152, opt=0.088 | 1.055 |
| gpu memory | nvidia-smi 45445 | | fw+bw+opt=21053.63+22064.12+17987.52, nvidia-smi 40961 | fw+bw+opt=24684.74+21087.13+17987.52, nvidia-smi 46821 | oom after 10 steps, suspected memory leak |
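### Performance Optimization
In the earliest version we trained with DeepSpeed stage 2 and the native Llama implementation in Transformers, but the speed fell well short of what the paper reports. We then made a series of optimizations, and list the gain from each step below for reference.
The paper states that the 6.7B model was trained on 1T tokens and took 82,432 GPU hours in total, which works out to a training speed of roughly 3370 tokens/s/GPU.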
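The estimate follows directly from dividing the token count by the GPU time in seconds. A minimal check, with a helper name chosen only for this example:

```python
def tokens_per_second_per_gpu(total_tokens: float, gpu_hours: float) -> float:
    # Throughput per GPU = tokens processed / total GPU seconds.
    return total_tokens / (gpu_hours * 3600)


if __name__ == "__main__":
    speed = tokens_per_second_per_gpu(1e12, 82432)
    print(f"{speed:.0f} tokens/s/GPU")
```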
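With the optimizations below, the speed is roughly on par with the paper, tested on 20x8 A100-80G. Adding more fused operators is expected to give even better performance.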
| | V1 | V2 |
|---|---|---|
| Dataset | self implemented | datasets |
| Model | Transformers | Transformers+xformers |
| Optimizer | Pytorch Adam | Fused Adam |
| DeepSpeed | stage2 | stage1 |
| Grad Accumulation | 4 | 12 |
| Return Padding Mask | yes | no |
| Speed token/s/gpu | 1378 | 3637 |
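### Comparison with Other Open-Source Models
The table below summarizes the training performance of current open-source models; all of them were trained on A100 GPUs. Because the models differ in size and architecture, an exact comparison is difficult. As a rough estimate, speed can be taken as inversely proportional to the parameter count, which the different Llama sizes bear out. On this rough estimate, this project performs clearly better than the other projects.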
| Model | Open-Llama | LLAMA | LLAMA | LLAMA | OPT | Bloom | GLM | GPT-NEOX | CPM-ANT | CodeGeeX |
|---|---|---|---|---|---|---|---|---|---|---|
| Model size | 7.0B | 6.7B | 13B | 65B | 175B | 175B | 130B | 20B | 10B | 13B |
| Token | | 1T | 1T | 1.4T | 180B | 366B | 400B | 402B | 200B | 13.9B |
| GPU Hour | | 82,432 | 135,168 | 1,022,362 | 809,472 | 1,082,990 | 43,776 | 175,680 | 47,040 | 3,072 |
| speed token/s/gpu | 3637 | 3370 | 2055 | 380 | 61.8 | 93.9 | 105.7 | 635.6 | 1181 | 1257 |
| Dependencies | xformers | xformers | | | metaseq | Megatron-DeepSpeed | | | BMtrain | MindSpore |
| speed token*params B/s/gpu | 25728 | 22579 | 26715 | 24700 | 10815 | 16432 | 13741 | 12712 | 11810 | 16341 |
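The last row of the table is the product of model size (in billions of parameters) and per-GPU speed. If speed is inversely proportional to parameter count, this product should stay roughly constant across model sizes. The sketch below checks that for the three Llama columns, using only the values from the table; the variable and function names are chosen for this example.

```python
# (model size in billions of parameters, tokens/s/gpu), taken from the table
llama = {"6.7B": (6.7, 3370), "13B": (13, 2055), "65B": (65, 380)}


def normalized_throughput(params_b: float, tokens_per_s: float) -> float:
    # Size-normalized throughput: tokens * billions of parameters / s / GPU.
    return params_b * tokens_per_s


if __name__ == "__main__":
    values = {name: normalized_throughput(p, s) for name, (p, s) in llama.items()}
    for name, value in values.items():
        print(f"Llama {name}: {value:.0f}")
    spread = max(values.values()) / min(values.values())
    print(f"max/min ratio: {spread:.2f}")
```

The three products differ by less than 20%, which is what makes the normalized figure usable as a rough cross-model comparison.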
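## Future Plans
1. Add the RLHF code
2. Use [Triton](https://github.com/openai/triton) to add more high-performance operators and further improve performance
3. Add code for building a pre-training dataset from Common Crawl, and open-source the resulting dataset
4. Add multimodal training code
## Citation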
author={Liang Song},
<!-- Some points I had not noticed before
1. [GPT3](https://arxiv.org/pdf/2005.14165.pdf), Details of Model Training
During training we always train on sequences of the full nctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency. Sequences with multiple documents are not masked in any special way but instead documents within a sequence are delimited with a special end of text token, giving the language model the information necessary to infer that context separated by the end of text token is unrelated. This allows for efficient training without need for any special sequence-specific masking.
Sequence length A sequence length of 2048 was used for all models. Input examples are concatenated together and then split into sequences of exactly 2048 tokens, so that there are no padding tokens, but examples may be split in the middle. Input examples are differentiated from one another with a special [eod] token.
2. GPT3, Common Crawl Filtering
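High-quality text is used as positive examples and all other samples as negative examples. Documents are filtered by their predicted probability of being positive: a document is kept if np.random.pareto(α) > 1 - document_score.
The classifier is trained using logistic regression classifier with features from Spark's standard tokenizer and HashingTF.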
3. GPT3, fuzzy deduplication
we fuzzily deduplicated documents (i.e. removed documents with high overlap with other documents) within each dataset using Spark's MinHashLSH implementation with 10 hashes
4. GPT3, Test Set Contamination
5. [The pile](https://arxiv.org/pdf/2101.00027.pdf), BPB(bits per UTF-8 encoded byte)/bits per character/perplexity
BPB = (L_T / L_B) \ell / \ln(2) \\
perplexity = P(w_1, w_2, w_3, w_4, ...)^{-\frac{1}{N}} \\
bpc = -\frac{1}{T}\sum_i \log_2 P(w_i | w_1, w_2, ..., w_{i-1}) \\
2^{bpc} = (\prod_i P(w_i | w_1, w_2, ..., w_{i-1}))^{-\frac{1}{T}} = perplexity
6. The pile, diversity of the collected data
We hypothesize that this is due to the perplexity based filtering used in CC-100, where a language model is trained on Wikipedia and all data with a perplexity too high or too low is discarded. This effectively discards any data too similar to or too different from Wikipedia, which severely limits the diversity of the collected data.
7. The pile, bytes per token
Since the GPT-2 BPE tokenizer is trained on WebText, the mean bytes per token is also a very rough indicator of how syntactically different each Pile component is from WebText.
8. The pile, Deduplication
We used 10 hash functions for each Minhash and an approximate Jaccard similarity of 0.5.
9. GLM, Embedding Layer Gradient Shrink
Similar to stable embedding
word-embedding = word-embedding * \alpha + word-embedding.detach() * (1 - \alpha)
10. PALM, Training Instability
Instead, we found that a simple strategy to effectively mitigate the issue: We re-started training from a checkpoint roughly 100 steps before the spike started, and skipped roughly 200-500 data batches, which cover the batches that were seen before and during the spike. With this mitigation, the loss did not spike again at the same point. We do not believe that the spikes were caused by "bad data" per se, because we ran several ablation experiments where we took the batches of data that were surrounding the spike, and then trained on those same data batches starting from a different, earlier checkpoint. In these cases, we did not see a spike. This implies that spikes only occur due to the combination of specific data batches with a particular model parameter state.
11. [Chinchilla](https://arxiv.org/pdf/2203.15556.pdf), Optimal model scaling
20 tokens per parameter, for example 10B model should use 200B tokens to pretrain
12. [Gopher](https://arxiv.org/pdf/2112.11446.pdf), Quality Filtering
Quality Filtering (MassiveWeb only) The vast majority of text found on the web is of insufficient
quality to be useful for language model training. For example, many web pages contain primarily
automatically generated content, or text that is not intended for human consumption (such as keywords
for search-engine optimisation). Much of the web also comprises social media content, which can
variously lack context, coherence, or substance. To remove low-quality data while minimising potential
for bias, we apply a number of simple, easily understood heuristic filters: we remove any document
that does not contain between 50 and 100,000 words, or whose mean word length is outside the
range of 3 to 10 characters; we remove any document with a symbol-to-word ratio greater than 0.1
for either the hash symbol or the ellipsis; and we remove any document with more than 90% of lines
starting with a bullet point, or more than 30% ending with an ellipsis. We also require that 80%
of words in a document contain at least one alphabetic character, and apply a "stop word" filter, to
remove documents that do not contain at least two of the following English words: the, be, to, of, and,
that, have, with; this adequately deals with ostensibly English documents that contain no coherent
English text.
13. Gopher, Constructing Token Sequences

Author: LiangSong(sl12160010@gmail.com)
Date: 2023-04-06 22:30:10
LastEditors: LiangSong(sl12160010@gmail.com)
LastEditTime: 2023-05-04 22:32:05
LastEditTime: 2023-05-04 22:44:58
FilePath: /Open-Llama/chat_server.py
Contact information: sl12160010@gmail.com. Any opinions or suggestions regarding the project are welcome to be addressed to me through this email.