

* @Author: LiangSong(
* @Date: 2023-03-10 21:18:35
* @LastEditors: LiangSong(
* @LastEditTime: 2023-04-09 22:48:28
* @FilePath: /Open-Llama/
* @Description:
* Copyright (c) 2023 by LiangSong(, All Rights Reserved.
# Open-Llama
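Open-Llama is an open-source project that provides a complete training pipeline for building large language models, from dataset preparation through tokenization, pre-training, and instruction tuning to reinforcement learning from human feedback (RLHF).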
## Progress
**Using the same evaluation method as the FastChat project, we compared Open-Llama with GPT-3.5. In our tests it reaches 84% of GPT-3.5's level on Chinese questions. Detailed test results and the checkpoint will be released soon.**

We have completed pre-training on 300B tokens, for a total of 80K steps. The global batch size is 4M, the same as in Llama.
Following some published tests of Baidu's ERNIE Bot (文心一言), we also ran a quick test of our model. The original report: [百度“文心一言”测试:国内生成式 AI 什么水平?](
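As a rough estimate of the cost of reaching the results above: pre-training ran for 40K steps on 150 million pre-training samples, about 110B tokens, with a total training time of 76 hours; at Google Cloud's A100 pricing this costs about $19,152. The subsequent instruction tuning ran for 12K steps on 1.6M samples, with a total training time of 3.4 hours, costing about $342. Training such a model from scratch therefore costs less than $20,000 in total.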
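## **Features**
### Ease of Use
We believe ease of use is one of the most important features when building large language models. To make Open-Llama easier to use, we paid particular attention to the following:
- **Minimal implementation**: we use the simplest possible implementation, which lowers the barrier to entry so that beginners can get started easily.
- **Complete pipeline**: we release the full code from dataset construction through training, so every step of building a large language model is visible.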
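### High Performance
- **Fused CUDA kernel**: the fused CUDA kernels provided by [xformers]( fuse multiple operations together, reducing data transfer between the GPU and CPU and improving training efficiency.
- **Parallel training**: we use the [Accelerate]( library to support parallel training across multiple GPUs, which speeds up training.

For the 7B model, training with the native PyTorch Llama implementation in Transformers runs at 1378 token/s/gpu. With this codebase the training speed reaches 3290 token/s/gpu, close to the 3370 token/s/gpu reported in the [Llama paper](.

Pre-training on 500B tokens would require about 43,000 GPU hours. At the Google Cloud price for A100-80G Spot instances, $12.6 per hour for 8 GPUs, the total cost is $67,725.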
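
The cost estimate above is plain arithmetic and can be checked with a short sketch. The function name below is hypothetical; the inputs (3290 token/s/gpu, 43,000 GPU hours, $12.6 per hour for an 8-GPU node) are the figures quoted in this section.

```python
def training_cost_usd(gpu_hours: float, gpus_per_node: int, node_price_per_hour: float) -> float:
    """Convert GPU hours into dollars, given the hourly price of one multi-GPU node."""
    node_hours = gpu_hours / gpus_per_node
    return node_hours * node_price_per_hour


# GPU hours needed for 500B tokens at 3290 token/s/gpu, before rounding up to 43,000.
raw_gpu_hours = 500e9 / 3290 / 3600

# Cost for the rounded 43,000 GPU hours on 8-GPU nodes at $12.6 per node-hour.
cost = training_cost_usd(43_000, 8, 12.6)

print(f"raw GPU hours: {raw_gpu_hours:.0f}, cost: ${cost:,.0f}")
```

The raw figure comes out slightly above 42,000 GPU hours, which the text rounds up to 43,000 before pricing.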
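### Generality
- **Multilingual support**: we support corpora in multiple languages, including English, Chinese, and Japanese, so users can choose according to their needs.
- **Domain generality**: we hope the model can help not only with everyday questions but also in professional domains such as science and law.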
## **Requirements**
- Python 3.7 or higher
- PyTorch 1.13
- A special version of the [Transformers library](
- [Accelerate library](
- CUDA 11.6 or higher (for GPU acceleration; tested with CUDA 11.7)
## **Getting Started**
### Installation
pip install -r requirements.txt
### Dataset Preparation
We currently provide the WuDao dataset open-sourced by BAAI (智源) and The Pile dataset open-sourced by EleutherAI. The code for downloading and processing the datasets is in the data directory.
bash data/
bash data/
The Pile dataset contains 210,607,728 JSON lines, and the WuDao dataset contains 59,132,213 JSON lines.
WuDao
{'id': 1, 'dataType': '百科', 'title': 'some title', 'content': 'some content'}
The Pile
{'text': 'some text', 'meta': {'pile_set_name': 'Github'}}
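
The two record layouts shown above differ in which field carries the text. The sketch below shows one way a loader might reduce both to a single string. It is an illustration only: `record_to_text` is a hypothetical helper, and joining `title` and `content` with a newline is an assumption, not necessarily what the scripts in the data directory do.

```python
import json


def record_to_text(record: dict) -> str:
    """Return the training text of one record from either dataset."""
    if "content" in record:  # WuDao-style record: title plus content
        title = record.get("title", "")
        return f"{title}\n{record['content']}" if title else record["content"]
    return record["text"]  # The Pile-style record: a single text field


# One JSON line per dataset, mirroring the sample records above.
wudao_line = json.dumps({"id": 1, "dataType": "百科", "title": "some title", "content": "some content"})
pile_line = json.dumps({"text": "some text", "meta": {"pile_set_name": "Github"}})

texts = [record_to_text(json.loads(line)) for line in (wudao_line, pile_line)]
print(texts)
```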
### Data Loading
python3 dataset/
python3 dataset/
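### Model Structure
Starting from the [Llama]( implementation in the Transformers library, we made modifications following Section 2.4, "Efficient implementation", of the original paper. The modification concerns the computation of Self Attention, which gives a clear performance improvement of about 30%.
We also followed [Bloom]( and introduced Stable Embedding for the token embedding to better stabilize training.
Finally, following [PALM](, we use Shared Input-Output Embeddings.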
### Pre-training
accelerate launch --config_file configs/default_config.yaml
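We use [Wandb]( to visualize training; you need to set the WANDB_API_KEY environment variable yourself.

We use DeepSpeed stage 1 to reduce GPU memory usage. The Accelerate configuration can be found in configs/default_config.yaml.

The training hyperparameters can be found in configs/train_config.py. We currently train a 7B Llama model with a 100,000-token vocabulary; the configuration is as follows.

| max_length | batch_size | learning_rate | weight_decay | params | dimension | n heads | n layer | vocab_size |
|------------|------------|---------------|--------------|--------|-----------|---------|---------|------------|
| 1024 | 2 | 2e-4 | 1e-1 | 6.88B | 4096 | 32 | 32 | 100000 |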
Layer (type:depth-idx) Output Shape Param #
LlamaForCausalLM [1, 64, 32, 128] --
├─LlamaModel: 1-1 [1, 64, 32, 128] --
│ └─Embedding: 2-1 [1, 64, 4096] 409,600,000
│ └─LayerNorm: 2-2 [1, 64, 4096] 8,192
│ └─ModuleList: 2-3 -- --
│ │ └─LlamaDecoderLayer: x32 [1, 64, 4096] 202,383,360 x 32
│ └─LlamaRMSNorm: 2-4 [1, 64, 4096] 4,096
Total params: 6,885,879,808
Trainable params: 6,885,879,808
Non-trainable params: 0
Total mult-adds (G): 6.89
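
The totals in the summary above can be reproduced by hand. The sketch below assumes the standard Llama 7B feed-forward width of 11,008, which is not stated in this document, and bias-free linear layers. The output projection adds no parameters because it shares its weights with the token embedding.

```python
hidden = 4096
n_layers = 32
vocab = 100_000
intermediate = 11_008  # assumption: standard Llama 7B feed-forward width

embedding = vocab * hidden        # token embedding, shared with the output layer
embed_layernorm = 2 * hidden      # LayerNorm weight and bias after the embedding
attention = 4 * hidden * hidden   # q, k, v and output projections, no bias
mlp = 3 * hidden * intermediate   # gate, up and down projections, no bias
layer_norms = 2 * hidden          # two RMSNorm weights per decoder layer
decoder_layer = attention + mlp + layer_norms
final_norm = hidden               # final RMSNorm weight

total = embedding + embed_layernorm + n_layers * decoder_layer + final_norm
print(f"per decoder layer: {decoder_layer:,}")
print(f"total parameters:  {total:,}")
```

Under these assumptions both the per-layer count and the total agree with the figures in the summary.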
### Instruction-Tuning
- [yizhongw/self_instruct](
- [BelleGroup/train_0.5M_CN](
- [BelleGroup/train_1M_CN](
- [BelleGroup/multiturn_chat_0.8M](
- [BelleGroup/school_math_0.25M](
- [RyokoAI/ShareGPT52K](
- [Graverman/Instruct-to-Code](
user: {prompt}\nsystem: {completion}</s>
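
As an illustration of the template above, the following sketch formats one prompt/completion pair. The function name is hypothetical, and it assumes that `\n` in the template denotes a real newline and that `</s>` is the end-of-sequence marker appended as plain text.

```python
def format_example(prompt: str, completion: str) -> str:
    """Fill the instruction-tuning template with one prompt/completion pair."""
    return f"user: {prompt}\nsystem: {completion}</s>"


sample = format_example("What is 1 + 1?", "2")
print(sample)
```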
accelerate launch --config_file configs/default_config.yaml
### RLHF
### Server
## Performance Comparison
### Training Framework
| Model | n gpu | n layer | n heads | hidden size | vocab size | seq length |
|-------|-------|---------|---------|-------------|------------|------------|
| GPT2 | 2 | 6 | heads | 4096 | 250100 | 1024 |

| | HuggingFace | HuggingFace | ColossalAI | ColossalAI | ColossalAI |
|---|---|---|---|---|---|
| config | without activation ckpt, bs2 | without activation ckpt, max_bs=12 | with activation ckpt, bs2 | without activation ckpt, bs2 | without activation ckpt, max_bs=10 |
| seconds per step | 0.336, fw=0.033, bw=0.3, opt=5e-6 | 1.25 | 0.347 | 0.308, fw=0.067, bw=0.152, opt=0.088 | 1.055 |
| gpu memory | nvidia-smi 45445 | | fw+bw+opt=21053.63+22064.12+17987.52, nvidia-smi 40961 | fw+bw+opt=24684.74+21087.13+17987.52, nvidia-smi 46821 | oom after 10 steps, suspected memory leak |
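### Performance Optimization
In the earliest version we trained with DeepSpeed stage 2 and the native Llama implementation in Transformers, but the speed was far below what the paper reports. We therefore carried out a series of optimizations; the performance gain from each step is listed below for reference.

The paper states that the 6.7B model was trained on 1T tokens and took 82,432 GPU hours in total, from which its training speed works out to roughly 3370 token/s/gpu.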
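
The per-GPU throughput quoted from the paper follows directly from the token count and the GPU hours. A minimal sketch of that calculation, with a hypothetical helper name:

```python
def tokens_per_second_per_gpu(tokens: float, gpu_hours: float) -> float:
    """Average training throughput of one GPU, in tokens per second."""
    return tokens / (gpu_hours * 3600)


# Llama 6.7B: 1T tokens in 82,432 GPU hours.
speed = tokens_per_second_per_gpu(1e12, 82_432)
print(f"{speed:.0f} token/s/gpu")
```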
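With the optimizations below, the speed is essentially on par with the paper's; the test used 20x8 A100-80G GPUs. We expect that adding more fused operators can yield even better performance.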
| | V1 | V2 |
|---|---|---|
| Model | Transformers | Transformers+xformers |
| Optimizer | Pytorch Adam | Fused Adam |
| DeepSpeed | stage2 | stage1 |
| Grad Accumulation | 4 | 12 |
| Return Padding Mask | yes | no |
| Speed token/s/gpu | 1378 | 3290 |
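
The V2 settings also account for the 4M global batch size mentioned under Progress. The sketch below combines gradient accumulation 12 from the table, the 20x8 GPU test setup, and max_length 1024 with batch_size 2 from the pre-training configuration. It assumes that batch_size is per GPU and that the pre-training run used the same 160 GPUs, neither of which the document states explicitly.

```python
max_length = 1024        # tokens per sequence
batch_size_per_gpu = 2   # assumption: batch_size in the config is per GPU
grad_accumulation = 12   # V2 setting
n_gpus = 20 * 8          # assumption: pre-training used the 20x8 test setup

# Tokens consumed per optimizer step across all GPUs.
global_batch_tokens = max_length * batch_size_per_gpu * grad_accumulation * n_gpus

# Tokens seen after the 80K steps reported under Progress.
tokens_after_80k_steps = global_batch_tokens * 80_000

print(f"global batch: {global_batch_tokens:,} tokens")
print(f"after 80K steps: {tokens_after_80k_steps / 1e9:.0f}B tokens")
```

Under these assumptions the global batch is just under 4M tokens, and 80K steps cover a little over 300B tokens, in line with the reported totals.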
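### Performance Comparison with Other Open-Source Models
The table below summarizes the training performance of current open-source models; all of them use A100 GPUs. Because the models differ in size and architecture, an exact comparison is difficult. As a rough estimate, speed is approximately inversely proportional to parameter count, which the Llama models of different sizes confirm. On this basis, the performance of this project is clearly better than that of the other projects.
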
| Model | Open-Llama | LLAMA | LLAMA | LLAMA | OPT | Bloom | GLM | GPT-NEOX | CPM-ANT | CodeGeeX |
|---|---|---|---|---|---|---|---|---|---|---|
| Model size | 6.9B | 6.7B | 13B | 65B | 175B | 175B | 130B | 20B | 10B | 13B |
| Token | | 1T | 1T | 1.4T | 180B | 366B | 400B | 402B | 200B | 13.9B |
| GPU Hour | | 82,432 | 135,168 | 1,022,362 | 809,472 | 1,082,990 | 43,776 | 175,680 | 47,040 | 3,072 |
| speed token/s/gpu | 3290 | 3370 | 2055 | 380 | 61.8 | 93.9 | 105.7 | 635.6 | 1181 | 1257 |
| Dependencies | xformers | xformers | | | metaseq | Megatron-DeepSpeed | | | BMtrain | MindSpore |
| speed × model size (token/s/gpu × B) | 22701 | 22579 | 26715 | 24700 | 10815 | 16432 | 13741 | 12712 | 11810 | 16341 |
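
The last row of the table multiplies each model's throughput by its size in billions of parameters, which follows from the rough inverse-proportionality assumption stated above. A short sketch that recomputes a few of the entries; the dictionary keys are labels chosen here for illustration:

```python
# (speed in token/s/gpu, model size in billions of parameters), taken from the table.
models = {
    "Open-Llama": (3290, 6.9),
    "LLAMA-6.7B": (3370, 6.7),
    "LLAMA-13B": (2055, 13),
    "OPT": (61.8, 175),
}

# Size-normalized speed: throughput multiplied by parameter count in billions.
normalized = {name: round(speed * size) for name, (speed, size) in models.items()}

for name, value in normalized.items():
    print(f"{name}: {value}")
```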
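## Future Plans
1. Add more training monitoring, such as the distribution of training data categories, and add code for resuming training
2. Release the pre-trained multilingual Llama 6.9B checkpoint
3. Implement the instruction-tuning code and release the corresponding checkpoint
4. Build an online demo with Gradio
5. Use [Triton]( to add more high-performance operators and further improve performance
6. Add code for building a pre-training dataset from Common Crawl and release the resulting dataset
7. Add code for multimodal training
## Citation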
author={Liang Song},
<!-- Some points we had not noticed before
1. [GPT3](, Details of Model Training
During training we always train on sequences of the full nctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency. Sequences with multiple documents are not masked in any special way but instead documents within a sequence are delimited with a special end of text token, giving the language model the information necessary to infer that context separated by the end of text token is unrelated. This allows for efficient training without need for any special sequence-specific masking.
Sequence length A sequence length of 2048 was used for all models. Input examples are concatenated together and then split into sequences of exactly 2048 tokens, so that there are no padding tokens, but examples may be split in the middle. Input examples are differentiated from one another with a special [eod] token.
2. GPT3, Common Crawl Filtering
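High-quality text is used as positive examples and all other samples as negative examples. Documents are then filtered by their predicted probability of being positive: a document is kept if np.random.pareto(α) > 1 - document_score.
The classifier is trained using logistic regression classifier with features from Spark's standard tokenizer and HashingTF.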
3. GPT3, fuzzy deduplication
we fuzzily deduplicated documents (i.e. removed documents with high overlap with other documents) within each dataset using Spark's MinHashLSH implementation with 10 hashes
4. GPT3, Test Set Contamination
5. [The pile](, BPB(bits per UTF-8 encoded byte)/bits per character/perplexity
BPB = (L_T / L_B) \log_2(e^{\ell}) = (L_T / L_B) \ell / \ln(2) \\
perplexity = P(w_1, w_2, w_3, w_4, \ldots)^{-\frac{1}{T}} \\
bpc = -\frac{1}{T}\sum_i \log_2 P(w_i | w_1, w_2, \ldots, w_{i-1}) \\
2^{bpc} = (\prod_i P(w_i | w_1, w_2, \ldots, w_{i-1}))^{-\frac{1}{T}} = perplexity
6. The pile, diversity of the collected data
We hypothesize that this is due to the perplexity based filtering used in CC-100, where a language model is trained on Wikipedia and all data with a perplexity too high or too low is discarded. This effectively discards any data too similar to or too different from Wikipedia, which severely limits the diversity of the collected data.
7. The pile, bytes per token
Since the GPT-2 BPE tokenizer is trained on WebText, the mean bytes per token is also a very rough indicator of how syntactically different each Pile component is from WebText.
8. The pile, Deduplication
We used 10 hash functions for each Minhash and an approximate Jaccard similarity of 0.5.
9. GLM, Embedding Layer Gradient Shrink
Similar to stable embedding
word-embedding = word-embedding * \alpha + word-embedding.detach() * (1 - \alpha)
10. PALM, Training Instability
Instead, we found that a simple strategy to effectively mitigate the issue: We re-started training from a checkpoint roughly 100 steps before the spike started, and skipped roughly 200-500 data batches, which cover the batches that were seen before and during the spike. With this mitigation, the loss did not spike again at the same point. We do not believe that the spikes were caused by “bad data” per se, because we ran several ablation experiments where we took the batches of data that were surrounding the spike, and then trained on those same data batches starting from a different, earlier checkpoint. In these cases, we did not see a spike. This implies that spikes only occur due to the combination of specific data batches with a particular model parameter state.
11. [Chinchilla](, Optimal model scaling
20 tokens per parameter, for example 10B model should use 200B tokens to pretrain
12. [Gopher](, Quality Filtering
Quality Filtering (MassiveWeb only) The vast majority of text found on the web is of insufficient
quality to be useful for language model training. For example, many web pages contain primarily
automatically generated content, or text that is not intended for human consumption (such as keywords
for search-engine optimisation). Much of the web also comprises social media content, which can
variously lack context, coherence, or substance. To remove low-quality data while minimising potential
for bias, we apply a number of simple, easily understood heuristic filters: we remove any document
that does not contain between 50 and 100,000 words, or whose mean word length is outside the
range of 3 to 10 characters; we remove any document with a symbol-to-word ratio greater than 0.1
for either the hash symbol or the ellipsis; and we remove any document with more than 90% of lines
starting with a bullet point, or more than 30% ending with an ellipsis. We also require that 80%
of words in a document contain at least one alphabetic character, and apply a "stop word" filter, to
remove documents that do not contain at least two of the following English words: the, be, to, of, and,
that, have, with; this adequately deals with ostensibly English documents that contain no coherent
English text.
13. Gopher, Constructing Token Sequences