gnh1201/Open-Llama

Fork 0

The complete training code of the open-source high-performance Llama model, including the full process from pre-training to RLHF.

Go to file

LiangSong 529f063234 update readme		2023-04-02 21:32:39 +08:00
assets	update readme	2023-03-31 14:58:07 +08:00
configs	fix long seq bug	2023-03-31 10:12:28 +08:00
data	add instruction-tuning	2023-03-30 23:43:12 +08:00
dataset	fix long seq bug	2023-03-31 10:12:28 +08:00
models	reformat code with black	2023-03-27 14:34:59 +08:00
.gitignore	add high-performance Llama pre-train code	2023-03-26 23:59:53 +08:00
inctruction_tuning.py	add instruction-tuning	2023-03-30 23:43:12 +08:00
LICENSE	Initial commit	2023-03-26 22:40:04 +08:00
pretrain_llama.py	add instruction-tuning	2023-03-30 23:43:12 +08:00
README_en.md	update readme	2023-04-02 21:32:39 +08:00
README.md	update readme	2023-04-02 21:32:39 +08:00
requirements.txt	update readme	2023-03-31 14:58:07 +08:00
server.py	add server	2023-03-31 14:11:07 +08:00

README_en.md

Open-Llama

Translated by ChatGPT.

Open-Llama is an open source project that provides a complete set of training processes for building large-scale language models, from data preparation to tokenization, pre-training, instruction tuning, and reinforcement learning techniques such as RLHF.

Progress

Although the complete pre-training is not finished yet, we used the 40K-step pre-trained model for instruction tuning, which can follow simple commands. Currently, there is no ability for multi-turn dialogue.

Demo

We tested our model by referring to some tests for Wenxin Yiyuan. Original report can be found at Baidu "Wenxin Yiyan" Test: What is the level of domestic generative AI?

The results of our model are shown in the following figure, and more results are yet to be further tested. Due to domestic network problems, the use of the above Demo may result in a request loss situation. If there is no response for a long time, please refresh and try again.

We roughly estimate the cost to achieve the above results. The 40K-step pre-training used 150 million pre-training data, which is about 110B tokens. The total training time is 76 hours, and the cost is about $19,152 according to Google Cloud's A100 quotation. The Instruction-tuning training was carried out for 12k steps, using 1.6 million data, and the total training time was 3.4 hours, costing about $342. Therefore, the total cost of training such a model from scratch is less than $20,000.

Currently, the model's performance in both mathematical and code-related tasks is noticeably poor. This is partially due to the training data used, but I also believe it is due to the size of the model. However, the ability to perform logical reasoning is essential for any usable model. Therefore, future updates will focus on improving this aspect of the model's capabilities.

Features

Ease of Use

We believe that ease of use is one of the most important features when building large-scale language models. To make Open-Llama more accessible, we focus on the following:

Minimal implementation: We use the simplest implementation approach to reduce the barrier to entry and make it easy for beginners to get started.
Complete workflow: We provide complete code from data set construction to training, making each step of building a large language model clear and visible.

High Performance

Since training large language models is costly, high performance is also crucial when building large-scale language models. To achieve high-performance training, we employ the following techniques:

Fused CUDA kernel: Using fused CUDA kernels provided by xformers can fuse multiple operations together, reducing data transfer between GPU and CPU, and improving training efficiency.
Parallel training: We use the Accelerate library to support parallel training on multiple GPUs, accelerating the training process.

For 7B mode, the training speed of the Llama model using the PyTorch native version in the Transformers library is 1378 tokens/s/GPU. With our code, the training speed reaches 3290 tokens/s/GPU, which is close to the reported 3370 tokens/s/GPU in the Llama paper. If we pretrain with 500 billion tokens, it will take 43,000 GPU hours. Assuming the price of A100-80G Spot on Google Cloud is $12.6 per hour for 8 GPUs, the total cost will be $67,725. Without acceleration, the cost would be $158,744. Our method reduces the training cost by $90,019 in total. More comparison can be found in Comparison of Performance with Other Open-Source Models.

Universality

When training language models, we aim to build a universal model that can be used for different languages and fields. To achieve this, we adopt the following strategies:

Multi-language support: We support a variety of language corpora, including English, Chinese, Japanese, and other languages, allowing users to choose according to their needs.
Field universality: We hope that the model can not only help with everyday problems but also assist in professional fields such as science and law.

Requirements

Python 3.7 or higher
PyTorch 1.13
Customized Transformers library
Accelerate library
CUDA 11.6 or higher version (for GPU acceleration, tested based on CUDA 11.7)

Getting Started

Installation

Use the following command to install the required dependencies:

pip install -r requirements.txt

Dataset Preparation

Currently, we provide the Wudao dataset from ZhuiyiAI and The Pile dataset from EleutherAI. The code for downloading and processing the datasets can be found in the data directory. Please note that the Wudao dataset requires agreeing to some agreements before downloading, so you may need to modify the link in download_wudao.sh. WuDao

Use the following commands to download and shard the data:

bash data/download_the_pile.sh
bash data/download_wudao.sh

The data will be stored as small files with a maximum of 16,384 lines per file for efficient multi-processing training. The storage format is jsonl.zst compressed with zstd, resulting in a total data size of 519.5 GB and 16,466 files.

The Pile dataset contains 210,607,728 rows of JSON lines, and the Wudao dataset contains 59,132,213 rows of JSON lines.

The specific data format is as follows:

WuDao
{'id': 1, 'dataType': '百科', 'title': 'some title', 'content': 'some content'}

The Pile
{'text': 'some text', 'meta': {'pile_set_name': 'Github'}}

Data Loading

The code for loading data can be found in the dataset directory, which includes training a tokenizer using SentencePiece and constructing a DataLoader based on the tokenizer.

Train the tokenizer with the following command:

python3 dataset/train_tokenizer.py

Check the DataLoader output with the following command:

python3 dataset/pretrain_dataset.py

Model Structure

We modified the Llama model in the Transformers library based on section 2.4 "Efficient Implementation" in the original paper and introduced some optimizations from other papers. Specifically, we introduced the memory_efficient_attention operation from the xformers library by META for computing self-attention, which significantly improves performance by about 30%. Please refer to modeling_llama.py for details.

We also referred to Bloom for introducing stable embeddings for better training of token embeddings.

Finally, we referred to PALM and used shared input-output embeddings.

Pretraining

We use the Accelerate library for multi-GPU parallel training. Launch training with the following command:

accelerate launch --config_file configs/default_config.yaml pretrain_llama.py

In some cases, it may be necessary to specify the following parameters.

--main_process_ip
--main_process_port
--num_processes
--num_machines
--machine_rank

We use Wandb for training visualization and you need to modify the environment variable WANDB_API_KEY.

We use DeepSpeed stage 1 to reduce GPU memory usage. Accelerate-related configurations can be found in configs/default_config.yaml.

The training-related hyperparameters can be found in configs/train_config.py. We currently train a 7B Llama model with a vocabulary size of 100,000, and the specific configuration is as follows:

max_length	batch_size	learning_rate	weight_decay	params	dimension	n heads	n layer	vocab_size
1024	2	2e-4	1e-1	6.88B	4096	32	32	100000

=========================================================================================================
Layer (type:depth-idx)                                  Output Shape              Param #
=========================================================================================================
LlamaForCausalLM                                        [1, 64, 32, 128]          --
├─LlamaModel: 1-1                                       [1, 64, 32, 128]          --
│    └─Embedding: 2-1                                   [1, 64, 4096]             409,600,000
│    └─LayerNorm: 2-2                                   [1, 64, 4096]             8,192
│    └─ModuleList: 2-3                                  --                        --
│    │    └─LlamaDecoderLayer: x32                      [1, 64, 4096]             202,383,360 x 32
│    └─LlamaRMSNorm: 2-4                                [1, 64, 4096]             4,096
=========================================================================================================
Total params: 6,885,879,808
Trainable params: 6,885,879,808
Non-trainable params: 0
Total mult-adds (G): 6.89

Current Progress

Instruction-Tuning

We performed instruction-tuning on three currently available open-source datasets, and we plan to add more tasks and our own constructed datasets in the future.

We did some preprocessing on the raw data, the format is as follows:

user: {prompt}<s>system: {completion}</s>

The training code is similar to pre-training and can be seen in

instruction_tuning.py

The launch command is also similar to pre-training:

accelerate launch --config_file configs/default_config.yaml instruction_tuning.py

In some cases, the following parameters may need to be specified:

--main_process_ip
--main_process_port
--num_processes
--num_machines
--machine_rank

The loss during the process is as follows, basically fluctuating and not decreasing much:

RLHF

Performance Comparison

Training Framework

In terms of the training framework, we tested the HuggingFace's Accelerate library and HPC-AI's ColossalAI, and found that there was little difference in performance when running on fully utilized GPUs. Therefore, we ultimately chose the relatively simple Accelerate library as our training framework.

The test data is shown below, and the model structure used during testing is:

Model	n gpu	n layer	n heads	hidden size	vocab size	seq length
GPT2	2	6	heads	4096	250100	1024

The test results are shown below, and we can see that there is little difference in speed and memory utilization when running on fully utilized GPUs:

	HuggingFace	HuggingFace	ColossalAI	ColossalAI	ColossalAI
config	without activation ckpt, bs2	without activation ckpt, max_bs=12	with activation ckpt, bs2	without activation ckpt, bs2	without activation ckpt, max_bs=10
second pre step	0.336, fw=0.033, bw=0.3, opt=5e-6	1.25	0.347	0.308, fw=0.067, bw=0.152, opt=0.088	1.055
gpu memory	nvidia-smi 45445		fw+bw+opt=21053.63+22064.12+17987.52, nvidia-smi 40961	fw+bw+opt=24684.74+21087.13+17987.52, nvidia-smi 46821	oom after 10 steps, suspected memory leak

Performance Optimization

In the earliest version, we used DeepSpeed stage2 and the native Llama implementation in Transformers for training. However, the speed was significantly different from what was reported in the paper. Therefore, we conducted a series of optimizations and list the performance improvements for each step below.

The paper mentions that they trained the 6.7B model with 1T tokens, and the GPU utilization was 82432, so the training speed was approximately 3370 tokens/s/GPU. After implementing the following optimizations, our speed is now comparable to that reported in the paper, using 20x8 A100-80G for testing. We expect to achieve better performance by adding more fusion operators in the future.

	V1	V2
Model	Transformers	Transformers+xformers
Optimizer	Pytorch Adam	Fused Adam
DeepSpeed	stage2	stage1
Grad Accumulation	4	12
Return Padding Mask	yes	no
Speed token/s/gpu	1378	3290

Performance Comparison with Other Open-source Models

The following table summarizes the performance of current open-source models, all tested on A100 GPUs. Due to differences in model sizes and structures, it is difficult to make accurate performance comparisons. As a rough estimate, it can be assumed that speed and model parameter count are inversely proportional, as evidenced by Llama models of different sizes. Based on this rough estimate, it can be seen that the performance using our project is significantly better than other projects.

Model	Open-Llama	LLAMA	LLAMA	LLAMA	OPT	Bloom	GLM	GPT-NEOX	CPM-ANT	CodeGeeX
Model size	6.9B	6.7B	13B	65B	175B	175B	130B	20B	10B	13B
Token		1T	1T	1.4T	180B	366B	400B	402B	200B	13.9B
GPU Hour		82,432	135,168	1,022,362	809,472	1,082,990	43776	175680	47040	3072
speed token/s/gpu	3290	3370	2055	380	61.8	93.9	105.7	635.6	1181	1257
Dependencies	xformers	xformers			measeq	Megatron-DeepSpeed			BMtrain	MindSpore
speed token/s/gpu/B	22701	22579	26715	24700	10815	16432	13741	12712	11810	16341

Future Plans

Add more training monitoring, such as the distribution of training data categories, and add code for continuing training.
Realease the pre-trained checkpoint for the multi-lingual Llama 6.9B model.
Implement instruction-tuning code and open-source related checkpoints. Build an online demo using Gradio.
Use Triton to add more high-performance operators and further improve performance.
Add code for building pre-training datasets based on Common Crawl and open-source related datasets.
Add code for multi-modal training.

Citation

@misc{openllama,
  title={Open-Llama},
  author={Liang Song},
  year={2023},
  howpublished={\url{https://github.com/Bayes-Song/Open-Llama}},
}