Reorganized the config structure, added support for DeepSpeed stage 3, reduced memory usage by loading models through the Auto classes, and added support for training 65B models.

LiangSong 2023-05-06 23:37:17 +08:00
parent 5b1f6a4861
commit 4a1e7bb44b
21 changed files with 190 additions and 67 deletions
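In short, configuration is now split into three pieces: Accelerate/DeepSpeed launch configs under configs/accelerate_configs/, training YAMLs such as configs/pretrain_config.yaml, and model-architecture JSONs under configs/model_configs/. A condensed sketch of how train_lm.py now wires them together (paths and keys are the defaults visible in the diffs below; an illustrative sketch, not a drop-in script):

```python
# Condensed from the train_lm.py changes in this commit (illustrative only).
import yaml
from transformers import AutoConfig, AutoModelForCausalLM, LlamaTokenizer

# Training hyperparameters come from a YAML file under configs/.
with open("configs/pretrain_config.yaml", "r", encoding="utf-8") as fp:
    config = yaml.load(fp, Loader=yaml.FullLoader)

# Model architecture comes from a JSON file under configs/model_configs/.
model_config = AutoConfig.from_pretrained("configs/model_configs/7B.json")
tokenizer = LlamaTokenizer(
    config["data"]["tokenizer_model_path"],
    pad_token="<pad>",
    add_bos_token=False,
    add_eos_token=True,
)
model_config.vocab_size = tokenizer.vocab_size
model_config.pad_token_id = tokenizer.pad_token_id

if config["train"]["ckpt"] is not None:
    # Resuming now goes through from_pretrained rather than torch.load +
    # load_state_dict, so the checkpoint is presumably expected in Hugging
    # Face format (a directory written by save_pretrained).
    raw_model = AutoModelForCausalLM.from_pretrained(
        config["train"]["ckpt"], config=model_config
    )
else:
    raw_model = AutoModelForCausalLM.from_config(model_config)
```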

View File

@@ -2,7 +2,7 @@
* @Author: LiangSong(sl12160010@gmail.com)
* @Date: 2023-03-10 21:18:35
* @LastEditors: LiangSong(sl12160010@gmail.com)
* @LastEditTime: 2023-05-04 22:55:25
* @LastEditTime: 2023-05-06 23:33:11
* @FilePath: /Open-Llama/README.md
* @Description:
*
@@ -211,7 +211,7 @@ Finally, we referenced [PALM](https://arxiv.org/abs/2204.02311) and employed Sha
We use multi-GPU parallel training based on the Accelerate library, with the following start command:
```bash
accelerate launch --config_file configs/default_config.yaml train_lm.py --config configs/pretrain_config.yaml
accelerate launch --config_file configs/accelerate_configs/ds_stage1.yaml train_lm.py --config configs/pretrain_config.yaml
```
In some cases, you may need to specify the following parameters:
@@ -225,7 +225,7 @@ In some cases, you may need to specify the following parameters:
We use [Wandb](https://wandb.ai/) for visualizing training. You need to modify the WANDB_API_KEY environment variable yourself.
Here we use DeepSpeed stage1 to reduce memory usage. For Accelerate-related configurations, see configs/default_config.yaml.
Here we use DeepSpeed stage1 to reduce memory usage. For Accelerate-related configurations, see configs/accelerate_configs.
Training-related hyperparameters can be found in configs/pretrain_config.yaml.
@@ -279,7 +279,7 @@ user: {prompt}\nsystem: {completion}</s>
The startup command is basically the same as pre-training:
```bash
accelerate launch --config_file configs/default_config.yaml train_lm.py --config configs/instruct_config.yaml
accelerate launch --config_file configs/accelerate_configs/ds_stage1.yaml train_lm.py --config configs/instruct_config.yaml
```
In some cases, you may need to specify the following parameters:

View File

@@ -2,7 +2,7 @@
* @Author: LiangSong(sl12160010@gmail.com)
* @Date: 2023-03-10 21:18:35
* @LastEditors: LiangSong(sl12160010@gmail.com)
* @LastEditTime: 2023-05-04 22:55:32
* @LastEditTime: 2023-05-06 23:32:31
* @FilePath: /Open-Llama/README_zh.md
* @Description:
*
@@ -201,7 +201,7 @@ the Self Attention computation, which brings a significant performance improvement of about 30%.
### Pre-training
We use the Accelerate library for multi-GPU parallel training; the startup command is as follows:
```bash
accelerate launch --config_file configs/default_config.yaml train_lm.py --config configs/pretrain_config.yaml
accelerate launch --config_file configs/accelerate_configs/ds_stage1.yaml train_lm.py --config configs/pretrain_config.yaml
```
In some cases, you may need to specify the following parameters:
```
@@ -213,7 +213,7 @@ accelerate launch --config_file configs/default_config.yaml train_lm.py --config
```
We use [Wandb](https://wandb.ai/) to visualize training; you need to set the WANDB_API_KEY environment variable yourself.
Here we use DeepSpeed stage1 to reduce GPU memory usage. The Accelerate-related configuration can be found in configs/default_config.yaml.
Here we use DeepSpeed stage1 to reduce GPU memory usage. The Accelerate-related configuration can be found in configs/accelerate_configs.
Training-related hyperparameters can be found in configs/pretrain_config.yaml.
@@ -263,7 +263,7 @@ user: {prompt}\nsystem: {completion}</s>
The startup command is basically the same as for pre-training:
```bash
accelerate launch --config_file configs/default_config.yaml train_lm.py --config configs/instruct_config.yaml
accelerate launch --config_file configs/accelerate_configs/ds_stage1.yaml train_lm.py --config configs/instruct_config.yaml
```
In some cases, you may need to specify the following parameters:
```

View File

@@ -2,7 +2,7 @@
Author: LiangSong(sl12160010@gmail.com)
Date: 2023-04-06 22:30:10
LastEditors: LiangSong(sl12160010@gmail.com)
LastEditTime: 2023-05-04 22:44:58
LastEditTime: 2023-05-06 23:30:57
FilePath: /Open-Llama/chat_server.py
Description:
@@ -15,7 +15,7 @@ from transformers import OpenLlamaForCausalLM, OpenLlamaConfig, LlamaTokenizer
tokenizer = LlamaTokenizer(
"configs/10w_vocab_wudao5_pile10.model",
"configs/tokenizer_models/10w_vocab_wudao5_pile10.model",
pad_token="<pad>",
add_bos_token=False,
add_eos_token=True,
@@ -42,7 +42,7 @@ if "module" in ckpt:
raw_model.load_state_dict(ckpt)
raw_model.eval()
model = raw_model.half().cuda()
logging.warn("ready")
logging.warning("ready")
with gr.Blocks() as demo:
@@ -59,7 +59,7 @@ with gr.Blocks() as demo:
clear = gr.Button("Clear")
def user(user_message, history):
logging.warn(user_message)
logging.warning(user_message)
return "", history + [[user_message, None]]
def bot(history):
@@ -92,7 +92,7 @@ with gr.Blocks() as demo:
pred = model.generate(input_ids=context, max_new_tokens=1024, do_sample=True)
pred = pred[:, inputs_len:]
pred = tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)
logging.warn(pred)
logging.warning(pred)
bot_message = pred
history[-1][1] = bot_message
return history

View File

@@ -0,0 +1,18 @@
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 3
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
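The new Accelerate launch config above enables DeepSpeed ZeRO stage 3 with CPU offload of optimizer state and parameters. Its zero3_init_flag is what makes the switch to the Auto classes pay off for the largest models; a rough sketch of the assumed flow (assuming Accelerate propagates the ZeRO-3 settings to transformers as usual):

```python
# Rough sketch of why stage 3 + zero3_init_flag matters for the 65B config
# (assumed flow; the actual training entry point is train_lm.py below).
from accelerate import Accelerator
from transformers import AutoConfig, AutoModelForCausalLM

# When launched via `accelerate launch --config_file <the stage-3 yaml above> ...`,
# the Accelerator picks up the DeepSpeed plugin from the launcher's environment.
accelerator = Accelerator()

model_config = AutoConfig.from_pretrained("configs/model_configs/7B.json")
# With zero_stage: 3 and zero3_init_flag: true, transformers constructs the
# model inside deepspeed.zero.Init, so parameters are partitioned across
# ranks (and offloadable to CPU) as they are created, instead of the full
# model being materialized on every GPU.
model = AutoModelForCausalLM.from_config(model_config)
```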

View File

@@ -7,13 +7,8 @@ data:
  concat_multiple_sequence: True
  num_sequences: 50
  seq_length: 2048
  tokenizer_model_path: "configs/llama_tokenizer_extended.model"
model:
  initializer_range: 1.0e-2
  hidden_dropout_prob: 0.1
  attention_dropout_prob: 0.1
  use_stable_embedding: False
  shared_input_output_embedding: False
  tokenizer_model_path: "configs/tokenizer_models/llama_tokenizer_extended.model"
  split_by_shard: False
train:
  train_batch_size: 2
  # 1B token for 1 epoch, 5epoch
@@ -27,6 +22,7 @@ train:
  gradient_accumulation_steps: 1
  prefetch_factor: 100
  train_and_eval: False
  gradient_checkpointing_enable: False
  # global step
  log_interval: 50
  eval_interval: 500
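The new gradient_checkpointing_enable key added here (set to True in the other training config further down) is consumed in train_lm.py, roughly as follows:

```python
# How the new key is consumed (condensed from the train_lm.py diff below).
if config["train"].get("gradient_checkpointing_enable", False):
    # Recompute activations in the backward pass instead of storing them,
    # trading extra compute for a large reduction in activation memory.
    raw_model.gradient_checkpointing_enable()
```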

View File

@@ -0,0 +1,26 @@
{
"architectures": [
"OpenLlamaForCausalLM"
],
"attention_dropout_prob": 0.1,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_dropout_prob": 0.1,
"hidden_size": 5120,
"initializer_range": 1e-2,
"intermediate_size": 13824,
"max_position_embeddings": 2048,
"model_type": "open-llama",
"num_attention_heads": 40,
"num_hidden_layers": 40,
"pad_token_id": 32000,
"rms_norm_eps": 1e-05,
"shared_input_output_embedding": false,
"tie_word_embeddings": false,
"torch_dtype": "float32",
"use_cache": true,
"use_memorry_efficient_attention": true,
"use_stable_embedding": false,
"vocab_size": 68762
}

View File

@@ -0,0 +1,26 @@
{
"architectures": [
"OpenLlamaForCausalLM"
],
"attention_dropout_prob": 0.1,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_dropout_prob": 0.1,
"hidden_size": 6656,
"initializer_range": 1e-2,
"intermediate_size": 17920,
"max_position_embeddings": 2048,
"model_type": "open-llama",
"num_attention_heads": 52,
"num_hidden_layers": 60,
"pad_token_id": 32000,
"rms_norm_eps": 1e-05,
"shared_input_output_embedding": false,
"tie_word_embeddings": false,
"torch_dtype": "float32",
"use_cache": true,
"use_memorry_efficient_attention": true,
"use_stable_embedding": false,
"vocab_size": 68762
}

View File

@@ -0,0 +1,26 @@
{
"architectures": [
"OpenLlamaForCausalLM"
],
"attention_dropout_prob": 0.1,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_dropout_prob": 0.1,
"hidden_size": 8192,
"initializer_range": 1e-2,
"intermediate_size": 22016,
"max_position_embeddings": 2048,
"model_type": "open-llama",
"num_attention_heads": 64,
"num_hidden_layers": 80,
"pad_token_id": 32000,
"rms_norm_eps": 1e-05,
"shared_input_output_embedding": false,
"tie_word_embeddings": false,
"torch_dtype": "float32",
"use_cache": true,
"use_memorry_efficient_attention": true,
"use_stable_embedding": false,
"vocab_size": 68762
}

View File

@@ -0,0 +1,26 @@
{
"architectures": [
"OpenLlamaForCausalLM"
],
"attention_dropout_prob": 0.1,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_dropout_prob": 0.1,
"hidden_size": 4096,
"initializer_range": 1e-2,
"intermediate_size": 11008,
"max_position_embeddings": 2048,
"model_type": "open-llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"pad_token_id": 32000,
"rms_norm_eps": 1e-05,
"shared_input_output_embedding": false,
"tie_word_embeddings": false,
"torch_dtype": "float32",
"use_cache": true,
"use_memorry_efficient_attention": true,
"use_stable_embedding": false,
"vocab_size": 68762
}
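The four new JSON files match the standard LLaMA shapes (hidden sizes 4096, 5120, 6656 and 8192), i.e. the 7B, 13B, 33B and 65B variants; train_lm.py defaults to configs/model_configs/7B.json. A quick sanity check of the sizes, assuming the usual LLaMA parameter breakdown (untied input/output embeddings, SwiGLU MLP) and the 68762-token extended vocabulary:

```python
# Back-of-the-envelope parameter counts for the configs above (assumed
# LLaMA-style breakdown: embeddings + per-layer attention and SwiGLU MLP;
# norms and biases are negligible and ignored here).
def approx_params(vocab, hidden, intermediate, layers):
    embeddings = 2 * vocab * hidden          # untied input + output embeddings
    attention = 4 * hidden * hidden          # q, k, v and output projections
    mlp = 3 * hidden * intermediate          # gate, up and down projections
    return embeddings + layers * (attention + mlp)

sizes = {
    "7B":  (4096, 11008, 32),
    "13B": (5120, 13824, 40),
    "33B": (6656, 17920, 60),
    "65B": (8192, 22016, 80),
}
for name, (h, i, n_layers) in sizes.items():
    print(name, round(approx_params(68762, h, i, n_layers) / 1e9, 1), "B params")
# Prints roughly 7.0, 13.4, 33.0 and 65.9 billion parameters respectively.
```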

View File

@@ -9,13 +9,8 @@ data:
  concat_multiple_sequence: True
  num_sequences: 10
  seq_length: 2048
  tokenizer_model_path: "configs/llama_tokenizer_extended.model"
model:
  initializer_range: 1.0e-2
  hidden_dropout_prob: 0.1
  attention_dropout_prob: 0.1
  use_stable_embedding: False
  shared_input_output_embedding: False
  tokenizer_model_path: "configs/tokenizer_models/llama_tokenizer_extended.model"
  split_by_shard: False
train:
  train_batch_size: 2
  num_training_steps: 500000
@@ -28,7 +23,8 @@ train:
  train_num_workers: 16
  gradient_accumulation_steps: 12
  prefetch_factor: 100
  train_and_eval: True
  train_and_eval: False
  gradient_checkpointing_enable: True
  # global step
  log_interval: 5
  eval_interval: 500

View File

@@ -2,7 +2,7 @@
Author: LiangSong(sl12160010@gmail.com)
Date: 2023-04-24 20:05:21
LastEditors: LiangSong(sl12160010@gmail.com)
LastEditTime: 2023-05-04 09:17:21
LastEditTime: 2023-05-06 23:30:37
FilePath: /Open-Llama/dataset/dataset.py
Description:
@@ -105,7 +105,7 @@ def instruct_transform(batch):
targets = batch["message_2"][0]
text = "user:{}\nsystem:{}".format(inputs.strip(), targets.strip())
texts = [text]
# grade-school-math-instructions preprocess
# grade-school-math-instructions preprocess
elif "INSTRUCTION" in batch and "RESPONSE" in batch:
inputs = batch["INSTRUCTION"][0]
targets = batch["RESPONSE"][0]
@@ -291,7 +291,7 @@ if __name__ == "__main__":
"seq_length": 2048,
}
tokenizer = LlamaTokenizer(
"configs/llama_tokenizer_extended.model",
"configs/tokenizer_models/llama_tokenizer_extended.model",
pad_token="<pad>",
add_bos_token=False,
add_eos_token=True,

View File

@@ -2,7 +2,7 @@
Author: LiangSong(sl12160010@gmail.com)
Date: 2023-04-24 20:05:21
LastEditors: LiangSong(sl12160010@gmail.com)
LastEditTime: 2023-05-06 09:45:30
LastEditTime: 2023-05-06 23:04:14
FilePath: /Open-Llama/solver/trainer.py
Description:
@@ -26,7 +26,7 @@ class Trainer:
self.train_loader = train_loader
self.tokenizer = tokenizer
self.accelerator = accelerator
self.train_and_eval = config.get("train_and_eval", False)
self.train_and_eval = config["train"].get("train_and_eval", False)
self.gradient_accumulation_steps = config["train"].get(
"gradient_accumulation_steps", 1
)
@@ -43,7 +43,7 @@ class Trainer:
self.config["save_interval"] * accelerator.gradient_accumulation_steps
)
self.work_dir = self.config["work_dir"]
self.get_model_info()
# self.get_model_info()
if accelerator.is_main_process:
wandb.init(project=self.config["project_name"])
@@ -104,12 +104,12 @@ class Trainer:
self.accelerator.load_state(self.work_dir)
self.global_step = self.scheduler.scheduler._step_count - 1
self.global_step = self.global_step // self.accelerator.num_processes
logging.warn("Restored ckpt from {}".format(self.work_dir))
logging.warning("Restored ckpt from {}".format(self.work_dir))
except:
logging.warn("No ckpt found in {}".format(self.work_dir))
logging.warning("No ckpt found in {}".format(self.work_dir))
if self.global_step > 0:
skip_steps = self.global_step * self.gradient_accumulation_steps
logging.warn("Skiped {} steps.".format(skip_steps))
logging.warning("Skiped {} steps.".format(skip_steps))
self.train_loader_skiped = self.accelerator.skip_first_batches(
self.train_loader, num_batches=skip_steps
)

View File

@@ -2,7 +2,7 @@
Author: LiangSong(sl12160010@gmail.com)
Date: 2023-04-12 19:12:42
LastEditors: LiangSong(sl12160010@gmail.com)
LastEditTime: 2023-05-04 09:19:15
LastEditTime: 2023-05-06 23:08:42
FilePath: /Open-Llama/train_lm.py
Description:
@@ -16,17 +16,20 @@ from absl import flags
from accelerate import Accelerator
from torch.utils.data import DataLoader
from datasets.distributed import split_dataset_by_node
from transformers import OpenLlamaForCausalLM, OpenLlamaConfig, LlamaTokenizer
from transformers import AutoConfig, AutoModelForCausalLM, LlamaTokenizer
from dataset.dataset import construct_dataset
from solver.trainer import Trainer
FLAGS = flags.FLAGS
flags.DEFINE_string("config", None, "Training config path")
flags.DEFINE_string("train_config", None, "Training config path")
flags.DEFINE_string(
"model_config", "configs/model_configs/7B.json", "Model config path"
)
def main(argv):
with open(FLAGS.config, "r", encoding="utf-8") as fp:
with open(FLAGS.train_config, "r", encoding="utf-8") as fp:
config = yaml.load(fp, Loader=yaml.FullLoader)
accelerator = Accelerator(
@@ -61,26 +64,18 @@ def main(argv):
)
# smaller initializer_range makes training more stable
# add stable embedding to token embedding
raw_model = OpenLlamaForCausalLM(
OpenLlamaConfig(
vocab_size=tokenizer.vocab_size,
initializer_range=config["model"]["initializer_range"],
pad_token_id=tokenizer.pad_token_id,
rms_norm_eps=1e-5,
hidden_dropout_prob=config["model"]["hidden_dropout_prob"],
attention_dropout_prob=config["model"]["attention_dropout_prob"],
use_stable_embedding=config["model"]["use_stable_embedding"],
shared_input_output_embedding=config["model"][
"shared_input_output_embedding"
],
)
)
model_config = AutoConfig.from_pretrained(FLAGS.model_config)
model_config.vocab_size = tokenizer.vocab_size
model_config.pad_token_id = tokenizer.pad_token_id
if config["train"]["ckpt"] is not None:
ckpt = torch.load(config["train"]["ckpt"], map_location="cpu")
if "module" in ckpt:
ckpt = ckpt["module"]
raw_model.load_state_dict(ckpt)
logging.warn("Loaded ckpt from: {}".format(config["train"]["ckpt"]))
raw_model = AutoModelForCausalLM.from_pretrained(
config["train"]["ckpt"], config=model_config
)
logging.warning("Loaded ckpt from: {}".format(config["train"]["ckpt"]))
else:
raw_model = AutoModelForCausalLM.from_config(model_config)
if config["train"].get("gradient_checkpointing_enable", False):
raw_model.gradient_checkpointing_enable()
trainer = Trainer(config, raw_model, train_loader, tokenizer, accelerator)
trainer.train()

View File

@@ -1,9 +1,19 @@
"""
Author: LiangSong(sl12160010@gmail.com)
Date: 2023-04-28 19:55:13
LastEditors: LiangSong(sl12160010@gmail.com)
LastEditTime: 2023-05-06 23:30:29
FilePath: /Open-Llama/utils/convert_ckpt.py
Description:
Copyright (c) 2023 by LiangSong(sl12160010@gmail.com), All Rights Reserved.
"""
import torch
import sentencepiece as spm
sp_model = spm.SentencePieceProcessor(
model_file="configs/llama_tokenizer_extended.model"
model_file="configs/tokenizer_models/llama_tokenizer_extended.model"
)
merged_vocab_size = sp_model.vocab_size()
ckpt = torch.load("data/llama_raw_ckpt/7B/consolidated.00.pth")

View File

@@ -3,21 +3,25 @@ import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as model
raw_model = model.ModelProto()
raw_model.ParseFromString(open("configs/llama_tokenizer.model", "rb").read())
raw_model.ParseFromString(
open("configs/tokenizer_models/llama_tokenizer.model", "rb").read()
)
exist_pieces = set([p.piece for p in raw_model.pieces])
cn_model = model.ModelProto()
cn_model.ParseFromString(open("configs/4w_cn_vocab_wudao15.model", "rb").read())
cn_model.ParseFromString(
open("configs/tokenizer_models/4w_cn_vocab_wudao15.model", "rb").read()
)
for p in tqdm(cn_model.pieces, total=len(cn_model.pieces)):
if p.piece not in exist_pieces:
raw_model.pieces.append(p)
with open("configs/llama_tokenizer_extended.model", "wb") as f:
with open("configs/tokenizer_models/llama_tokenizer_extended.model", "wb") as f:
f.write(raw_model.SerializeToString())
sp_model = spm.SentencePieceProcessor(
model_file="configs/llama_tokenizer_extended.model"
model_file="configs/tokenizer_models/llama_tokenizer_extended.model"
)
print("merged vocab size: {}".format(sp_model.vocab_size()))

View File

@@ -2,7 +2,7 @@
Author: LiangSong(sl12160010@gmail.com)
Date: 2023-03-24 20:49:03
LastEditors: LiangSong(sl12160010@gmail.com)
LastEditTime: 2023-05-04 08:42:21
LastEditTime: 2023-05-06 23:34:14
FilePath: /Open-Llama/utils/train_tokenizer.py
Description:
@@ -67,7 +67,7 @@ spm.SentencePieceTrainer.train(
)
# Serialize the model as file.
with open("configs/10w_vocab_wudao5_pile10.model", "wb") as f:
with open("configs/tokenizer_models/10w_vocab_wudao5_pile10.model", "wb") as f:
f.write(model.getvalue())
# Directly load the model from serialized model.