update wudao download and preprocess
parent 7dc90c2558
commit 32583a41a7
@@ -71,7 +71,7 @@ Below is a display of the model's multi-turn dialogue ability regarding code:

| Model          | DeepSpeed Stage | Offload | Activation Checkpoint | Total Tokens | GPU hours | Speed (token/s/GPU) | Batch Size | CPU Memory |
|----------------|-----------------|---------|-----------------------|--------------|-----------|---------------------|------------|------------|
| Open-Llama 7B  | 1               | False   | False                 | 173.7B       | 13412     | 3587                | 2          | 94G        |
-| Open-Llama 13B | 3               | False   | True                  | -            | -         | 1616                | 12         | 100G       |
+| Open-Llama 13B | 3               | False   | True                  | -            | -         | 1856                | 24         | 100G       |
| Open-Llama 33B | 3               | False   | True                  | -            | -         | 708                 | 12         | 100G       |
| Open-Llama 65B | 3               | True    | True                  | -            | -         | 369                 | 12         | 440G       |
| Llama 7B       | -               | -       | -                     | 1T           | 82432     | 3370                | -          | -          |
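For readers mapping the table's columns to actual settings: a minimal sketch, assuming the standard DeepSpeed JSON schema and the HuggingFace gradient-checkpointing API; this is not the repo's actual config file.

```python
# Illustrative ZeRO stage-3 config with CPU optimizer offload, loosely
# matching the Open-Llama 65B row above (Stage=3, Offload=True).
ds_config = {
    "train_micro_batch_size_per_gpu": 12,        # "Batch Size" column
    "zero_optimization": {
        "stage": 3,                              # "DeepSpeed Stage" column
        "offload_optimizer": {"device": "cpu"},  # "Offload" column
    },
    "bf16": {"enabled": True},                   # assumed precision, not from the table
}

# The "Activation Checkpoint" column corresponds to enabling gradient
# checkpointing on the HuggingFace model:
# model.gradient_checkpointing_enable()
```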
@@ -154,6 +154,8 @@ pip install -r requirements.txt

Currently provided are the Wudao dataset open-sourced by Zhiyuan (BAAI) and the Pile dataset open-sourced by EleutherAI. Dataset download and processing scripts are located in the data directory.
Because downloading the Wudao dataset requires agreeing to its terms, you may need to modify the link in download_wudao: [Wudao](https://data.baai.ac.cn/details/WuDaoCorporaText).
+
+Thanks to @skepsun's suggestion, using scidb to download the wudao dataset requires no login, and the download is more stable: https://github.com/s-JoL/Open-Llama/issues/42.

**Note that the data download may fail. It is recommended to split the download and the processing in the script into two parts, so the download can be retried multiple times; it will automatically resume from breakpoints.** (A minimal Python sketch of the resume mechanism follows.)
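Purely as an illustration of the resume-from-breakpoint behavior the note above relies on: the script itself uses wget -c / curl -C -, and this hypothetical Python helper is not part of the repo.

```python
import os
import requests  # assumed available; not a statement about the repo's dependencies

def resume_download(url: str, path: str) -> None:
    """Restart a partial download from the bytes already on disk."""
    done = os.path.getsize(path) if os.path.exists(path) else 0
    headers = {"Range": f"bytes={done}-"} if done else {}
    with requests.get(url, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        # 206 = server honored the Range header; anything else means start over.
        mode = "ab" if r.status_code == 206 else "wb"
        with open(path, mode) as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
```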

Run the following commands to download the data and perform partitioning:
@@ -72,7 +72,7 @@ print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

| Model          | DeepSpeed Stage | Offload | Activation Checkpoint | Total Tokens | GPU hours | Speed (token/s/GPU) | Batch Size | CPU Memory |
|----------------|-----------------|---------|-----------------------|--------------|-----------|---------------------|------------|------------|
| Open-Llama 7B  | 1               | False   | False                 | 173.7B       | 13412     | 3587                | 2          | 94G        |
-| Open-Llama 13B | 3               | False   | True                  | -            | -         | 1616                | 12         | 100G       |
+| Open-Llama 13B | 3               | False   | True                  | -            | -         | 1856                | 24         | 100G       |
| Open-Llama 33B | 3               | False   | True                  | -            | -         | 708                 | 12         | 100G       |
| Open-Llama 65B | 3               | True    | True                  | -            | -         | 369                 | 12         | 440G       |
| Llama 7B       | -               | -       | -                     | 1T           | 82432     | 3370                | -          | -          |
@@ -153,6 +153,8 @@ pip install -r requirements.txt

Currently provided are the WuDao dataset open-sourced by BAAI (Zhiyuan) and the Pile dataset open-sourced by EleutherAI. The dataset download and processing code is in the data directory.
Because the WuDao dataset requires agreeing to some terms before it can be downloaded, you may need to modify the link in download_wudao: [WuDao](https://data.baai.ac.cn/details/WuDaoCorporaText).
+
+Thanks to @skepsun's suggestion: downloading the wudao dataset via scidb requires no login, and the download is more stable. https://github.com/s-JoL/Open-Llama/issues/42

**Note that the data download may fail. It is recommended to run the download and the processing in the script as two separate parts, so the download can be retried several times; it resumes automatically from breakpoints.**

Run the following command to download the data and shard it:
@@ -10,10 +10,14 @@
# Copyright (c) 2023 by LiangSong(sl12160010@gmail.com), All Rights Reserved.
###
apt install unrar
-for i in {1..100}
-do
-curl -C - --retry 100 'https://dorc.baai.ac.cn/resources/data/WuDaoCorpora2.0/WuDaoCorpus2.0_base_200G.rar?AccessKeyId=AKLTNasiLRBBTcOgPqzlkPzu1w&Expires=1679127659&Signature=7jh%2FpnJyC2hAeumm9EjaeE5HN9E%3D' -o data/WuDaoCorpus2.0_base_200G.rar
-done
-unrar x data/WuDaoCorpus2.0_base_200G.rar
+wget -v -c 'https://download.scidb.cn/download?fileId=63a30383fed6a8a9e8454302&dataSetType=organization&fileName=WuDaoCorporaText-2.0-open.rar' -O data/WuDaoCorpus2.0_base_200G.rar
+
+# for i in {1..100}
+# do
+# curl -C - --retry 100 'https://dorc.baai.ac.cn/resources/data/WuDaoCorpora2.0/WuDaoCorpus2.0_base_200G.rar?AccessKeyId=AKLTNasiLRBBTcOgPqzlkPzu1w&Expires=1679127659&Signature=7jh%2FpnJyC2hAeumm9EjaeE5HN9E%3D' -o data/WuDaoCorpus2.0_base_200G.rar
+# done
+
+unrar x data/WuDaoCorpus2.0_base_200G.rar data/
mkdir data/pretrain_data
python3 data/preprocess_wudao.py
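As a rough sketch of what a sharding step like data/preprocess_wudao.py performs: the field names, file layout, and shard size below are assumptions for illustration, not the script's actual logic.

```python
# Hypothetical sharding sketch: split the extracted WuDao JSON dump into
# fixed-size JSONL shards under data/pretrain_data.
import json
from pathlib import Path

SHARD_SIZE = 16384  # documents per output shard (illustrative)

def shard_wudao(src_dir: str = "data/WuDaoCorpus2.0_base_200G",
                dst_dir: str = "data/pretrain_data") -> None:
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    buf, shard = [], 0
    for path in sorted(Path(src_dir).glob("*.json")):
        # Each WuDao file is assumed to be a JSON array of documents
        # with "title" and "content" fields.
        for doc in json.loads(path.read_text(encoding="utf-8")):
            buf.append(json.dumps(doc, ensure_ascii=False))
            if len(buf) >= SHARD_SIZE:
                (Path(dst_dir) / f"part-wudao-{shard}.jsonl").write_text(
                    "\n".join(buf) + "\n", encoding="utf-8"
                )
                buf, shard = [], shard + 1
    if buf:
        (Path(dst_dir) / f"part-wudao-{shard}.jsonl").write_text(
            "\n".join(buf) + "\n", encoding="utf-8"
        )

if __name__ == "__main__":
    shard_wudao()
```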
@@ -15,7 +15,7 @@ seaborn
sentencepiece
triton
functorch==1.13.1
-xformers
+xformers==0.0.16
gradio
peft
git+https://github.com/huggingface/transformers.git
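The pin replaces a floating xformers dependency. A quick sanity check that the pinned build is the one actually imported (xformers exposes __version__):

```python
import xformers

# requirements.txt pins 0.0.16; fail fast if the environment drifted.
assert xformers.__version__ == "0.0.16", xformers.__version__
```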
@@ -80,9 +80,13 @@ def main(argv):
        if hasattr(raw_model, "enable_input_require_grads"):
            raw_model.enable_input_require_grads()
        else:

            def make_inputs_require_grad(module, input, output):
                output.requires_grad_(True)

-            raw_model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)
+            raw_model.get_input_embeddings().register_forward_hook(
+                make_inputs_require_grad
+            )

        peft_config = LoraConfig(
            task_type=TaskType.CAUSAL_LM,
            target_modules=["q_proj", "v_proj"],
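The hunk is cut off here by the page, so the remaining LoraConfig arguments are not shown. A hedged sketch of how such a config is typically completed and applied with peft, using illustrative values rather than the commit's actual ones:

```python
from peft import LoraConfig, TaskType, get_peft_model

# Illustrative values; the commit's remaining arguments are not visible above.
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "v_proj"],  # attention projections, as in the diff
    r=8,                # LoRA rank (assumed)
    lora_alpha=32,      # LoRA scaling factor (assumed)
    lora_dropout=0.05,  # dropout on the LoRA path (assumed)
)
# model = get_peft_model(raw_model, peft_config)  # wrap the base model with adapters
```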