Open-Llama

Open-LLaMa test procedure

Download (clone) the repository and install packages

$ git clone https://git.catswords.net/gnh1201/Open-Llama
$ cd Open-Llama
$ pip install -r requirements.txt   # install any packages missing from this file as errors (e.g. ModuleNotFoundError) come up during the work

If you downloaded the modified version (from git.catswords.net), step 2 can be skipped.
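
The pip install step above leaves missing packages to be discovered through import errors. As a minimal sketch, the helper below reports which Python modules fail to import, so they can be installed in one pass. The function name `missing_modules` and the `PYTHON` override variable are illustrative and not part of the repository. Note that a module name does not always match its pip package name.

```shell
#!/bin/sh
# List which of the given Python modules cannot be imported.
# PYTHON may point at a specific interpreter (assumption: python3 by default).
missing_modules() {
    for mod in "$@"; do
        # "import X" exits non-zero when the module is not installed
        if ! "${PYTHON:-python3}" -c "import $mod" >/dev/null 2>&1; then
            echo "$mod"
        fi
    done
}
```

For example, `missing_modules torch transformers` prints the names that still need installing (these two module names are only examples, not a list taken from requirements.txt).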

Change the dataset download location and download
- Edit the commands in data/download_the_pile.sh (the original link was taken down over copyright issues). See: a24271d48e
- Download and preprocess the datasets
```
bash data/download_the_pile.sh
bash data/download_wudao.sh
```

Download and copy the LLaMA-7B model

Install git-lfs (large file support for git)

$ curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
$ sudo apt install git-lfs
$ git lfs install

Download (clone) the LLaMA-7B model

$ git clone https://huggingface.co/nyanko7/LLaMA-7B

Link the LLaMA-7B model and convert the checkpoint

$ cd ~/Open-Llama
$ mkdir -p ./data/llama_raw_ckpt
$ ln -s ~/LLaMA-7B ./data/llama_raw_ckpt/7B
$ python3 utils/convert_ckpt.py
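
A dangling symlink is an easy mistake in the link step above, for example when the model was cloned somewhere other than the home directory. The sketch below checks the link before the conversion is started. The function name `check_ckpt_link` is hypothetical, and it is only an assumption that utils/convert_ckpt.py reads the raw weights from ./data/llama_raw_ckpt/7B, as the link step suggests.

```shell
#!/bin/sh
# Check that a checkpoint link exists and resolves to a directory.
# Prints a short status line and returns non-zero on failure.
check_ckpt_link() {
    link="$1"
    if [ ! -L "$link" ]; then
        echo "not a symlink: $link"
        return 1
    fi
    if [ ! -d "$link" ]; then
        # -d follows the link, so this catches a missing target
        echo "dangling link: $link -> $(readlink "$link")"
        return 1
    fi
    echo "ok: $link -> $(readlink "$link")"
}
```

A possible use is `check_ckpt_link ./data/llama_raw_ckpt/7B && python3 utils/convert_ckpt.py`, which runs the conversion only when the link resolves.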

Data loading
- Replace the fsspec package. See: #2
```
$ pip install fsspec==2023.9.2
```
- Run data loading
```
$ python3 dataset/dataset.py
```

Install the NVIDIA driver and CUDA

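Clean up previous or incorrectly installed NVIDIA drivers and CUDA
1. If any environment variables were set manually in .bashrc, .profile, or .bash_profile under /root or /home/[username], remove them all (or comment them out)
2. Purge the existing NVIDIA driver and CUDA installation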
```
$ sudo apt-get purge 'nvidia*'
$ sudo apt-get autoremove
$ sudo apt-get autoclean
$ sudo rm -rf /usr/local/cuda*
```
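
For step 1 of the cleanup above, it can help to locate the manually set variables first. The sketch below prints every line that mentions CUDA or NVIDIA in the given startup files. The function name `find_cuda_env_lines` and the search pattern are assumptions; a variable that names neither word would not be found.

```shell
#!/bin/sh
# Print lines (with file name and line number) that mention CUDA or NVIDIA
# in the given shell startup files. Missing files are skipped silently.
find_cuda_env_lines() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        # -H: always print the file name, -n: line number, -i: ignore case
        grep -H -n -i -E 'cuda|nvidia' "$f" || true
    done
}
```

A typical call would be `find_cuda_env_lines ~/.bashrc ~/.profile ~/.bash_profile`, repeated for /root if needed. The matching lines can then be removed or commented out by hand.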

Install the NVIDIA driver

Find installable drivers

$ sudo apt install ubuntu-drivers-common
$ sudo ubuntu-drivers devices    # choose the driver with the highest version number
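
Picking the highest version from the listing above can be scripted. This is a rough sketch that assumes the listing contains package names of the form `nvidia-driver-NNN`; suffixed variants such as `-server` or `-open` are collapsed to their base name, and the function name `pick_latest_nvidia_driver` is hypothetical.

```shell
#!/bin/sh
# Read `ubuntu-drivers devices` output on stdin and print the
# nvidia-driver-NNN package with the highest version number.
pick_latest_nvidia_driver() {
    grep -o 'nvidia-driver-[0-9][0-9]*' \
        | sort -t- -k3,3n -u \
        | tail -n 1
}
```

It would be used as `sudo ubuntu-drivers devices | pick_latest_nvidia_driver`. The highest number is not always the one the tool marks as recommended, so the result is worth comparing against the full listing.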

Install the identified driver

$ sudo apt install nvidia-driver-535

Install CUDA

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb
$ sudo apt-get update
$ sudo apt-get -y install cuda

Reload the NVIDIA modules
1. Unload the modules
```
$ sudo rmmod nvidia_drm
$ sudo rmmod nvidia_modeset
$ sudo rmmod nvidia_uvm
$ sudo rmmod nvidia
```
2. Reload the modules (automatic) and check that the graphics card is recognized
```
$ nvidia-smi
```
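
To confirm that the unload step above actually took effect, the loaded modules can be listed before and after. The sketch below reads /proc/modules, where the first field of each line is the module name. The function name `loaded_nvidia_modules` is hypothetical, and the optional file argument exists only so the function can be tried on sample data.

```shell
#!/bin/sh
# Print the NVIDIA kernel modules that are currently loaded, one per line.
# Reads /proc/modules by default; another file can be passed as $1.
loaded_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }' "${1:-/proc/modules}"
}
```

After the rmmod commands, `loaded_nvidia_modules` should print nothing, and after `nvidia-smi` it should list the modules again. If rmmod reports that a module is in use, some process is likely still holding the GPU.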

Install Docker and build a CUDA-enabled image
- Install Docker: the Docker installation script makes for a quick install.
```
$ sudo -s
# cd /opt
# curl -fsSL https://get.docker.com -o get-docker.sh
# sh get-docker.sh
```
- Install NVIDIA GPU support for Docker (nvidia-container-toolkit)
  1. Add the APT package information
```
$ sudo -s
# cd /etc/apt/sources.list.d
# wget https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list
# apt update
# apt install nvidia-container-toolkit
# systemctl restart docker
```
  If an error occurs, see: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
- Build an image with CUDA + cuDNN support
  1. Review the Dockerfile and docker-compose.yml written for this installation (this step mainly resolves the problem discussed in #3; see those files for details.)
  2. Build the image: docker compose up --build
  3. When the CUDA welcome message appears, press CTRL + C to exit the console
- Start and enter the container
  1. Start the container: docker compose up -d
  2. Enter the container: docker exec -it open-llama-container bash

Run pretraining

After entering the container, run the pretraining command as follows. Note: the --num_processes value must be adjusted to the number of graphics cards.

$ accelerate launch --num_processes=1 --config_file configs/accelerate_configs/ds_stage1.yaml train_lm.py --train_config configs/pretrain_config.yaml --model_config configs/model_configs/7B.json
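
Since --num_processes has to match the number of graphics cards, the value can be derived from `nvidia-smi -L`, which prints one `GPU N: ...` line per device. The following is a sketch only: the function names `count_gpus` and `pretrain_command` are hypothetical, and the command string is copied from the line above with just the process count substituted.

```shell
#!/bin/sh
# Count GPUs from `nvidia-smi -L` output on stdin (one "GPU N: ..." line each).
count_gpus() {
    # grep -c exits non-zero on zero matches but still prints 0
    grep -c '^GPU [0-9]' || true
}

# Print the pretraining command with --num_processes set to $1.
pretrain_command() {
    echo "accelerate launch --num_processes=$1 --config_file configs/accelerate_configs/ds_stage1.yaml train_lm.py --train_config configs/pretrain_config.yaml --model_config configs/model_configs/7B.json"
}
```

Inside the container this could be used as `pretrain_command "$(nvidia-smi -L | count_gpus)"`, which prints the command for review before it is run.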

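Issue: the given model is too large for the GPU's available memory. Memory-related options (batch size, fragmentation, etc.) were adjusted, but it appears this will not work unless the size of the model itself is reduced.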

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.11 GiB (GPU 0; 23.64 GiB total capacity; 13.15 GiB already allocated; 9.92 GiB free; 13.15 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
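
A back-of-the-envelope estimate supports the conclusion that tuning alone will not be enough. The sketch below assumes mixed-precision training with Adam, using the common accounting of 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer states, with only the optimizer states split across GPUs (as in ZeRO stage 1, which the `ds_stage1.yaml` name suggests but which is not confirmed here). Activations and temporary buffers are ignored, the 7 billion parameter count is taken from the "7B" name, and the function name `estimate_model_state_gib` is hypothetical.

```shell
#!/bin/sh
# Rough per-GPU memory (GiB) for model states only:
#   fp16 weights (2 B) + fp16 gradients (2 B) + optimizer states (12 B / GPUs).
# Activations and buffers are NOT included.
# $1 = number of parameters, $2 = number of GPUs
estimate_model_state_gib() {
    awk -v p="$1" -v n="$2" 'BEGIN {
        bytes_per_param = 2 + 2 + 12 / n
        printf "%.1f\n", p * bytes_per_param / (1024 * 1024 * 1024)
    }'
}
```

Under these assumptions, `estimate_model_state_gib 7000000000 1` gives roughly 104 GiB for a single GPU, far above the 23.64 GiB reported in the error message.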