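Environment
- OS: Ubuntu 22.04 Server
- Driver: nvidia-driver-595-server (installed via apt)
- CUDA: 12.6, installed from the official NVIDIA apt repository
- Network: static IP via Netplan, with cloud-init network configuration disabled so it cannot overwrite the settings
- Containers: Docker + nvidia-container-toolkit for GPU passthrough
I. Base system initialization
## 1. Update the system and install base dependencies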
apt update && apt full-upgrade -y
apt install linux-headers-$(uname -r) build-essential dkms gcc make -y
## 2. Disable the open-source nouveau driver
cat << EOF > /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
update-initramfs -u
reboot
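After the reboot, it can help to confirm that nouveau is really gone before installing the NVIDIA driver. A minimal sketch, assuming the standard `/proc/modules` format (module name in the first field); the helper name `module_listed` is illustrative and not part of any existing tool:

```shell
# module_listed NAME: reads /proc/modules-style text on stdin and
# succeeds (exit 0) if a module called NAME appears in the first column.
# The function name is hypothetical, chosen for this sketch only.
module_listed() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Only inspect the real module list when it is readable (Linux hosts).
if [ -r /proc/modules ]; then
    if module_listed nouveau < /proc/modules; then
        echo "nouveau is still loaded; check the blacklist file and the initramfs"
    else
        echo "nouveau is not loaded"
    fi
fi
```

If nouveau still shows up, re-running `update-initramfs -u` and rebooting is the usual first thing to try.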
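# Check whether Secure Boot is disabled
mokutil --sb-state
If Secure Boot is still enabled, reboot into the BIOS/UEFI setup and turn it off; otherwise the NVIDIA kernel module may be refused at load time.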
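The Secure Boot check can be scripted so that it gives a single word to act on. This is a sketch only: the helper name `sb_state` is hypothetical, and it assumes `mokutil --sb-state` reports either `SecureBoot enabled` or `SecureBoot disabled`, treating anything else (for example a non-EFI system) as unknown:

```shell
# sb_state: reads `mokutil --sb-state` output on stdin and prints
# "enabled", "disabled", or "unknown". Helper name is illustrative.
sb_state() {
    awk '
        /SecureBoot enabled/  { s = "enabled" }
        /SecureBoot disabled/ { s = "disabled" }
        END { print (s == "" ? "unknown" : s) }
    '
}

# Query the real firmware state only if mokutil is installed.
if command -v mokutil >/dev/null 2>&1; then
    state=$(mokutil --sb-state 2>&1 | sb_state)
    echo "Secure Boot: $state"
fi
```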
II. Install the NVIDIA 595 server driver via apt
## 1. Remove leftovers from earlier installs
apt purge "*nvidia*" "*cuda*" -y
apt autoremove -y
apt autoclean
rm -rf /var/lib/dkms/nvidia* /usr/src/nvidia-*
## 2. Install the driver
apt install nvidia-driver-595-server -y
reboot
## 3. Verify the driver
nvidia-smi
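On a multi-GPU host, the plain `nvidia-smi` table is easy to misread, so a scripted count may be useful. A minimal sketch using the `--query-gpu` and `--format` options of `nvidia-smi`; the helper name `count_gpus` is an assumption made for this example:

```shell
# count_gpus: counts non-empty lines on stdin (one CSV line per GPU).
# Helper name is hypothetical.
count_gpus() {
    awk 'NF { n++ } END { print n + 0 }'
}

# Query the driver only when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpus=$(nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader)
    printf '%s\n' "$gpus"
    echo "GPUs detected: $(printf '%s\n' "$gpus" | count_gpus)"
fi
```

Comparing the printed count against the number of cards physically installed is a quick way to spot a GPU the driver did not pick up.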
III. Configure a static IP with Netplan
## 1. Stop cloud-init from overwriting the network configuration
cat << EOF > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
network: {config: disabled}
EOF
## 2. Edit the Netplan configuration file
vi /etc/netplan/50-cloud-init.yaml
# Write the following content:
network:
  version: 2
  renderer: networkd
  ethernets:
    eno0:
      optional: true
    enp96s0f1:
      optional: true
    enp218s0f0:
      optional: true
    enp218s0f1:
      dhcp4: false
      addresses: [192.168.0.8/24]
      routes:
        - to: default
          via: 192.168.0.254
      nameservers:
        addresses: [192.168.0.254, 114.114.114.114]
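Indentation is significant in Netplan YAML, so the static-IP stanza is sketched here as a small generator that prints it with explicit two-space nesting. The function name `render_netplan` and its argument order are assumptions for illustration; the values passed in are the ones from this configuration:

```shell
# render_netplan IFACE CIDR GATEWAY DNS_LIST
# Prints a minimal static-IP Netplan document to stdout.
# Function name and argument order are hypothetical.
render_netplan() {
    cat <<EOF
network:
  version: 2
  renderer: networkd
  ethernets:
    $1:
      dhcp4: false
      addresses: [$2]
      routes:
        - to: default
          via: $3
      nameservers:
        addresses: [$4]
EOF
}

# Prints to stdout only; review the result before writing it under /etc/netplan.
render_netplan enp218s0f1 192.168.0.8/24 192.168.0.254 "192.168.0.254, 114.114.114.114"
```

The sketch covers only the interface that carries the static address; the `optional: true` entries for the unused interfaces would still need to be added by hand.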
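## 3. Apply the network configuration
# netplan try applies the config and rolls it back automatically unless confirmed
netplan try
netplan apply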
## 4. Verify the network
ip a
ip route show default
ping -c 3 baidu.com
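The default-route check can also be compared against the expected gateway automatically. A minimal sketch, assuming `ip route show default` prints a line of the form `default via <gateway> dev <iface> ...`; the helper name `default_gateway` is hypothetical:

```shell
# default_gateway: reads `ip route show default` output on stdin and
# prints the next-hop address of the first default route, if any.
default_gateway() {
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

expected_gw="192.168.0.254"   # gateway from the Netplan file
if command -v ip >/dev/null 2>&1; then
    actual_gw=$(ip route show default | default_gateway)
    if [ "$actual_gw" = "$expected_gw" ]; then
        echo "default route OK via $actual_gw"
    else
        echo "unexpected default gateway: ${actual_gw:-none}"
    fi
fi
```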
IV. Install CUDA 12.6 via apt
cd /tmp/
# 1. Add the official NVIDIA CUDA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt update
# 2. Install CUDA Toolkit 12.6 (toolkit only; it does not install or replace the driver)
apt install cuda-toolkit-12-6 -y
# 3. Set global environment variables (persistent)
echo 'export PATH=/usr/local/cuda-12.6/bin:$PATH' >> /etc/profile
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:$LD_LIBRARY_PATH' >> /etc/profile
source /etc/profile
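Appending with `>>` adds another copy of each export line every time the step is repeated. One possible alternative is to prepend a directory only when it is not already present. The helper name `path_prepend_once` is an assumption for this sketch; the lines could live in a file under `/etc/profile.d/`, which `/etc/profile` sources on Ubuntu:

```shell
# path_prepend_once DIR CURRENT
# Prints CURRENT with DIR prepended, unless DIR is already one of its
# colon-separated entries. Helper name is hypothetical.
path_prepend_once() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *)
            if [ -n "$2" ]; then
                printf '%s\n' "$1:$2"
            else
                printf '%s\n' "$1"
            fi
            ;;
    esac
}

# Same directories as the export lines in this step.
PATH=$(path_prepend_once /usr/local/cuda-12.6/bin "$PATH")
LD_LIBRARY_PATH=$(path_prepend_once /usr/local/cuda-12.6/lib64 "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

Sourcing this repeatedly leaves both variables unchanged after the first run.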
# 4. Verify the CUDA installation
nvcc -V
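To confirm that the `nvcc` on `PATH` is the 12.6 toolkit and not a leftover from another install, the release number can be pulled out of the version banner. A sketch, assuming the banner contains a line of the form `Cuda compilation tools, release X.Y, ...`; the helper name `cuda_release` is hypothetical:

```shell
# cuda_release: reads `nvcc -V` output on stdin and prints the
# major.minor release number, or nothing if no release line is found.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

expected="12.6"
if command -v nvcc >/dev/null 2>&1; then
    found=$(nvcc -V | cuda_release)
    if [ "$found" = "$expected" ]; then
        echo "nvcc reports CUDA $found"
    else
        echo "nvcc reports CUDA ${found:-unknown}, expected $expected"
    fi
fi
```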
V. Install Docker (full set) via apt
## 1. Remove old Docker versions
apt remove docker docker-engine docker.io containerd runc -y
apt autoremove -y
## 2. Install dependencies
apt install ca-certificates curl gnupg lsb-release -y
## 3. Add the official Docker apt repository
mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
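The `echo` line relies on two command substitutions, so it may be worth seeing what the resulting source entry looks like once they are expanded. A small sketch; the function name `docker_repo_line` is hypothetical, and `amd64` / `jammy` are the values expected on an x86_64 Ubuntu 22.04 host:

```shell
# docker_repo_line ARCH CODENAME
# Prints the apt source entry that the echo command writes to
# /etc/apt/sources.list.d/docker.list. Function name is illustrative.
docker_repo_line() {
    printf 'deb [arch=%s signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu %s stable\n' "$1" "$2"
}

# amd64 and jammy are assumed here; other hosts will differ.
docker_repo_line amd64 jammy
```

If `/etc/apt/sources.list.d/docker.list` does not match this shape (for example an empty codename), `apt update` will not find the Docker packages.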
## 4. Install the full set of Docker components
apt update
apt install docker-ce docker-ce-cli containerd.io docker-compose-plugin -y
## 5. Enable at boot and start
systemctl enable docker
systemctl start docker
## 6. Configure Docker registry mirrors (mainland China)
cat > /etc/docker/daemon.json <<EOF
{
  "registry-mirrors": [
    "https://docker.mirrors.ustc.edu.cn",
    "https://hub-mirror.c.163.com",
    "https://mirror.baidubce.com"
  ]
}
EOF
systemctl daemon-reload
systemctl restart docker
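A syntax error in `daemon.json` stops the Docker daemon from starting, so checking the file before a restart can save a debugging round. A minimal sketch, assuming `python3` is available on the host; the helper name `json_is_valid` is hypothetical:

```shell
# json_is_valid FILE: succeeds if FILE parses as JSON.
# Uses Python's built-in json.tool module; helper name is illustrative.
json_is_valid() {
    python3 -m json.tool "$1" >/dev/null 2>&1
}

cfg=/etc/docker/daemon.json
if [ -f "$cfg" ]; then
    if json_is_valid "$cfg"; then
        echo "$cfg parses as JSON"
    else
        echo "$cfg is not valid JSON; fix it before restarting Docker"
    fi
fi
```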
VI. Install NVIDIA Container Toolkit via apt (GPU support for Docker)
## 1. Import the signing key and repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
## 2. Install the toolkit
apt update
apt install nvidia-container-toolkit -y
## 3. Configure Docker to use the NVIDIA runtime
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
VII. Verify the environment
## 1. Check that Docker has registered the NVIDIA runtime
docker info | grep -i nvidia
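`docker info` only shows that the runtime is registered; it does not prove that a container can see the GPUs. A sketch of an end-to-end check that runs `nvidia-smi` inside a throwaway container with `--gpus all`. The image tag `ubuntu:22.04` and the helper name `has_nvidia_runtime` are assumptions for this example:

```shell
# has_nvidia_runtime: reads `docker info --format '{{json .Runtimes}}'`
# output on stdin and succeeds if a runtime named "nvidia" is listed.
# Helper name is hypothetical.
has_nvidia_runtime() {
    grep -q '"nvidia"'
}

if command -v docker >/dev/null 2>&1; then
    if docker info --format '{{json .Runtimes}}' 2>/dev/null | has_nvidia_runtime; then
        # The toolkit mounts nvidia-smi into the container, so a plain
        # Ubuntu image is enough; the tag is an assumption.
        docker run --rm --gpus all ubuntu:22.04 nvidia-smi
    else
        echo "nvidia runtime is not registered with Docker"
    fi
fi
```

The GPU list printed from inside the container should match what `nvidia-smi` shows on the host.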
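VIII. Environment checklist
- OS: Ubuntu 22.04, fully updated
- Driver: nvidia-driver-595-server, all GPUs detected
- CUDA: 12.6, installed from the official apt repository
- Network: static IP 192.168.0.8, persistent across reboots
- Containers: latest Docker with registry mirrors configured
- Capability: Docker containers can access all GPUs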