A summary of problems encountered in Ceph operations
1. Replacing a failed disk
1.1 Find the OSD id of the failed disk
ceph osd tree
1.2 Remove the failed OSD
ceph osd out osd.60
ceph osd crush remove osd.60
ceph auth del osd.60
ceph osd rm osd.60
# ceph osd destroy 60 --yes-i-really-mean-it # wipe the data and metadata on the disk
# ceph osd purge 60 --yes-i-really-mean-it # fast removal; equivalent to crush remove + auth del + osd rm
# destroy keeps the osd id for reuse, while purge does not; on recent releases the simplest way to clear a failed disk is ceph osd out && ceph osd purge
ceph osd out osd.60
ceph osd purge 60 --yes-i-really-mean-it
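The out + purge sequence above can be wrapped in a small helper. This is a hypothetical sketch: `CEPH` defaults to `echo ceph`, so the commands are only printed; set `CEPH=ceph` against a healthy cluster to run them for real.

```shell
#!/bin/sh
# Dry-run wrapper for the removal sequence above.
# CEPH defaults to "echo ceph" so commands are printed, not executed.
CEPH="${CEPH:-echo ceph}"

remove_osd() {
    id="$1"
    $CEPH osd out "osd.$id"
    $CEPH osd purge "$id" --yes-i-really-mean-it
}

remove_osd 60
```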
1.3 Take the failed disk offline
systemctl stop ceph-osd@60
umount /var/lib/ceph/osd/ceph-60
# replace the disk physically...
1.4 Identify the new disk's device name
lsblk
1.5 Wipe the new disk
ceph-volume lvm zap /dev/sde --destroy
# If the replacement disk carries old LVM metadata (a reused disk), remove the device-mapper mapping first:
# dmsetup remove ceph--176f681d--9b18--4d97--ada4--077dcb507638-osd--block--0376cd87--c5c1--4a6b--ae7a--8dff088932a4
1.6 Allocate a new id and fsid, then create and start the OSD (this merges steps 1.7/1.8/1.9 below: create is equivalent to prepare + activate; the advantage of running prepare and activate separately is that new OSDs can be brought into the cluster gradually, avoiding a large data rebalance)
ceph-volume lvm create --data /dev/sde --bluestore
1.7 Prepare the LVM device and associate it with an OSD (optional)
ceph-volume lvm prepare --osd-id 60 --bluestore --data /dev/sde
# ceph-volume lvm prepare --bluestore --data /path/to/device
# ceph-volume lvm prepare --filestore --data volume_group/lv_name --journal /dev/sdh
# ceph-volume lvm prepare --filestore --data volume_group/lv_name --journal volume_group/journal_lv
1.8 Look up the OSD fsid (optional)
ceph-volume lvm list
cat /var/lib/ceph/osd/ceph-60/fsid
1.9 Discover and mount the LVM device associated with the OSD id, then start the OSD (optional)
ceph-volume lvm activate 60 78341e1b-3cdf-466f-bdec-fc5b09192e35
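The gradual prepare-then-activate flow mentioned in step 1.6 can be sketched as a dry run. `RUN` defaults to `echo` so nothing is executed; the fsid below is the illustrative value from step 1.9, and in real use it would come from `ceph-volume lvm list` after the prepare step.

```shell
#!/bin/sh
# Dry-run sketch of prepare followed by activate (see step 1.6).
RUN="${RUN:-echo}"

add_osd_gradually() {
    dev="$1"; id="$2"; fsid="$3"
    $RUN ceph-volume lvm prepare --osd-id "$id" --bluestore --data "$dev"
    # in real use, read the fsid from "ceph-volume lvm list" after prepare
    $RUN ceph-volume lvm activate "$id" "$fsid"
}

add_osd_gradually /dev/sde 60 78341e1b-3cdf-466f-bdec-fc5b09192e35
```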
1.10 Manually add the OSD to the CRUSH map
ceph osd crush add osd.60 1.63699 root=stat
ceph osd crush add osd.60 1.63699 host=stat_06
2. Adding an OSD node
scp master01:/etc/yum.repos.d/ceph_stable.repo /etc/yum.repos.d/ceph_stable.repo
scp master01:/etc/ceph/ceph.conf /etc/ceph
scp master01:/etc/ceph/ceph.client.admin.keyring /etc/ceph
scp master01:/var/lib/ceph/bootstrap-osd/ceph.keyring /var/lib/ceph/bootstrap-osd
yum install -y ceph-osd
ceph-volume lvm zap /dev/sdb --destroy
ceph-volume lvm zap /dev/sdc --destroy
ceph-volume lvm zap /dev/sdd --destroy
ceph-volume lvm zap /dev/sde --destroy
ceph-volume lvm create --data /dev/sdb --bluestore --block.db /dev/sdf --block.wal /dev/sdg --block.db-size 20G --block.wal-size 20G
ceph-volume lvm create --data /dev/sdc --bluestore --block.db /dev/sdf --block.wal /dev/sdg --block.db-size 20G --block.wal-size 20G
ceph-volume lvm create --data /dev/sdd --bluestore --block.db /dev/sdf --block.wal /dev/sdg --block.db-size 20G --block.wal-size 20G
ceph-volume lvm create --data /dev/sde --bluestore --block.db /dev/sdf --block.wal /dev/sdg --block.db-size 20G --block.wal-size 20G
# Look up the OSD ids assigned to the disks, then add them to a non-default ruleset (adjust to match your own CRUSH rules)
ceph-volume lvm list
ceph osd crush add osd.53 1.63699 root=stat
ceph osd crush add osd.53 1.63699 host=stat_06
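The per-disk zap + create commands above can be batched in a loop. A dry-run sketch: `CV` defaults to `echo ceph-volume` so the loop only prints what it would do; the db/wal devices and sizes are carried over from the commands above.

```shell
#!/bin/sh
# Batch form of the zap + create commands above (dry run by default).
CV="${CV:-echo ceph-volume}"

provision_osds() {
    for dev in "$@"; do
        $CV lvm zap "$dev" --destroy
        $CV lvm create --data "$dev" --bluestore \
            --block.db /dev/sdf --block.wal /dev/sdg \
            --block.db-size 20G --block.wal-size 20G
    done
}

provision_osds /dev/sdb /dev/sdc /dev/sdd /dev/sde
```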
3. Removing an OSD node
3.1 Stop all OSD services
systemctl stop ceph-osd@13.service
systemctl stop ceph-osd@14.service
systemctl stop ceph-osd@15.service
systemctl stop ceph-osd@16.service
3.2 Destroy all OSDs
ceph osd purge 13 --yes-i-really-mean-it
ceph osd purge 14 --yes-i-really-mean-it
ceph osd purge 15 --yes-i-really-mean-it
ceph osd purge 16 --yes-i-really-mean-it
3.3 Wipe the disks
ceph-volume lvm zap --osd-id 13 --destroy
ceph-volume lvm zap --osd-id 14 --destroy
ceph-volume lvm zap --osd-id 15 --destroy
ceph-volume lvm zap --osd-id 16 --destroy
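Steps 3.1 through 3.3 can be batched over the OSD ids. A dry-run sketch: `RUN` defaults to `echo` so the commands are printed rather than executed.

```shell
#!/bin/sh
# Steps 3.1-3.3 batched per OSD id (dry run by default).
RUN="${RUN:-echo}"

decommission_osds() {
    for id in "$@"; do
        $RUN systemctl stop "ceph-osd@$id.service"
        $RUN ceph osd purge "$id" --yes-i-really-mean-it
        $RUN ceph-volume lvm zap --osd-id "$id" --destroy
    done
}

decommission_osds 13 14 15 16
```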
3.4 Remove the node from the CRUSH map
ceph osd crush tree
ceph osd crush rm node03
3.5 Uninstall the OSD packages
yum remove -y ceph-osd ceph-common
4. Adjusting PG counts
Total PGs = ((Total_number_of_OSD * 100) / max_replication_count) / pool_count; round the result up to the nearest power of 2.
ceph osd lspools
ceph osd pool get pool1 all
ceph osd pool set pool1 pg_num 2048
ceph osd pool set pool1 pgp_num 2048
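The rule of thumb above can be computed directly in shell arithmetic. The cluster sizes in the example call are illustrative, not taken from a real deployment.

```shell
#!/bin/sh
# Total PGs = ((OSDs * 100) / replicas) / pools, rounded up to the
# next power of two. Example numbers below are illustrative.
pg_count() {
    raw=$(( $1 * 100 / $2 / $3 ))
    pg=1
    while [ "$pg" -lt "$raw" ]; do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

pg_count 54 3 1   # 54 OSDs, 3 replicas, 1 pool -> 1800 -> rounds up to 2048
```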
5. Health warning: 1 daemons have recently crashed
ceph crash ls-new
ceph crash info <crash-id>
ceph crash archive-all
6. Logical volume creation fails on a multipath device:
RuntimeError: Cannot use device (/dev/mapper/mpathh). A vg/lv path or an existing device is needed
https://tracker.ceph.com/issues/23337
systemctl reload multipathd.service
multipath -ll
# Patch the ceph-volume source file
vim /usr/lib/python2.7/site-packages/ceph_volume/util/disk.py
# ... src/ceph-volume/util/disk.py
# use lsblk first, fall back to using stat
TYPE = lsblk(dev).get('TYPE')
if TYPE:
    return TYPE == 'disk' or TYPE == 'mpath'
# re-run the ceph-volume command
ceph-volume lvm create --bluestore --data /dev/mapper/mpathh
ceph-volume lvm list
7. OSD creation fails because the disk still has a partition table:
error: GPT headers found, they must be removed on: /dev/dm-10
sgdisk --zap-all /dev/dm-10
# create again
ceph-volume lvm create --data /dev/dm-10 --bluestore
8. A new OSD that reuses an old disk's id fails because the cluster still holds the old disk's auth entry:
Error EINVAL: entity osd.90 exists but key does not match
ceph auth rm osd.90
9. Rebuilding a MON node
9.1 Check quorum status
ceph mon stat
9.2 Remove the MON node
ceph mon remove node03
# ceph mon add node03 192.168.100.103:6789 # adds a MON node
9.3 Delete the MON data directory
rm -rf /var/lib/ceph/mon/ceph-node03/
9.4 Fetch and inspect the cluster's MON map and keyring
ceph mon getmap -o /tmp/monmap
ceph auth get mon. -o /tmp/keyring
monmaptool --print /tmp/monmap
cat /tmp/keyring
9.5 Rebuild the MON data directory from the map and keyring
ceph-mon --id node03 --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
chown -R ceph:ceph /var/lib/ceph/mon/ceph-node03/
9.6 Restart the MON process
systemctl reset-failed ceph-mon@node03.service
systemctl restart ceph-mon@node03.service
systemctl status ceph-mon@node03.service
9.7 Check the node's MON status
ceph daemon mon.node03 mon_status
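The rebuild sequence from steps 9.2 through 9.6 can be collected into one helper. A dry-run sketch: `RUN` defaults to `echo` so the commands are only printed; the node name is the example from this section.

```shell
#!/bin/sh
# Steps 9.2-9.6 as one helper (dry run by default).
RUN="${RUN:-echo}"

rebuild_mon() {
    node="$1"
    $RUN ceph mon remove "$node"
    $RUN rm -rf "/var/lib/ceph/mon/ceph-$node/"
    $RUN ceph mon getmap -o /tmp/monmap
    $RUN ceph auth get mon. -o /tmp/keyring
    $RUN ceph-mon --id "$node" --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
    $RUN chown -R ceph:ceph "/var/lib/ceph/mon/ceph-$node/"
    $RUN systemctl restart "ceph-mon@$node.service"
}

rebuild_mon node03
```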
10. ceph-deploy error: [ceph_deploy][ERROR ] UnsupportedPlatform: Platform is not supported
Modify the get, _get_distro and _normalized_distro_name functions in site-packages/ceph_deploy/hosts/__init__.py
def get(hostname,
...
    distro_name, release, codename = conn.remote_module.platform_information()
    print("DEBUG: distro_name=%s, release=%s, codename=%s"
          % (distro_name, release, codename))  # debug output; %-formatting also works on Python 2
    # if not codename or not _get_distro(distro_name):
    if not _get_distro(distro_name):  # drop the codename check
        raise exc.UnsupportedPlatform(
            distro=distro_name,
            codename=codename,
            release=release)
...
def _get_distro(distro, fallback=None, use_rhceph=False):
...
    distributions = {
        'debian': debian,
        'ubuntu': debian,
        'centos': centos,
        'scientific': centos,
        'oracle': centos,
        'redhat': centos,
        'fedora': fedora,
        'openeuler': centos,
        'rocky': centos,  # add your distribution here
        'suse': suse,
        'virtuozzo': centos,
        'arch': arch,
        'alt': alt,
        'clear': clear
    }
...
def _normalized_distro_name(distro):
    distro = distro.lower()
    if distro.startswith(('redhat', 'red hat')):
        return 'redhat'
    elif distro.startswith(('scientific', 'scientific linux')):
        return 'scientific'
    elif distro.startswith('oracle'):
        return 'oracle'
    elif distro.startswith(('suse', 'opensuse', 'sles')):
        return 'suse'
    elif distro.startswith(('centos', 'euleros', 'openeuler', 'rocky')):  # add your distribution here
        return 'centos'
...