A Brief Introduction to Dask Cluster Setup (Part 1)
Introduction
Dask essentially consists of two parts: dynamic task scheduling with cluster management, and high-level DataFrame-style API modules, roughly analogous to Spark on the scheduling side and pandas on the API side. Dask implements distributed scheduling internally, so users do not have to write complex scheduling logic or programs themselves; distributed computation is achieved with a few simple calls, and some models can be run in parallel (for example distributed algorithms such as xgboost, LR, sklearn, etc.). Dask focuses on the data-science domain: its API is very close to pandas, but not fully compatible with it.
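To make the pandas comparison concrete, here is a minimal sketch of the DataFrame API (the file name data.csv and the columns key and value are placeholders used purely for illustration):

```python
import dask.dataframe as dd

# Lazily read a CSV; this only builds a task graph, no data is loaded yet.
# "data.csv", "key" and "value" are hypothetical names for illustration.
df = dd.read_csv("data.csv")

# pandas-style syntax; still lazy at this point
result = df.groupby("key")["value"].mean()

# compute() triggers the actual (parallel) execution
print(result.compute())
```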
Cluster setup:
A Dask cluster involves several roles: client, scheduler, and worker (see the sketch after this list).
- client: handles the interaction between user code and the cluster
- scheduler: the master node (the cluster's registration center); it manages tasks submitted by clients and dispatches them to different worker nodes according to different policies
- worker: a worker node, managed by the scheduler, responsible for the actual data computation
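As a rough illustration of how the three roles cooperate, the sketch below connects a client to a scheduler and submits a small task, which the scheduler then dispatches to a worker. The address tcp://192.168.1.21:8786 is assumed to match the scheduler started later in this article; substitute your own.

```python
from dask.distributed import Client

# Connect the client to the scheduler (address assumed; see the setup below)
client = Client("tcp://192.168.1.21:8786")

def square(x):
    return x * x

# submit() hands the task to the scheduler, which assigns it to a worker;
# result() pulls the computed value back to the client
future = client.submit(square, 7)
print(future.result())  # 49
```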
1. Master node (scheduler):
- scheduler: default port 8786
a. Dependencies: dask, distributed
b. Install: pip install dask distributed
c. Start: dask-scheduler
distributed.scheduler - INFO - -----------------------------------------------
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://192.168.1.21:8786
distributed.scheduler - INFO - dashboard at: :8787
- web UI: default port 8787
a. When you open the web UI, it prompts you to install an extra dependency (bokeh)
b. Install: pip install "bokeh>=0.13.0" (quote the requirement so the shell does not treat > as a redirect)
c. Dashboard appearance: (screenshot omitted)
2. Worker node (worker):
a. Dependencies: dask, distributed
b. Install: pip install dask distributed
c. Start: using 192.168.1.22 as an example; 192.168.1.23 is set up the same way
> dask-worker 192.168.1.21:8786
distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.1.22:36803'
distributed.worker - INFO - Start worker at: tcp://192.168.1.22:37089
distributed.worker - INFO - Listening to: tcp://192.168.1.22:37089
distributed.worker - INFO - dashboard at: 192.168.1.22:36988
distributed.worker - INFO - Waiting to connect to: tcp://192.168.1.21:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 24
distributed.worker - INFO - Memory: 33.52 GB
distributed.worker - INFO - Local Directory: /home/binger/dask-server/dask-worker-space/worker-ntrdwzqp
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://192.168.1.21:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
The master node's log changes accordingly:
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://192.168.1.21:8786
distributed.scheduler - INFO - dashboard at: :8787
distributed.scheduler - INFO - Register worker <Worker 'tcp://192.168.1.22:37089', name: tcp://192.168.1.22:37089, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://192.168.1.22:37089
distributed.core - INFO - Starting established connection
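Once workers show up in the scheduler log, you can double-check the cluster from any machine with a small client-side script (a minimal sketch, assuming the scheduler address used above):

```python
from dask.distributed import Client

client = Client("tcp://192.168.1.21:8786")  # scheduler address from above

# scheduler_info() describes the scheduler and every registered worker
info = client.scheduler_info()
print(len(info["workers"]), "worker(s) registered:")
for addr in info["workers"]:
    print(" ", addr)

# run() executes a function on every worker as a simple liveness check
print(client.run(lambda: "ok"))
```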
3. dask-scheduler fails to start: ValueError: 'default' must be a list when 'multiple' is true.
Traceback (most recent call last):
File "D:\Program Files\Python36\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "D:\Program Files\Python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "E:\workspace\ceshi\venv\Scripts\dask-scheduler.exe\__main__.py", line 4, in <module>
File "e:\workspace\ceshi\venv\lib\site-packages\distributed\cli\dask_scheduler.py", line 122, in <module>
@click.version_option()
File "e:\workspace\ceshi\venv\lib\site-packages\click\decorators.py", line 247, in decorator
_param_memo(f, OptionClass(param_decls, **option_attrs))
File "e:\workspace\ceshi\venv\lib\site-packages\click\core.py", line 2465, in __init__
super().__init__(param_decls, type=type, multiple=multiple, **attrs)
File "e:\workspace\ceshi\venv\lib\site-packages\click\core.py", line 2101, in __init__
) from None
ValueError: 'default' must be a list when 'multiple' is true.
Solution: pin click to a version below 8.0:
pip install "click>=7,<8"
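After reinstalling, a quick check confirms the downgrade took effect (the exact 7.x version printed will differ by environment):

```python
import click

# Expect a 7.x version such as 7.1.2; anything >= 8 means the downgrade failed
print(click.__version__)
```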

