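Struggling to Deploy SeaTunnel with Docker? Here Are the Pitfalls and Fixes You Need to Know
In big data processing, data integration is the glue that holds everything else together, and SeaTunnel is one of the rising stars in this space. It is an easy-to-use, high-performance distributed data integration platform built for real-time synchronization of massive data volumes. It synchronizes tens of billions of records per day, stably and efficiently, and is already running in production at nearly 100 companies.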
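1. SeaTunnel: A Gem of Data Integration
The data integration pain points it targets
- A maze of data sources: there are hundreds of commonly used data sources, in many versions that are not always compatible with one another. New sources keep appearing, and finding a tool that supports them all, fully and quickly, is like looking for a needle in a haystack.
- Complex synchronization scenarios: data synchronization has to cover offline full sync, offline incremental sync, CDC, real-time sync, and whole-database sync, and each scenario has its own requirements.
- High resource demands: when synchronizing large numbers of small tables in real time, existing data integration tools often consume a lot of compute resources or JDBC connections, which drives up costs.
- Lack of quality control and monitoring: data is frequently lost or duplicated during integration, and without effective monitoring it is hard to see what is actually happening to the data in a job.
- Complex technology stacks: companies run many different technical components, and users end up writing a separate synchronization program for each of them.
- Difficult management and maintenance: because the underlying engines (Flink/Spark) differ, offline and real-time synchronization usually have to be developed and managed separately, which makes operations harder.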
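SeaTunnel's standout features
- A rich, extensible connector ecosystem: SeaTunnel provides a Connector API that does not depend on any specific execution engine. Connectors (Source, Transform, Sink) built on this API can run on several engines; SeaTunnel Engine (Zeta), Flink, and Spark are currently supported.
- Plugin-based design: the plugin architecture makes it easy to develop your own connector and integrate it into the SeaTunnel project. SeaTunnel currently supports more than 100 connectors, and the number keeps growing.
- Unified batch and streaming: connectors built on the SeaTunnel Connector API work for offline, real-time, full, and incremental synchronization alike, which greatly reduces the effort of managing data integration jobs.
- Data consistency through distributed snapshots: SeaTunnel supports a distributed snapshot algorithm to keep data accurate and complete as it moves.
- Multi-engine support: SeaTunnel uses SeaTunnel Engine (Zeta) for synchronization by default, but it can also use Flink or Spark as the connector execution engine to fit a company's existing stack, and it is compatible with multiple Spark and Flink versions.
- JDBC multiplexing and multi-table database log parsing: multi-table and whole-database synchronization are supported, which solves the problem of excessive JDBC connections; multi-table and whole-database log reading and parsing are supported, which avoids reading and parsing the same log repeatedly in multi-table CDC scenarios.
- High throughput, low latency: parallel reads and writes give stable, reliable synchronization with high throughput and low latency.
- Thorough real-time monitoring: detailed monitoring information is available for every step of the synchronization process, so users can easily see the number of records read and written, the data size, QPS, and other key figures for a job.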
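Powerful as SeaTunnel is, deploying it with Docker can still run into a number of thorny problems. The rest of this article walks through those problems and their solutions.
2. The Official Way to Deploy SeaTunnel with Docker
The official documentation offers three deployment options: local, Docker, and Kubernetes. This article focuses on Docker, using the docker-compose file provided by the project. The official example is as follows: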
version: '3'
services:
  master:
    image: apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2
  worker1:
    image: apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3
  worker2:
    image: apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4
networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
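A minimal sketch of starting this cluster and checking on it, assuming the file above is saved as docker-compose.yml in the current directory and the Docker Compose v2 plugin (`docker compose`) is installed. The `DOCKER` variable and the helper function names are illustrative and not part of SeaTunnel; setting `DOCKER=echo` prints each command instead of running it.

```shell
#!/bin/sh
# DOCKER is a hypothetical override hook: set DOCKER=echo for a dry run.
DOCKER="${DOCKER:-docker}"

# Start the master and both workers in the background.
cluster_up() { $DOCKER compose up -d; }

# List the containers defined in the compose file and their state.
cluster_ps() { $DOCKER compose ps; }

# Show the last N log lines of one container (N defaults to 50).
container_logs() { $DOCKER logs --tail "${2:-50}" "$1"; }
```

A typical sequence would be `cluster_up && cluster_ps`, followed by `container_logs seatunnel_master` to confirm that the master started cleanly.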
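3. Common Pitfalls When Deploying SeaTunnel with Docker, and How to Fix Them
Pitfall 1: The image cannot be downloaded
Problem: when pulling the apache/seatunnel image, its default full path docker.io/apache/seatunnel is not reachable from mainland China, so the pull fails and the deployment stops.
Solution:
1. Temporary workaround: change the image name to docker.1ms.run/apache/seatunnel so that it is pulled through a mirror, and carry on with the deployment.
2. Permanent fix:
a. Edit /etc/docker/daemon.json and configure registry mirrors: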
sudo vim /etc/docker/daemon.json
{
  "registry-mirrors": [
    "https://docker.1ms.run",
    "https://docker.xuanyuan.me"
  ]
}
b. Restart Docker:
sudo systemctl daemon-reload
sudo systemctl restart docker
Note: for more available registry mirrors, see this blog post: https://xuanyuan.me/blog/archives/1154
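The two approaches above can be sketched in shell as follows. This is only an illustration: the function names and the `DOCKER` override variable are made up for this example, and the mirror host is the one used in this article. Pulling through the mirror and then tagging the image under its usual name means compose files that still say `image: apache/seatunnel` keep working.

```shell
#!/bin/sh
# DOCKER is a hypothetical override hook: set DOCKER=echo for a dry run.
DOCKER="${DOCKER:-docker}"

# Prefix a Docker Hub repository name with a mirror host.
mirrored_ref() { printf '%s/%s\n' "$1" "$2"; }

# Pull through the mirror, then tag the image under its usual name.
pull_via_mirror() {
  ref="$(mirrored_ref "$1" "$2")"
  $DOCKER pull "$ref" && $DOCKER tag "$ref" "$2"
}

# Print the mirrors the daemon loaded (run this after restarting Docker).
show_mirrors() { $DOCKER info | grep -A 3 'Registry Mirrors'; }
```

For example, `pull_via_mirror docker.1ms.run apache/seatunnel` covers the temporary workaround, while `show_mirrors` is a quick way to confirm that the daemon.json change was picked up.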
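Pitfall 2: All job logs end up in one file
Problem: by default SeaTunnel is configured to use a mixed log file, so the logs of every job are written to the SeaTunnel Engine system log file. That makes finding and analyzing the log of one particular job very difficult.
Solution: update log4j2.properties so that each job gets its own log file. Change the setting to rootLogger.appenderRef.file.ref = routingAppender; after that, every job writes to a separate file such as job-xxx1.log, job-xxx2.log, job-xxx3.log. For the change to take effect, the updated log4j2.properties has to be mounted into the containers. The updated docker-compose configuration looks like this: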
version: '3'
services:
  master:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # Mount the logging configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2
  worker1:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # Mount the logging configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3
  worker2:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # Mount the logging configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4
networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
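One way to produce the ./config/log4j2.properties file that the compose file mounts is to copy the default out of the image and patch the single line mentioned above. The sketch below assumes the image ships `cat` and keeps its configuration at /opt/seatunnel/config, as the mount paths in this article suggest; the function names and the `DOCKER` override are hypothetical.

```shell
#!/bin/sh
# DOCKER is a hypothetical override hook, not a SeaTunnel setting.
DOCKER="${DOCKER:-docker}"

# Copy the image's default log4j2.properties into ./config for editing.
export_log4j2() {
  mkdir -p config &&
  $DOCKER run --rm --entrypoint cat "${1:-apache/seatunnel}" \
    /opt/seatunnel/config/log4j2.properties > config/log4j2.properties
}

# Point the root logger's file appender reference at routingAppender.
# Writes to a temporary file first so it works with both GNU and BSD sed.
use_routing_appender() {
  file="$1"
  sed 's/^rootLogger\.appenderRef\.file\.ref[ ]*=.*/rootLogger.appenderRef.file.ref = routingAppender/' \
    "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
```

Running `export_log4j2 docker.1ms.run/apache/seatunnel` followed by `use_routing_appender config/log4j2.properties` leaves every other line of the file untouched, which keeps the change easy to review.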
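Pitfall 3: RESTful API V2 is unreachable
Problem: after deployment, requests to RESTful API V2 cannot connect, so SeaTunnel cannot be managed through the API.
Solution: two things have to be configured. First, enable the HTTP service in seatunnel.yaml: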
seatunnel:
  engine:
    http:
      enable-http: true
      port: 8080
      enable-dynamic-port: true
      port-range: 100
Second, expose the HTTP port in the docker-compose file. The complete docker-compose configuration:
version: '3'
services:
  master:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # Mount the logging configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
      - "8080:8080"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2
  worker1:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # Mount the logging configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3
  worker2:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # Mount the logging configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4
networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
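Once the port is published, a quick request from the host shows whether the API is reachable. The sketch below is an assumption-laden illustration: the endpoint paths `overview` and `running-jobs` come from my reading of the REST API V2 documentation and should be checked against the SeaTunnel version in use, and the helper names are made up for this example.

```shell
#!/bin/sh
# Build a REST API V2 URL from host, port, and path (leading "/" optional).
rest_url() { printf 'http://%s:%s/%s\n' "$1" "$2" "${3#/}"; }

# GET an endpoint; -f makes curl exit non-zero on an HTTP error status.
rest_get() { curl -fsS "$(rest_url "$1" "$2" "$3")"; }
```

For instance, `rest_get localhost 8080 overview` should return a JSON summary if the HTTP service is enabled and port 8080 is mapped; a connection error points back at one of the two settings above.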
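Pitfall 4: Monitoring metrics are not reported
Problem: monitoring has been set up, but no metrics show up, so the key figures of the synchronization process are unavailable and the state of running jobs is hard to judge.
Solution: check the telemetry settings in seatunnel.yaml and make sure the following is configured: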
seatunnel:
  engine:
    telemetry:
      metric:
        enabled: true
With this setting in place, metrics are reported and reflect the state of data synchronization in real time.
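A rough way to confirm that metrics are being exported is to fetch them and count the sample lines. This sketch assumes the metrics are served in Prometheus text format at `/metrics` on the HTTP port configured for REST API V2; that path is an assumption to verify against the telemetry documentation for your version, and both function names are hypothetical.

```shell
#!/bin/sh
# Fetch metrics from the master; the /metrics path is an assumption.
fetch_metrics() { curl -fsS "http://${1:-localhost}:${2:-8080}/metrics"; }

# Count sample lines read from stdin, skipping comments and blank lines.
count_samples() { grep -v -e '^#' -e '^$' | wc -l | tr -d ' '; }
```

`fetch_metrics | count_samples` printing a number greater than zero suggests telemetry is active; a result of 0 or a curl error suggests the setting did not reach the container.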
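Pitfall 5: Wrong timestamps in the console log
Problem: timestamps in the console log do not match the actual local time, which makes troubleshooting and reconstructing the order of events much harder.
Solution: set the correct time zone in the docker-compose configuration. The example below adds the time zone setting: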
version: '3'
services:
  master:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the logging configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
      - "8080:8080"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2
  worker1:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the logging configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3
  worker2:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the logging configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4
networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
With the time zone set to Asia/Shanghai, the console log shows the correct local time.
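To check that the setting reached a container, it can help to ask the container itself for the time and for its TZ variable. A small sketch, assuming the containers are running under the names used in this article; the helper names and the `DOCKER` override are hypothetical.

```shell
#!/bin/sh
# DOCKER is a hypothetical override hook: set DOCKER=echo for a dry run.
DOCKER="${DOCKER:-docker}"

# Print the current time and zone abbreviation as seen inside a container.
container_time() { $DOCKER exec "$1" date '+%Y-%m-%d %H:%M:%S %Z'; }

# Print the TZ variable the container was started with.
container_tz() { $DOCKER exec "$1" printenv TZ; }
```

If `container_tz seatunnel_master` prints Asia/Shanghai but `container_time seatunnel_master` still shows UTC, the image may be missing time zone data, which would need a different fix than the environment variable.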
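Pitfall 6: Metadata is lost when a container restarts
Problem: after a container restart, metadata such as the cluster state (job run state, resource state) and the state of every job and its tasks is gone. For a production environment that has to run continuously, that is a disaster.
Solution: by default SeaTunnel Engine keeps this data in IMap, so IMap has to be persisted. The officially recommended deployment is separated cluster mode, in which only Master nodes store IMap data and Worker nodes do not, so only hazelcast-master.yaml needs to change. This article uses MinIO as the object store for IMap data; add the following to hazelcast-master.yaml: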
map:
  engine*:
    map-store:
      enabled: true
      initial-mode: EAGER
      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
      properties:
        type: hdfs
        namespace: /seatunnel/imap
        clusterName: seatunnel-cluster
        storage.type: s3
        s3.bucket: s3a://seatunnel-dev
        fs.s3a.access.key: etoDbE8uGdpg3ED8
        fs.s3a.secret.key: 6hkb90nPCaMrBcbhN1v5iC0QI0MeXDOk
        fs.s3a.endpoint: http://10.1.4.155:9000
        fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
Then mount the hazelcast-master.yaml file into the master container. The updated docker-compose configuration:
version: '3'
services:
  master:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the logging configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
      # Persist metadata (the state of each job and its tasks, so that if the node running a job goes down, another node can read the previous state and recover the job): https://seatunnel.apache.org/zh-CN/docs/2.3.9/seatunnel-engine/separated-cluster-deployment
      - ./config/hazelcast-master.yaml:/opt/seatunnel/config/hazelcast-master.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
      - "8080:8080"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2
  worker1:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the logging configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3
  worker2:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the logging configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4
networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
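Pitfall 7: Checkpoints are lost when a container restarts
Problem: as with metadata, checkpoints are lost when a container restarts. This seriously hurts the continuity and reliability of synchronization jobs and can lead to inconsistent data.
Solution: store checkpoints in object storage. Using MinIO as the example, configure seatunnel.yaml as follows: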
checkpoint:
  interval: 10000
  timeout: 60000
  storage:
    type: hdfs
    max-retained: 3
    plugin-config:
      storage.type: s3
      s3.bucket: s3a://seatunnel-dev
      fs.s3a.access.key: ST4HTeGdARHk7Drf
      fs.s3a.secret.key: zyiJYIpYy0ewiozse6kSLIQG62vO9IUh
      fs.s3a.endpoint: http://10.1.4.155:9000
      fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
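Both this checkpoint configuration and the IMap configuration in Pitfall 6 write to the seatunnel-dev bucket, so it is worth making sure the bucket exists and, later, that objects are actually being written. The sketch below uses the MinIO client `mc` and assumes it is installed on the host; the alias name, the function names, and the `MC` override variable are illustrative only.

```shell
#!/bin/sh
# MC is a hypothetical override hook: set MC=echo for a dry run.
MC="${MC:-mc}"

# Register the MinIO endpoint under a local alias (the alias name is arbitrary).
minio_alias() { $MC alias set "$1" "$2" "$3" "$4"; }

# Create the bucket if it is missing; --ignore-existing keeps this idempotent.
ensure_bucket() { $MC mb --ignore-existing "$1/$2"; }

# List everything stored under a prefix in the bucket.
list_prefix() { $MC ls --recursive "$1/$2/$3"; }
```

After `minio_alias st http://10.1.4.155:9000 <access-key> <secret-key>` and `ensure_bucket st seatunnel-dev`, running `list_prefix st seatunnel-dev seatunnel/imap` once a job has started should show the persisted IMap files under the namespace configured in Pitfall 6.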
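4. Summary
SeaTunnel is an Apache open-source project led by Chinese developers, and its documentation and code are relatively easy to follow. Even so, real deployments run into all sorts of problems. The fixes for the pitfalls above can in fact all be found in the official documentation; it is just that the documentation is currently organized in a somewhat scattered way, and it takes careful reading to dig them out.
For reference, here is the complete docker-compose configuration.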
version: '3'
services:
  master:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the logging configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
      # Persist metadata (the state of each job and its tasks, so that if the node running a job goes down, another node can read the previous state and recover the job): https://seatunnel.apache.org/zh-CN/docs/2.3.9/seatunnel-engine/separated-cluster-deployment
      - ./config/hazelcast-master.yaml:/opt/seatunnel/config/hazelcast-master.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
      - "8080:8080"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2
  worker1:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the logging configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3
  worker2:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the logging configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4
networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
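I hope this article serves as a useful guide for deploying SeaTunnel with Docker and helps you get the most out of its data integration capabilities. If you have questions or run into a problem not covered here, feel free to leave a comment. And if you found the article helpful, please like and share it so that more people can benefit.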
