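DuckDB
Limitations of DuckDB
- Several processes can read the same database file concurrently when they all open it in read-only mode, but only one process at a time can open it for writing.
- Within a single process, multiple threads can write, but transactions that modify the same rows conflict and one of them fails.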
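What DuckDB is good for
- Data exchange format. It suits larger data transfers particularly well and beats CSV: it supports primary and foreign key constraints and NOT NULL constraints, every column is strongly typed (which keeps dirty data out), and the columnar database file compresses well.
- Data processing engine. It reads and writes CSV, JSON, and Parquet files, and can even read from and write to an RDBMS through its database extensions, so you can analyze and process the data with DuckDB's powerful SQL features.
- Data exploration tool. It integrates easily into Jupyter Notebook and can read directly from Python DataFrame objects, so you can analyze them with SQL. SQL results can also be written back to a DataFrame, which gives you the exploration strengths of both SQL and DataFrames.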
Example: querying a DataFrame with DuckDB
import duckdb
import pandas as pd
# Create a DataFrame, in this case using Pandas
my_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
# Query it directly with SQL - no explicit conversion needed
result = duckdb.sql("SELECT * FROM my_df WHERE a > 1")
Converting a DuckDB result to a DataFrame and visualizing it
import duckdb
import polars as pl
import plotly.express as px
# Use SQL to load data and apply a complex aggregation
regional_summary = duckdb.query("""
    SELECT
        region,
        SUM(sales) as total_sales,
        COUNT(DISTINCT customer_id) as customer_count,
        SUM(sales) / COUNT(DISTINCT customer_id) as sales_per_customer
    FROM read_csv('sales_data*.csv')
    WHERE sale_date >= '2024-01-01'
    GROUP BY region
    ORDER BY total_sales DESC
""").pl()
# Use summarised data in Polars for visualization
fig = px.bar(
    regional_summary,
    x="region",
    y="sales_per_customer",
)
fig.show()
References
https://endjin.com/blog/2025/04/duckdb-in-depth-how-it-works-what-makes-it-fast
https://endjin.com/blog/2025/04/duckdb-in-practice-enterprise-integration-architectural-patterns
https://www.timestored.com/data/duckdb/
https://github.com/davidgasquez/awesome-duckdb
Building a complete data warehouse architecture
https://dlthub.com/blog/dlt-motherduck-demo
https://datawise.dev/a-portable-data-stack-with-dagster-docker-duckdb-dbt-and-superset
The first post above (dlthub.com) presents a data warehouse architecture built on dlt, dbt, and DuckDB.


