Installing markitdown and basic usage
Welcome to my blog!
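Installation
microsoft/markitdown: Python tool for converting files and office documents to Markdown.
The official repository gives two installation methods: installing from PyPI with pip install markitdown, or building from source.
As of 2025-02-24, pip install markitdown installs markitdown-0.0.1a4, while the latest version is markitdown-0.0.2a1, so I recommend the second method, building from source: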
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e packages/markitdown
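To confirm which release actually ended up in the environment, a small check like the one below can help. It is a minimal sketch using only the standard library. The helper name installed_version is my own, and it assumes the distribution is registered under the name markitdown.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "markitdown" is the distribution name assumed from the pip command above.
    print(installed_version("markitdown"))
```

If this prints None, the package is not visible to the interpreter that ran the script, which usually means pip installed it into a different environment.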
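Main options
-h, --help: show the help message
-v, --version: show the version number
-o OUTPUT, --output OUTPUT: set the output file name (if omitted, the result is printed to the console)
-d, --use-docintel: use Document Intelligence to extract text (requires a valid Document Intelligence endpoint)
-p, --use-plugins: use third-party plugins to convert files
--list-plugins: list the installed third-party plugins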
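Usage
Basic information
Running markitdown -v on the command line prints the version:
C:\Users\Vanilla> markitdown -v
markitdown 0.0.2a1
Print the help message: markitdown -h
Check the third-party plugins: markitdown --list-plugins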
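Testing a docx file
I used my earlier MCM paper for this test.
This complete mathematical modeling paper has all the usual parts: formulas, images, tables, captions, multi-level headings, bold, italics, links, list markers, and page headers. The display formulas were written with MathType, and the inline formulas with Word's built-in equation editor.
Command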
Measure-Command {
markitdown .\MCM-finish.docx -o docx.md
}
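Measure-Command is specific to PowerShell. On other shells, a rough equivalent can be sketched in Python as below. The helper time_command is a name I made up, and the markitdown call in the comment assumes the executable is on PATH.

```python
import subprocess
import sys
import time
from typing import List


def time_command(argv: List[str]) -> float:
    """Run a command to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Smoke test with the current interpreter. Swap in the call from this post,
    # e.g. ["markitdown", "MCM-finish.docx", "-o", "docx.md"].
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"{elapsed:.3f} s")
```

This measures wall-clock time including process start-up, so small differences between runs are expected.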
Partial test results
Summary section

Extracted result:
Saving Juneau: Sustainable Development in Tourism
**Summary**
Excessive tourism in Juneau City has caused environmental and social challenges. To address these issues and promote sustainable development, we developed a multi-objective optimization model for sustainable tourism and applied it to Juneau City.
We constructed a general multi-objective optimization model with **tourist numbers** as the decision variable. The **objective function** integrates economic, environmental, and social factors, resulting in six goals. **Constraints** include carbon emissions, water resource utilization, and waste management. Further research will refine this model for application in other cities.
Observations:
- The page header was not extracted at all
- The title Saving Juneau: Sustainable Development in Tourism was originally a title, but here it became normal text
- Bold is preserved
Table of contents section

Extracted result:
**Contents**
[1 Introduction 3](#_Toc188935048)
[1.1 Background 3](#_Toc188935049)
[1.2 Restatement of the problem 3](#_Toc188935050)
[1.3 Our works 4](#_Toc188935051)
[2 Model Preparation 4](#_Toc188935052)
[2.1 Assumptions and Justifications 4](#_Toc188935053)
[2.2 Notations 5](#_Toc188935054)
[3 Juneau: A Sustainable Tourism Model 6](#_Toc188935055)
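- The original table of contents was clickable. The conversion keeps the jump fields (the #_Toc… anchors), but they are completely unusable.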
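Since the #_Toc targets do not work in the output, one option is to reduce those links to their visible text in a post-processing step. The following is only a sketch of that idea. The function name and the regular expression are my own and assume the links look exactly like the ones shown above.

```python
import re

# Matches Markdown links whose target is a Word bookmark such as #_Toc188935048.
TOC_LINK = re.compile(r"\[([^\]]+)\]\(#_Toc\d+\)")


def strip_toc_links(markdown: str) -> str:
    """Replace dead #_Toc links with their visible text."""
    return TOC_LINK.sub(r"\1", markdown)


if __name__ == "__main__":
    print(strip_toc_links("[1.1 Background 3](#_Toc188935049)"))
```

Links to other targets are left untouched, because the pattern only matches targets that start with #_Toc.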
Body section

Converted result:
# Introduction
## Background

Figure :Current situation map of Juneau City[1]
In 2023, Juneau, Alaska, hosted 1.6 million cruise passengers, with a daily peak of up to 20,000 visitors. While this influx brought significant economic benefits, it also caused overcrowding and accelerated glacial retreat, impacting natural attractions and potentially deterring future tourists. Additionally, excessive tourism has increased hidden costs related to infrastructure strain, environmental damage, and social challenges.
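Observations:
- Level-1 and level-2 headings are converted correctly
- The image seems meant to be embedded as base64, but
  - the image content was not actually converted
  - the image description is just the hint that Word generates automatically, "圖示描述已自動生成" ("diagram description automatically generated"), so where did the generated description itself go?
  - why are there two line breaks in the middle of the image description?
- The figure caption became normal text, but the figure number (which holds field information) disappeared
- Citations became plain text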
Notation section

Converted result:
* **Assumption 2:** Ignoring the carbon footprint caused by tourists' use of transportation within the city of Juneau.
* Justification: Juneau has no direct roads. Most tourists choose cruise ships or planes to reach there. In contrast, the carbon footprint generated by tourists' sightseeing within the city can be negligible.
## Notations
| Notation | Description | Unit |
| --- | --- | --- |
| | Direct income from tourism | USD |
| | The i-th source of direct income from tourism | USD |
| | Tax revenue | USD |
| | Daily water consumption per tourist | L/person/day |
| | Carbon footprint | t |
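Observations:
- List markers are converted successfully; the output uses * as the marker
- Tables are converted correctly
- The leftmost column of the table holds Word formulas, and all of them disappeared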
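Because the lost formulas leave blank cells behind, the affected rows can be located mechanically. The sketch below flags table rows whose first cell is empty. The function name is my own, and it assumes every row starts with a pipe, as in the converted output above.

```python
from typing import List


def rows_with_empty_first_cell(table_lines: List[str]) -> List[int]:
    """Return the indexes of Markdown table rows whose first cell is blank."""
    hits: List[int] = []
    for index, line in enumerate(table_lines):
        stripped = line.strip()
        if not stripped.startswith("|"):
            continue  # not a table row
        # Drop the leading pipe, then the text before the next pipe is cell one.
        first_cell = stripped[1:].split("|")[0]
        if first_cell.strip() == "":
            hits.append(index)
    return hits


if __name__ == "__main__":
    table = [
        "| Notation | Description | Unit |",
        "| --- | --- | --- |",
        "| | Direct income from tourism | USD |",
        "| | Tax revenue | USD |",
    ]
    print(rows_with_empty_first_cell(table))
```

The flagged rows are the ones whose notation has to be re-entered by hand from the original document.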
Appendix section

# References
1. Background image source: Travel Juneau. (n.d.). *Home*. <https://www.traveljuneau.com/>
2. LSC Transportation Consultants, Inc. (2024). *Juneau visitor circulator study final report (Prepared for City and Borough of Juneau)*. <https://juneau.org/wp-content/uploads/2024/02/Juneau-Visitor-Circulator-Study-Final-Report-2024-1.pdf>
Observations:
- Italics are preserved
- Links work, but the output uses the <link> form rather than the [name](link) form that is more common in Markdown
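If the [name](link) form is preferred, the autolinks can be rewritten after conversion. This is a minimal sketch under my own assumptions: the function name is made up, only http and https URLs are handled, and the URL itself is reused as the link text because the converted output carries no separate name.

```python
import re

# Matches autolinks such as <https://www.traveljuneau.com/>.
AUTOLINK = re.compile(r"<(https?://[^<>\s]+)>")


def autolinks_to_inline(markdown: str) -> str:
    """Rewrite <url> autolinks as [url](url) inline links."""
    return AUTOLINK.sub(r"[\1](\1)", markdown)


if __name__ == "__main__":
    print(autolinks_to_inline("*Home*. <https://www.traveljuneau.com/>"))
```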
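Test summary
| Document part | Result | Notes |
|---|---|---|
| File type | docx | Latest version of Word |
| File size | 16.3MB | Many high-resolution images; 25 pages, 39,578 characters including spaces |
| Conversion time | 2.4986917 s | Quite fast |
| Formulas | × | All formulas disappeared |
| Images | × | Completely unusable |
| Tables | √ | |
| Captions | √ | Became plain text |
| Multi-level headings | √ | Multi-level headings are fine; the title became normal text |
| Bold | √ | |
| Italics | √ | |
| Links | √ | |
| List markers | √ | |
| Page header | × | Missing |
| Table of contents | √ | Plain text; the field jumps do not work |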
Testing a PDF file
Command
Measure-Command {
markitdown .\MCM-finish.pdf -o pdf.md
}
Partial test results
Summary section

Extracted result:
Problem Chosen
X
2025
MCM/ICM
Summary Sheet
Team Control Number
XXXXXXX
Saving Juneau: Sustainable Development in Tourism
Summary
Excessive tourism in Juneau City has caused environmental and social challenges.
To address these issues and promote sustainable development, we developed a multi-
objective optimization model for sustainable tourism and applied it to Juneau City.
We constructed a general multi-objective optimization model with tourist
numbers as the decision variable. The objective function integrates economic,
environmental, and social factors, resulting in six goals. Constraints include carbon
emissions, water resource utilization, and waste management. Further research will
refine this model for application in other cities.
Task 1: We extended the model by adding sales tax and hotel tax as decision
variables and maximizing tax revenue with related constraints. Using literature review
and linear regression, we determine the values, estimate the parameters and applied the
NSGA-II algorithm to find Pareto optimal solutions. The entropy weight method
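Observations:
- The header text is converted this time; the layout is a bit messy, but at least it is there
- Every automatic line wrap in the summary became a line break. This is probably related to how PDF stores paragraphs (each line is stored separately)
- No formatting survives at all (the bold is gone)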
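The hard line breaks can be undone with a small post-processing pass. The sketch below is one possible heuristic, not something markitdown provides. It assumes blank lines mark paragraph boundaries and that a trailing hyphen belongs to a compound word, as in "multi-" followed by "objective" above.

```python
from typing import List


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines; blank lines are kept as paragraph breaks."""
    paragraphs: List[str] = []
    current = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if not current:
            current = line
        elif current.endswith("-"):
            # Assumption: the hyphen is part of a compound word, so keep it
            # and join without a space.
            current += line
        else:
            current += " " + line
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = "we developed a multi-\nobjective optimization model\nfor sustainable tourism."
    print(unwrap_paragraphs(sample))
```

If the extracted text has no blank lines between paragraphs, this merges everything into a single paragraph, so the boundaries would need another signal, such as the Task 1: prefixes.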
Table of contents section

Extracted result:
Team#XXXXXXX
Page 2 of 25
Contents
1
Introduction ..................................................................................................... 3
1.1 Background ......................................................................................... 3
1.2 Restatement of the problem ................................................................. 3
1.3 Our works ............................................................................................ 4
2 Model Preparation ........................................................................................... 4
2.1 Assumptions and Justifications ........................................................... 4
2.2 Notations ............................................................................................. 5
3 Juneau: A Sustainable Tourism Model ............................................................ 6
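- The header part is fine
- No jump field information is kept
- What you see is what you get: all text in the PDF was converted successfully, so the text information is preserved as fully as possible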
Body section

Converted result:
1 Introduction
1.1 Background
Figure 1:Current situation map of Juneau City[1]
In 2023, Juneau, Alaska, hosted 1.6 million cruise passengers, with a daily peak
of up to 20,000 visitors. While this influx brought significant economic benefits, it also
caused overcrowding and accelerated glacial retreat, impacting natural attractions and
potentially deterring future tourists. Additionally, excessive tourism has increased
hidden costs related to infrastructure strain, environmental damage, and social
challenges.
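Observations:
- There is no level-1 or level-2 heading formatting, but the level-1 and level-2 numbers are kept
- The image disappeared completely
- The image description is of course gone as well, although the caption text itself survives as plain text (Figure 1:Current situation map of Juneau City[1])
- Citations became plain text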
Notation section

Converted result:
? Justification: Juneau has no direct roads. Most tourists choose cruise ships or
planes to reach there. In contrast, the carbon footprint generated by tourists'
sightseeing within the city can be negligible.
2.2 Notations
Notation
Description
Direct income from tourism
The i-th source of direct income from tourism
Tax revenue
Daily water consumption per tourist
……
Unit
USD
USD
USD
L/person/day
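Observations:
- The list marker was converted into something unrecognizable: ?
- The table formatting failed to convert; only the text remains
- The leftmost column of the table holds Word formulas, and all of them disappeared
- The cells are emitted column by column (all descriptions first, then all units) rather than row by row, which breaks the logical reading order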
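Once the flat text has been split back into its columns by hand, rebuilding the table is mechanical. The following sketch shows that last step only. The function name is my own, and it assumes every column has the same number of cells.

```python
from typing import List


def columns_to_markdown(headers: List[str], columns: List[List[str]]) -> str:
    """Build a Markdown table from per-column cell lists of equal length."""
    if len(headers) != len(columns):
        raise ValueError("one header per column is required")
    if len({len(column) for column in columns}) > 1:
        raise ValueError("all columns must have the same number of cells")
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    # zip(*columns) turns the column lists back into rows.
    for row in zip(*columns):
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    descriptions = ["Direct income from tourism", "Tax revenue"]
    units = ["USD", "USD"]
    print(columns_to_markdown(["Description", "Unit"], [descriptions, units]))
```

The Notation column cannot be recovered this way, because its formulas are missing from the extracted text altogether.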
Appendix section

References
[1] Background image source: Travel Juneau. (n.d.). Home. https://www.traveljuneau.com/
[2] LSC Transportation Consultants, Inc. (2024). Juneau visitor circulator study final repo
rt (Prepared for City and Borough of Juneau). https://juneau.org/wp-content/uploads/20
24/02/Juneau-Visitor-Circulator-Study-Final-Report-2024-1.pdf
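Observations:
- Italic formatting is gone
- Some links work and some do not, because line breaks cut the links apart
- Links are bare URLs rather than Markdown link syntax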
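The truncated links can be repaired by gluing continuation lines back onto their reference entry. This is a sketch under two assumptions of mine: each entry starts with a bracketed number such as [2], and the breaks fall at a fixed width (even mid-word or mid-URL, as in the sample above), so the pieces are joined without a space.

```python
import re
from typing import List

# A new reference entry starts with a bracketed number such as "[2] ".
ENTRY_START = re.compile(r"^\[\d+\]\s")


def rejoin_references(lines: List[str]) -> List[str]:
    """Glue continuation lines onto the entry they belong to, adding no space."""
    entries: List[str] = []
    for line in lines:
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            # Assumption: the wrap may fall mid-word or mid-URL, so the
            # pieces are concatenated without a separator.
            entries[-1] += line
    return entries


if __name__ == "__main__":
    wrapped = [
        "[2] LSC Transportation Consultants, Inc. (2024). Juneau visitor circulator study final repo",
        "rt (Prepared for City and Borough of Juneau). https://juneau.org/wp-content/uploads/20",
        "24/02/Juneau-Visitor-Circulator-Study-Final-Report-2024-1.pdf",
    ]
    for entry in rejoin_references(wrapped):
        print(entry)
```

If the breaks fell between words instead, joining without a space would fuse neighbouring words, so this only suits output wrapped the way the sample is.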
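Test summary
| Document part | Result | Notes |
|---|---|---|
| File type | pdf | |
| File size | 5.42MB | Many high-resolution images; 25 pages, 39,578 characters including spaces |
| Conversion time | 12.411024 s | Slower than docx to md, taking roughly 5 times as long |
| Formulas | × | All formulas disappeared |
| Images | × | Images disappeared |
| Tables | × | Table formatting disappeared |
| Captions | √ | Became plain text |
| Multi-level headings | × | Became normal text (with numbering) |
| Bold | × | |
| Italics | × | |
| Links | × | Plain text, and cut apart by line breaks |
| List markers | × | Plain text |
| Page header | √ | Plain text |
| Table of contents | √ | Plain text only |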
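The "roughly 5 times" figure can be checked directly from the two timings. The snippet below is just that arithmetic, assuming both values come from the same Measure-Command field and are therefore in the same unit.

```python
docx_time = 2.4986917  # docx conversion time from the first summary table
pdf_time = 12.411024   # PDF conversion time from the second summary table

# Both values are assumed to share one unit, so the ratio is unit-free.
ratio = pdf_time / docx_time
print(f"PDF conversion took {ratio:.2f} times as long as docx")
```

The ratio comes out just under 5, which matches the estimate in the table.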
