Writing a Blog Export Tool in Python

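I set up a personal blog on GitHub with Octopress, which uses Markdown for writing posts. I had previously written quite a few technical posts on my CSDN blog, and as the saying goes, your own child is a treasure however plain, so I got the idea of moving my CSDN articles over to the personal blog. At first I looked for a tool that would export the CSDN blog to XML or plain text, which I could then convert to Markdown posts. Unfortunately, most of the existing blog export tools I found charge a fee before they will export every post to XML, so I had no choice but to reinvent the wheel: write a tool that exports all posts as Markdown (which is plain text too).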

羅朝輝 (http://kesalin.github.io/)

Licensed under Creative Commons; please credit the source when reposting.

Foreword

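I will walk through how this tool was written in detail, in the hope that people who have never studied programming can pick up enough simple Python to modify the script and export other kinds of blogs to text. This was also my first time learning and using Python, so trust me: you can export your own blog to whatever text format you want, too.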

The source code for this article is here: ExportCSDNBlog.py

Since most non-programmers use Windows, the rest of this article describes how to write the tool on Windows.

Downloading the tools

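Install the Python development environment on Windows (on Linux/Mac, just install the corresponding packages with pip; programmers can sort that out themselves):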

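Python 2.7.3
Please install this version; newer versions of Python are incompatible with some of the libraries.
Download page
Once downloaded, double-click the executable to install it. This article assumes the install directory is C:\Python2.7 (the installer's own default is C:\Python27).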

six
Download page. After downloading, extract it into the Python installation directory, e.g. C:\Python2.7\six-1.8.0.

BeautifulSoup 4.3.2
Download page. After downloading, extract it into the Python installation directory, e.g. C:\Python2.7\BeautifulSoup.

html5lib
Download page. After downloading, extract it into the Python installation directory, e.g. C:\Python2.7\html5lib-0.999.

Installing the tools

Open a command prompt on Windows, go into each of the following directories in turn, and run setup.py install:

C:\Python2.7\six-1.8.0>setup.py install  
C:\Python2.7\html5lib-0.999>setup.py install  
C:\Python2.7\BeautifulSoup>setup.py install
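
To confirm that the three packages installed correctly, a quick import check along the following lines may help. This is only a sketch and not part of ExportCSDNBlog.py: the helper name missing_modules is made up for illustration, and the only things taken as given are the import names six, bs4 and html5lib.

```python
import importlib


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Import names of the three packages installed above.
# An empty list means all of them are available.
print(missing_modules(["six", "bs4", "html5lib"]))
```

If a name shows up in the printed list, rerun setup.py install in the matching directory.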

Reference documentation

Python 2.x documentation
BeautifulSoup documentation
Regular expression documentation
Online regular expression tester

Python syntax used

The tool uses only basic Python syntax. If you have no Python background, a quick look at the following posts will be well worth it.

string: string operations; see "python: string的操作函數"
list: list operations; see "Python list 操作"
dictionary: dictionary operations; see "Python中dict詳解"
datetime: dates and times; see "python datetime處理時間"
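
For readers who have never seen these four building blocks, here is a tiny warm-up that exercises each of them in roughly the way the tool does later. It is a sketch for illustration only; the sample values (the href, the "[top]" marker, the post dictionary) are made up.

```python
import datetime

# string: find the last "/", slice, replace a substring, strip whitespace
href = "/kesalin/article/list/12"
last_slash = href.rfind("/")
page_num = int(href[last_slash + 1:])
title = "  [top] Hello  ".replace("[top]", "").strip()

# list: start empty, append items, read the last one by index
pages = []
for n in range(1, page_num + 1):
    pages.append("page " + str(n))
last_page = pages[-1]

# dictionary: store values under keys and look them up again
post = {"title": title, "category": "Python"}
category = post["category"]

# datetime: parse a timestamp string, then format it differently
post_date = datetime.datetime.strptime("2014-10-19 07:54", "%Y-%m-%d %H:%M")
date_str = post_date.strftime("%Y-%m-%d")
```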

Writing the blog export tool

Analysis

First, let's analyze what such a tool needs to do:

Export all CSDN blog articles as Markdown text.

This overall requirement can be split into two steps:

* Fetch the CSDN blog articles
* Convert the articles to Markdown text

For the first step: how do we fetch the blog articles?

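Open any CSDN blog and you will see page navigation at the bottom that reads "XXX條數據共XXX頁 1 2 3 … 尾頁" (XXX entries, XXX pages in total, 1 2 3 … last page). That is where we start. Each list page shows the titles and links of the articles that belong to it, so if we visit the list pages one by one we can pull every article title and link out of them. With those links in hand we can fetch the HTML of each article and then parse it to generate the corresponding Markdown text.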
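
The plan above boils down to a small driver loop. The sketch below is hypothetical: export_all is a name invented here, and the two callables it receives merely stand in for "list every article" and "download one article", so the flow can be read (and tried) without touching the network.

```python
def export_all(blog_url, output_dir, list_articles, download):
    """Step 1: collect (url, title) pairs. Step 2: convert each article."""
    exported = []
    for article_url, title in list_articles(blog_url):
        download(article_url, output_dir)
        exported.append(title)
    return exported
```

Passing the two steps in as parameters is just a convenience for experimenting; the script itself can call its own functions directly.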

Implementation

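As the analysis shows, we first need to derive all the list-page links from the home page, and then walk through each list page to collect the article links.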

The code that gets the list-page links:

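   # Excerpt from ExportCSDNBlog.py (Python 2). `header` and `log` are
   # defined elsewhere in the script.
   import urllib2
   from bs4 import BeautifulSoup

   def getPageUrlList(url):
      # Collect the URLs of all list pages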
      request = urllib2.Request(url, None, header)
      response = urllib2.urlopen(request)
      data = response.read()

      #print data
      soup = BeautifulSoup(data)

      lastArticleHref = None
      pageListDocs = soup.find_all(id="papelist")
      for pageList in pageListDocs:
          hrefDocs = pageList.find_all("a")
          if len(hrefDocs) > 0:
              # The last <a> in the pager is the "尾頁" (last page) link
              lastArticleHrefDoc = hrefDocs[-1]
              lastArticleHref = lastArticleHrefDoc["href"].encode('UTF-8')

      if lastArticleHref is None:
          return []
  
      #print " > last page href:" + lastArticleHref
      lastPageIndex = lastArticleHref.rfind("/")
      lastPageNum = int(lastArticleHref[lastPageIndex+1:])
      urlInfo = "http://blog.csdn.net" + lastArticleHref[0:lastPageIndex]

      pageUrlList = []
      for x in xrange(1, lastPageNum + 1):
          pageUrl = urlInfo + "/" + str(x)
          pageUrlList.append(pageUrl)
          log(" > page " + str(x) + ": " + pageUrl)

      log("total pages: " + str(len(pageUrlList)) + "\n")
      return pageUrlList

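The parameter is url = "http://blog.csdn.net/" + username, i.e. the address of your blog's home page. The urllib2 library opens this URL to issue a web request, and the HTML returned in the response is stored in data. You can uncomment the print data line to see exactly what came back.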
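
The request step can be tried on its own. The sketch below is an illustration under assumptions: the script's real header dictionary is not shown in this article, so the User-Agent value here is a placeholder, and build_request and fetch are names invented for the example. The import fallback lets the same code run on Python 3, where urllib2 became urllib.request.

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2, as used in this article

# Placeholder value: the script's actual `header` is not shown here.
header = {"User-Agent": "Mozilla/5.0"}


def build_request(username):
    """Build (but do not send) the request for a user's blog home page."""
    url = "http://blog.csdn.net/" + username
    return Request(url, None, header)


def fetch(username):
    """Send the request and return the raw response body."""
    response = urlopen(build_request(username))
    try:
        return response.read()
    finally:
        response.close()
```

Sending a browser-like User-Agent is a common precaution, because some sites refuse requests that carry the default library identifier.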

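With the HTML in hand, the next step is to parse it with BeautifulSoup, which saves us an enormous amount of work. I will explain its use in detail here and be brief when similar parsing comes up again later. soup.find_all(id="papelist") looks up every tag in the HTML page whose id is "papelist" and returns them as a list. On a CSDN blog page there is only one such place: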

<div id="papelist" class="pagelist">
  <span> 236條數據  共12頁</span>
  <strong>1</strong>
  <a href="/kesalin/article/list/2">2</a>
  <a href="/kesalin/article/list/3">3</a>
  <a href="/kesalin/article/list/4">4</a>
  <a href="/kesalin/article/list/5">5</a>
  <a href="/kesalin/article/list/6">...</a>
  <a href="/kesalin/article/list/2">下一頁</a>
  <a href="/kesalin/article/list/12">尾頁</a>
</div>
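
To see what "find the tag with id papelist, then list its links" involves, here is a standard-library stand-in that needs no extra packages. It is not BeautifulSoup and not the script's method: the class PageListLinks and the abbreviated SAMPLE markup are invented for illustration, and the depth counting assumes every tag inside the pager is properly closed.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class PageListLinks(HTMLParser):
    """Collect href values of <a> tags nested inside the id="papelist" tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0   # greater than 0 while inside the papelist element
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth > 0:
            self.depth += 1
            if tag == "a" and "href" in attrs:
                self.hrefs.append(attrs["href"])
        elif attrs.get("id") == "papelist":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1


# Abbreviated version of the pager markup shown above.
SAMPLE = """<div id="papelist" class="pagelist">
  <span>236 entries, 12 pages</span>
  <strong>1</strong>
  <a href="/kesalin/article/list/2">2</a>
  <a href="/kesalin/article/list/12">last</a>
</div>
<a href="/elsewhere">outside the pager</a>"""

parser = PageListLinks()
parser.feed(SAMPLE)
last_href = parser.hrefs[-1]
```

With BeautifulSoup the same lookup is the two find_all calls in getPageUrlList, which is why the library is worth installing.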

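Good, we now have the papelist tag object. From it we can find the <a> tag for the last page (尾頁), read its href attribute to get the last page number, 12, and then build the URL of every list page ourselves, saving them in pageUrlList as the return value. A list-page URL looks like this: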

> page 1: http://blog.csdn.net/kesalin/article/list/1
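
The URL-building step is plain string work and can be tested in isolation. The following is a sketch of the same logic as the tail of getPageUrlList; the function name build_page_urls and its site parameter are made up for this example.

```python
def build_page_urls(last_href, site="http://blog.csdn.net"):
    """Expand the last-page href into the URLs of list pages 1..N."""
    cut = last_href.rfind("/")            # position of the final "/"
    last_page = int(last_href[cut + 1:])  # e.g. "12" -> 12
    prefix = site + last_href[:cut]
    return [prefix + "/" + str(n) for n in range(1, last_page + 1)]


urls = build_page_urls("/kesalin/article/list/12")
```

Note that this relies on the last link in the pager always being the last-page link, as it is in the markup shown above.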

The code that gets the article links from the list pages:

   def getArticleList(url):
      # Collect the url/title of every article
      pageUrlList = getPageUrlList(url)
  
      articleListDocs = []

      strPage = " > parsing page {0}"
      pageNum = 0
      global gRetryCount
      for pageUrl in pageUrlList:
          retryCount = 0
          pageNum = pageNum + 1
          pageNumStr = strPage.format(pageNum)
          print pageNumStr

          while retryCount <= gRetryCount:
              try:
                  retryCount = retryCount + 1
                  time.sleep(1.0) # requesting too fast makes the site stop responding
                  request = urllib2.Request(pageUrl, None, header)
                  response = urllib2.urlopen(request)
                  data = response.read().decode('UTF-8')
  
                  #print data
                  soup = BeautifulSoup(data)
  
                  topArticleDocs = soup.find_all(id="article_toplist")
                  articleDocs = soup.find_all(id="article_list")
                  articleListDocs = articleListDocs + topArticleDocs + articleDocs
                  break
              except Exception, e:
                  print "getArticleList exception:%s, url:%s, retry count:%d" % (e, pageUrl, retryCount)
                  pass
  
      artices = []
      topTile = "[置頂]"
      for articleListDoc in articleListDocs:
          linkDocs = articleListDoc.find_all("span", "link_title")
          for linkDoc in linkDocs:
              #print linkDoc.prettify().encode('UTF-8')
              link = linkDoc.a
              url = link["href"].encode('UTF-8')
              title = link.get_text().encode('UTF-8')
              title = title.replace(topTile, '').strip()
              oneHref = "http://blog.csdn.net" + url
              #log("   > title:" + title + ", url:" + oneHref)
              artices.append([oneHref, title])

      log("total articles: " + str(len(artices)) + "\n")
      return artices

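The first step gave us all the list-page links in pageUrlList; next we use those pages to collect the link and title of every article on each page. The key code is these three lines: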

topArticleDocs = soup.find_all(id="article_toplist")
articleDocs = soup.find_all(id="article_list")
articleListDocs = articleListDocs + topArticleDocs + articleDocs

These look up the tag objects for pinned articles (article_toplist) and regular articles (article_list) in the page's HTML and save them in articleListDocs.

An article_toplist example (article_list has a similar format):

<div id="article_toplist" class="list">
    <div class="list_item article_item">
        <div class="article_title">
            <span class="ico ico_type_Original"></span>
            <h1>
                <span class="link_title">
                <a href="/kesalin/article/details/10474007">
                <font color="red">[置頂]</font>
                招聘：有興趣做一個與Android對等的操作系統么？
                </a>
                </span>
            </h1>
        </div>
        ... ...
    </div>
    ... ...
</div>

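We then loop over all the tag objects saved in articleListDocs and pull out the span tags with class link_title into linkDocs. From each of those we extract the link URL and the title, stripping the "[置頂]" (pinned) marker from the titles of pinned articles. Finally the URL and title are appended to the artices list, which is returned. An example entry in the artices list: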

title:招聘：有興趣做一個與Android對等的操作系統么？
url:http://blog.csdn.net/kesalin/article/details/10474007
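
The clean-up applied to each link can be shown without any HTML parsing. This is a small sketch of that step only; make_article_entry and TOP_MARK are names chosen for the example, and the raw title imitates what get_text() would return for the pinned article above, line breaks included.

```python
# -*- coding: utf-8 -*-
TOP_MARK = u"[置頂]"  # the "pinned" marker CSDN puts in front of the title


def make_article_entry(href, raw_title, site=u"http://blog.csdn.net"):
    """Return [absolute_url, cleaned_title] for one article link."""
    title = raw_title.replace(TOP_MARK, u"").strip()
    return [site + href, title]


entry = make_article_entry(
    u"/kesalin/article/details/10474007",
    u"\n [置頂]\n 招聘：有興趣做一個與Android對等的操作系統么？\n",
)
```

Calling strip() after the replacement matters, because removing the marker leaves the surrounding whitespace behind.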

Fetching an article's HTML from its link and converting it to Markdown text

   def download(url, output):
      # Download the article and save it in Markdown format
      log(" >> download: " + url)

      data = None
      title = ""
      categories = ""
      content = ""
      postDate = datetime.datetime.now()
  
      global gRetryCount
      count = 0
      while True:
          if count >= gRetryCount:
              break
          count = count + 1
          try:
              time.sleep(2.0) # requesting too fast makes the site stop responding
              request = urllib2.Request(url, None, header)
              response = urllib2.urlopen(request)
              data = response.read().decode('UTF-8')
              break
          except Exception,e:
              exstr = traceback.format_exc()
              log(" >> failed to download " + url + ", retry: " + str(count) + ", error:" + exstr)
              pass

      if data is None:
          log(" >> failed to download " + url)
          return

      #print data
      soup = BeautifulSoup(data)

      topTile = "[置頂]"
      titleDocs = soup.find_all("div", "article_title")
      for titleDoc in titleDocs:
          titleStr = titleDoc.a.get_text().encode('UTF-8')
          title = titleStr.replace(topTile, '').strip()
          #log(" >> title: " + title)

      manageDocs = soup.find_all("div", "article_manage")
      for managerDoc in manageDocs:
          categoryDoc = managerDoc.find_all("span", "link_categories")
          if len(categoryDoc) > 0:
              categories = categoryDoc[0].a.get_text().encode('UTF-8').strip()
  
          postDateDoc = managerDoc.find_all("span", "link_postdate")
          if len(postDateDoc) > 0:
              postDateStr = postDateDoc[0].string.encode('UTF-8').strip()
              postDate = datetime.datetime.strptime(postDateStr, '%Y-%m-%d %H:%M')

      contentDocs = soup.find_all(id="article_content")
      for contentDoc in contentDocs:
          htmlContent = contentDoc.prettify().encode('UTF-8')
          content = htmlContent2String(htmlContent)

      exportToMarkdown(output, postDate, categories, title, content)

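As in the earlier steps, we visit the individual article page to get its HTML and parse out the title, category, post date and body. These are passed to the function exportToMarkdown, which generates the corresponding Markdown file. One thing worth mentioning: the article HTML contains some special tags and escape sequences that need special handling, and that handling is done in the function htmlContent2String. For now the export is essentially plain text; images, URL links and tables are not fully handled yet, and I will try to improve these conversions later.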
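
The retry-with-pause pattern at the top of download can be pulled out into a helper and exercised with a fake fetch function. This is a sketch, not the script's code: fetch_with_retry and its parameters are hypothetical, and the default of 3 retries is a guess, since the value of gRetryCount is not shown in this article.

```python
import time


def fetch_with_retry(fetch, url, max_retries=3, delay=2.0):
    """Call fetch(url) up to max_retries times; return None if all fail."""
    for attempt in range(1, max_retries + 1):
        try:
            time.sleep(delay)  # pause first: requests sent too fast go unanswered
            return fetch(url)
        except Exception as error:
            print("attempt %d for %s failed: %s" % (attempt, url, error))
    return None
```

Returning None on total failure mirrors how download checks data before parsing, so the caller can log the failure and move on to the next article.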

   def htmlContent2String(contentStr):
      patternImg = re.compile(r'(<img.+?src=")(.+?)(".+ />)')
      patternHref = re.compile(r'(<a.+?href=")(.+?)(".+?>)(.+?)(</a>)')
      patternRemoveHtml = re.compile(r'</?[^>]+>')

      resultContent = patternImg.sub(r'![image_mark](\2)', contentStr)
      resultContent = patternHref.sub(r'[\4](\2)', resultContent)
      resultContent = re.sub(patternRemoveHtml, r'', resultContent)
      resultContent = decodeHtmlSpecialCharacter(resultContent)
      return resultContent

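For now this rewrites <img> and <a> tags as Markdown image and link syntax, deletes every remaining HTML tag, and converts escaped characters in the function decodeHtmlSpecialCharacter.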
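
To watch the three patterns at work, they can be run on a hand-written fragment. The sketch below copies the regular expressions from htmlContent2String; everything else is an assumption made for illustration: html_to_text is an invented name, the standard library's unescape stands in for decodeHtmlSpecialCharacter (whose body is not shown in this article), and the sample HTML was written so that both tags carry an extra attribute after src/href, which the patterns as written require.

```python
import re

try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape

# Patterns copied from htmlContent2String above.
pattern_img = re.compile(r'(<img.+?src=")(.+?)(".+ />)')
pattern_href = re.compile(r'(<a.+?href=")(.+?)(".+?>)(.+?)(</a>)')
pattern_tag = re.compile(r'</?[^>]+>')


def html_to_text(html):
    text = pattern_img.sub(r'![image_mark](\2)', html)   # <img> -> ![..](src)
    text = pattern_href.sub(r'[\4](\2)', text)           # <a> -> [text](href)
    text = pattern_tag.sub('', text)                     # drop remaining tags
    return unescape(text)  # stand-in for decodeHtmlSpecialCharacter


sample = (
    '<p>Logo: <img alt="logo" src="http://example.com/logo.png" width="10" /></p>\n'
    '<p>See <a href="http://kesalin.github.io/" target="_blank">my blog</a> &amp; more</p>'
)
result = html_to_text(sample)
```

The order matters: images and links have to be rewritten before the catch-all tag pattern runs, otherwise their URLs would be deleted along with the tags.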

Generating the Markdown text file

   def exportToMarkdown(exportDir, postdate, categories, title, content):
      titleDate = postdate.strftime('%Y-%m-%d')
      contentDate = postdate.strftime('%Y-%m-%d %H:%M:%S %z')
      filename = titleDate + '-' + title
      filename = repalceInvalidCharInFilename(filename)
      filepath = exportDir + '/' + filename + '.markdown'
      log(" >> save as " + filename)

      newFile = open(unicode(filepath, "utf8"), 'w')
      newFile.write('---' + '\n')
      newFile.write('layout: post' + '\n')
      newFile.write('title: \"' + title + '\"\n')
      newFile.write('date: ' + contentDate + '\n')
      newFile.write('comments: true' + '\n')
      newFile.write('categories: [' + categories + ']' + '\n')
      newFile.write('tags: [' + categories + ']' + '\n')
      newFile.write('description: \"' + title + '\"\n')
      newFile.write('keywords: ' + categories + '\n')
      newFile.write('---' + '\n\n')
      newFile.write(content)
      newFile.write('\n')
      newFile.close()

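Generating the Markdown file is straightforward. Here I need the Markdown post format used by GitHub Pages, hence the content shown; you can change it to any other text format to suit your needs.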
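
If you want to adapt the output format, it helps to separate "build the header text" from "write the file". The sketch below shows one possible way to do that. It is not the script's code: build_front_matter and safe_filename are invented names, the header is shortened to a few fields, and the set of characters replaced in file names is an assumption, since the script's repalceInvalidCharInFilename is not shown in this article.

```python
import datetime
import re


def safe_filename(name):
    """Replace characters that Windows forbids in file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', "_", name)


def build_front_matter(postdate, categories, title):
    """Return the header block that precedes the post body."""
    lines = [
        "---",
        "layout: post",
        'title: "' + title + '"',
        "date: " + postdate.strftime("%Y-%m-%d %H:%M:%S"),
        "comments: true",
        "categories: [" + categories + "]",
        "---",
        "",
    ]
    return "\n".join(lines) + "\n"


post_date = datetime.datetime(2014, 10, 19, 7, 54)
front_matter = build_front_matter(post_date, "Python", "Hello")
file_name = safe_filename(post_date.strftime("%Y-%m-%d") + "-What? A/B") + ".markdown"
```

Building the header as a string first makes it easy to print and inspect before any file is written.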

posted @ 2014-10-19 07:54 飄飄白云