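Assignment ①:

Scraping weather data

Requirements: scrape the 7-day weather forecast for a given set of cities from the China Weather website (http://www.weather.com.cn) and save it to a database.
Core code and results: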

import urllib.request

from bs4 import BeautifulSoup, UnicodeDammit


class WeatherForecast:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"
        }
        self.cityCode = {"福州": "101230101"}

    def forecastCity(self, city):
        url = "http://www.weather.com.cn/weather/" + self.cityCode[city] + ".shtml"
        req = urllib.request.Request(url, headers=self.headers)
        data = urllib.request.urlopen(req).read()
        data = UnicodeDammit(data, ["utf-8", "gbk"]).unicode_markup
        soup = BeautifulSoup(data, "lxml")
        ul = soup.find("ul", class_="t clearfix")
        lis = ul.find_all("li", recursive=False)
        for li in lis:
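            date = li.select('h1')[0].text
            weather = li.select('p[class="wea"]')[0].text
            # The page may omit today's daytime high (no span) later in the day,
            # so fall back to the low alone instead of raising IndexError
            high = li.select('p[class="tem"] span')
            low = li.select('p[class="tem"] i')[0].text
            temp = (high[0].text + "/" + low) if high else low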
            print(city, date, weather, temp)
            self.db.insert(city, date, weather, temp)

    def process(self):
        self.db = WeatherDB()
        self.db.openDB()
        self.forecastCity("福州")
        self.db.closeDB()

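The target URL is built from the city code, and urllib.request.Request constructs the request object. To get Fuzhou's forecast, the request has to carry a User-Agent header that mimics a browser; without it, the request fails. The request is then sent, the response bytes are read and decoded with UnicodeDammit, and the HTML is parsed with BeautifulSoup.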
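
The request-building step can be sketched without touching the network. This is only an illustration: `build_request` is a hypothetical helper that is not part of the original class, the User-Agent is shortened, and the URL pattern and city code are taken from the code above.

```python
import urllib.request

# Hypothetical helper for illustration; not part of the original class.
BASE_URL = "http://www.weather.com.cn/weather/{code}.shtml"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # shortened browser-like User-Agent


def build_request(city_code: str) -> urllib.request.Request:
    """Build (but do not send) the forecast request for one city code."""
    url = BASE_URL.format(code=city_code)
    return urllib.request.Request(url, headers=HEADERS)


req = build_request("101230101")  # 福州, as in cityCode above
print(req.full_url)
# urllib normalizes header names with str.capitalize(), hence "User-agent"
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would then send it, as in `forecastCity`.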

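As shown in the figure, I first locate the 7-day forecast list and then iterate over its direct child li nodes, one li per day. The daytime high is in the span and the night-time low is in the i; joining them as "high/low" completes the extraction of the forecast data.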
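
The code above hands each row to `WeatherDB`, whose definition is not shown. A minimal sketch of what such a class might look like, assuming SQLite via the standard `sqlite3` module; the table name, column names, and default file name are all assumptions, and only the method names `openDB`, `insert`, and `closeDB` come from the original code.

```python
import sqlite3


class WeatherDB:
    """Hypothetical stand-in for the WeatherDB used above (schema is assumed)."""

    def __init__(self, path="weathers.db"):
        self.path = path

    def openDB(self):
        self.con = sqlite3.connect(self.path)
        self.cursor = self.con.cursor()
        self.cursor.execute(
            """
            CREATE TABLE IF NOT EXISTS weathers (
                wCity    TEXT,
                wDate    TEXT,
                wWeather TEXT,
                wTemp    TEXT,
                PRIMARY KEY (wCity, wDate)
            )
            """
        )

    def insert(self, city, date, weather, temp):
        # INSERT OR REPLACE keeps reruns from failing on the primary key
        self.cursor.execute(
            "INSERT OR REPLACE INTO weathers (wCity, wDate, wWeather, wTemp) "
            "VALUES (?, ?, ?, ?)",
            (city, date, weather, temp),
        )

    def closeDB(self):
        self.con.commit()
        self.con.close()


db = WeatherDB(":memory:")  # in-memory database for the demo
db.openDB()
db.insert("福州", "3日（今天）", "多云", "25/18")  # made-up sample row
db.insert("福州", "3日（今天）", "晴", "26/18")    # same key: replaces the row above
stored = db.cursor.execute("SELECT * FROM weathers").fetchall()
print(stored)
db.closeDB()
```

Using (city, date) as the primary key is a design choice for this sketch: rerunning the scraper on the same day overwrites rows rather than duplicating them.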
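Reflections:
The key to this task is locating exactly the elements that hold the forecast and extracting the values from them accurately. The request header is also required; without it, the server refuses the request. Finally, encoding cannot be ignored: in this exercise, the UnicodeDammit class that ships with BeautifulSoup handled detection between gbk and utf-8 automatically, which worked well.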
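
To make the encoding point concrete, here is a simplified standard-library analogue of trying candidate encodings in order. It is not how UnicodeDammit works internally; `decode_html` is a hypothetical helper written only for illustration.

```python
def decode_html(raw: bytes, encodings=("utf-8", "gbk")) -> str:
    """Try each candidate encoding in order and return the first clean decode."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to the first encoding, replacing bad bytes
    return raw.decode(encodings[0], errors="replace")


from_gbk = decode_html("福州".encode("gbk"))     # not valid UTF-8, so gbk is used
from_utf8 = decode_html("福州".encode("utf-8"))  # decodes on the first attempt
print(from_gbk, from_utf8)
```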

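Assignment ②

Targeted stock information crawler

Requirements: use the requests and BeautifulSoup libraries to crawl stock information from a chosen site and store it in a database.
Core code and results: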

import json
import re
import sqlite3
import time

import requests

URL = "https://push2.eastmoney.com/api/qt/clist/get"
HEADERS = {"User-Agent": "Mozilla/5.0"}

# f12=code, f14=name, f2=latest price, f3=change (%), f4=change amount, f5=volume (lots), f6=turnover (CNY)
FIELDS = "f12,f14,f2,f3,f4,f5,f6"
FS_HS_A = "m:0+t:6,m:0+t:13"

if __name__ == "__main__":
    page_size = 50      # rows per page
    pages = 5           # number of pages to fetch
    rows = []

    for p in range(1, pages + 1):
        # Key query parameters:
        params = {
            "pn": p, "pz": page_size, "po": 1, "np": 1, "fltt": 2,
            "fid": "f3", "fs": FS_HS_A, "fields": FIELDS, "_": int(time.time() * 1000),
        }
        r = requests.get(URL, params=params, headers=HEADERS, timeout=10)
        text = r.text.strip()
        
        # Tolerate JSONP: strip a callback wrapper such as cb({...}); if present
        if not text.startswith("{"):
            m = re.search(r"\((\{.*\})\)\s*;?\s*$", text, re.S)
            text = m.group(1) if m else text

        data = json.loads(text)
        diff = (data.get("data") or {}).get("diff", []) or []

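        for d in diff:
            # Read values from the dict and convert types. A field can be
            # missing or non-numeric (e.g. "-"), so skip rows that fail to convert.
            code = d.get("f12"); name = d.get("f14")
            try:
                price = float(d.get("f2"))
                pct_chg = float(d.get("f3"))
                chg = float(d.get("f4"))
                vol = int(float(d.get("f5")))
                amount = float(d.get("f6"))
            except (TypeError, ValueError):
                continue

            if code:
                rows.append((code, name, price, pct_chg, chg, vol, amount))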

    con = sqlite3.connect("stocks_min.db")
    cur = con.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS stocks_min (
            code    TEXT PRIMARY KEY,
            name    TEXT,
            price   REAL,
            pct_chg REAL,
            chg     REAL,
            vol     INTEGER,
            amount  REAL
        );
    """)
    cur.executemany("""
        INSERT OR REPLACE INTO stocks_min
        (code, name, price, pct_chg, chg, vol, amount)
        VALUES (?, ?, ?, ?, ?, ?, ?);
    """, rows)
    con.commit()
    cur.execute("SELECT code,name,price,pct_chg,chg,vol,amount FROM stocks_min LIMIT 10;")
    for row in cur.fetchall():
        print(row)

    con.close()
    print(f"Shanghai/Shenzhen A-shares: stored {len(rows)} rows in stocks_min.db")

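Open the Eastmoney page https://quote.eastmoney.com/center/gridlist.html#hs_a_board,
right-click and choose Inspect, switch to the Network tab, and search for "api"; the request that carries the data shows up there.

Send the request with the query parameters. If the response is wrapped in a JSONP callback, a regular expression extracts the JSON. Then take the data.diff array from the JSON, convert each record, and append it to rows, ready to be written to the database.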
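
The unwrap-and-extract step can be tried offline. The sketch below uses made-up sample responses; `unwrap_jsonp` and `extract_diff` are hypothetical helper names, and real field values from the site will differ.

```python
import json
import re


def unwrap_jsonp(text: str) -> dict:
    """Parse plain JSON, or JSON wrapped in a JSONP callback like cb({...});"""
    text = text.strip()
    if not text.startswith("{"):
        m = re.search(r"\((\{.*\})\)\s*;?\s*$", text, re.S)
        if m:
            text = m.group(1)
    return json.loads(text)


def extract_diff(payload: dict) -> list:
    """Return data.diff, tolerating a missing or null "data" field."""
    return (payload.get("data") or {}).get("diff") or []


# Made-up sample responses for illustration only.
sample = 'jQuery123({"data": {"diff": [{"f12": "000001", "f14": "Sample Co", "f2": 10.5}]}});'
rows = extract_diff(unwrap_jsonp(sample))
empty = extract_diff(unwrap_jsonp('{"data": null}'))
print(rows[0]["f12"], rows[0]["f2"])
print(empty)
```

Returning an empty list for a null `data` field lets the paging loop continue past an empty page without special-casing it.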
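Reflections:
Scraping is not just about getting data, but about getting the data we actually want. Downloading everything in one go and filtering it afterwards is inefficient; it is better to work out what we really need and learn to fetch the JSON that the API returns. Knowing how to find that API is also key when dealing with the anti-scraping design of financial sites.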

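Assignment ③

Scraping the 2021 main ranking of Chinese universities

Requirements:
Scrape the information for all institutions in the 2021 main ranking of Chinese universities (https://www.shanghairanking.cn/rankings/bcur/2021).
Core code and results: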

import requests

API = "https://www.shanghairanking.cn/api/pub/v1/bcur"
PARAMS = {"bcur_type": 11, "year": 2021}
HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.shanghairanking.cn/rankings/bcur/2021",
}

def fetch_rankings():
    r = requests.get(API, params=PARAMS, headers=HEADERS, timeout=20)
    r.raise_for_status()
    data = r.json()
    # The data we need is in data.rankings
    lst = (data.get("data") or {}).get("rankings", [])
    rows = []
    for it in lst:
        rank = it.get("ranking") or it.get("rank")
        name = it.get("univNameCn") or it.get("univName")
        province = it.get("province")  # province or municipality
        utype = it.get("univCategory")  # category: comprehensive / science and engineering / ...
        score = it.get("score")
        if rank and name:
            rows.append((int(rank), str(name), str(province or ""), str(utype or ""), float(score or 0)))
    return rows

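We first use F12 to pin down where the data comes from: open the Network tab in the developer tools and find the API endpoint. It returns JSON, and the data.rankings key holds the ranking information for all universities.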

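First we send the request, remembering to include the request headers, otherwise access is refused. As the figure shows, the response is a complete JSON object; we pick out the fields we need, append them to rows, and finally store them in the database.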
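
The field selection and the final database write can be sketched offline. The payload below is made up and only mirrors the shape described above; `to_rows`, `save_rows`, and the `bcur_2021` table with its columns are hypothetical names, since the original does not show the storage code.

```python
import sqlite3


def to_rows(payload: dict) -> list:
    """Turn the JSON payload into (rank, name, province, category, score) tuples."""
    rows = []
    for it in (payload.get("data") or {}).get("rankings") or []:
        rank, name = it.get("ranking"), it.get("univNameCn")
        if rank is None or not name:
            continue  # skip incomplete records
        rows.append((
            int(rank),
            str(name),
            str(it.get("province") or ""),
            str(it.get("univCategory") or ""),
            float(it.get("score") or 0),
        ))
    return rows


def save_rows(con: sqlite3.Connection, rows: list) -> None:
    """Store rows in a hypothetical bcur_2021 table (schema assumed)."""
    con.execute(
        """
        CREATE TABLE IF NOT EXISTS bcur_2021 (
            ranking  INTEGER,
            name     TEXT PRIMARY KEY,
            province TEXT,
            category TEXT,
            score    REAL
        )
        """
    )
    con.executemany("INSERT OR REPLACE INTO bcur_2021 VALUES (?, ?, ?, ?, ?)", rows)
    con.commit()


# Made-up payload in the shape described above; not real ranking data.
sample = {"data": {"rankings": [
    {"ranking": 1, "univNameCn": "Sample University A", "province": "Beijing",
     "univCategory": "Comprehensive", "score": 900.5},
    {"ranking": 2, "univNameCn": None, "province": "Shanghai"},  # no name: skipped
]}}

rows = to_rows(sample)
con = sqlite3.connect(":memory:")
save_rows(con, rows)
count = con.execute("SELECT COUNT(*) FROM bcur_2021").fetchone()[0]
print(rows, count)
con.close()
```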
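Reflections:
This task gave me a deeper sense of how to find an API. For dynamically loaded pages, analysing the API is faster and more effective than parsing HTML. An API also tends to be more stable than HTML: class names in the HTML can easily change when the site is redesigned, which breaks the scraper. Finally, pay attention to the field names used in the site's JS; if they are wrong in our code, the corresponding data will not be scraped.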

posted on 2025-11-03 20:42 by 林焜
