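Fixing garbled Chinese text in Python
Encoding problems come up often when Python processes text.
The two most common encodings for Chinese characters are utf8 and gbk, so parsing a txt file or a byte string frequently fails because the wrong one was assumed.
For a line of text, we try decoding with utf8 and then with gbk. If either one decodes cleanly, we use it; if both fail, we pick whichever decoded more of the input before its first error.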
def force_decode(string: bytes) -> str:
    """
    Decode bytes as utf8 or gbk.

    Sometimes neither utf8 nor gbk can decode the input successfully.
    In that case, select the encoding with the longer decoded prefix.
    """
    if not isinstance(string, bytes):
        raise ValueError('expected bytes array')
    decode_chars_count = []
    for i in ['utf8', 'gbk']:
        try:
            return string.decode(i)
        except UnicodeDecodeError as ex:
            # ex.start is the offset of the first undecodable byte
            decode_chars_count.append(ex.start)
    # neither utf8 nor gbk decoded successfully;
    # select the one that decoded further
    utf8_len, gbk_len = decode_chars_count
    selected_encoding = 'utf8' if utf8_len > gbk_len else 'gbk'
    return string.decode(selected_encoding, errors='ignore')
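To see what ex.start measures in the fallback branch, here is a small sketch. The helper name first_error_offset and the sample bytes are made up for illustration; they are not part of the original gist.

```python
def first_error_offset(data: bytes, encoding: str) -> int:
    """Return the offset of the first undecodable byte, or len(data) if none."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as ex:
        return ex.start
    return len(data)


# Illustrative input: valid gbk text, then a stray 0xFF byte, then ASCII "!".
# The 0xFF byte makes the input invalid in both utf8 and gbk.
sample = "中文".encode("gbk") + b"\xff!"

utf8_offset = first_error_offset(sample, "utf8")
gbk_offset = first_error_offset(sample, "gbk")
print("utf8 fails at byte", utf8_offset)
print("gbk fails at byte", gbk_offset)

# The encoding that got further wins, and the bad byte is dropped.
best = "utf8" if utf8_offset > gbk_offset else "gbk"
recovered = sample.decode(best, errors="ignore")
print(best, recovered)
```

With this input gbk gets further than utf8, so the gbk result is kept. Note that errors='ignore' silently drops the undecodable bytes, so the result can be shorter than the original text.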
Code link: https://gist.github.com/albertofwb/b53bf32adca5c245c6dee6642ca5463d
