Python Tips 31 [unicode and bytes]

I. String types in Python 3

bytearray([source[, encoding[, errors]]]): Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256.

bytes([source[, encoding[, errors]]]): Return a new “bytes” object, which is an immutable sequence of integers in the range 0 <= x < 256. bytes is an immutable version of bytearray.
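A minimal sketch of the mutable/immutable difference described above, using only built-ins. The variable names are illustrative and not from the original post.

```python
# bytes is immutable; bytearray is its mutable counterpart.
data = bytes([72, 101, 108, 108, 111])   # same value as b'Hello'
buf = bytearray(data)                     # mutable copy of the same bytes

buf[0] = ord('J')                         # in-place change is allowed
print(buf)                                # bytearray(b'Jello')

try:
    data[0] = ord('J')                    # item assignment on bytes fails
    immutable_error = ''
except TypeError as exc:
    immutable_error = str(exc)
print(immutable_error)

# Indexing either type yields an int in range(256), not a one-byte string.
first = data[0]
print(first, type(first))                 # 72 <class 'int'>
```

Note that the original bytes object is left untouched; only the bytearray copy changes.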

str([object[, encoding[, errors]]]): Return a string version of an object. In Python 3, str is a Unicode string by default.
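As a rough illustration of the encoding and errors arguments in the signature above, the sketch below decodes bytes through str(). The sample byte strings are assumptions chosen for the example.

```python
raw = b'\xd6\xd0\xb9\xfa'                 # GBK bytes for '中国'

# str(obj, encoding) decodes bytes; it matches obj.decode(encoding).
text = str(raw, 'gbk')
print(text == raw.decode('gbk'))           # True

# Without an encoding, str() returns the literal-style text instead of decoding.
undecoded = str(raw)
print(undecoded)                           # b'\xd6\xd0\xb9\xfa'

# errors= controls what happens on undecodable input (the default is 'strict').
broken = b'AAA\xff'
replaced = str(broken, 'ascii', errors='replace')   # 'AAA' plus U+FFFD
ignored = str(broken, 'ascii', errors='ignore')     # 'AAA'
print(len(replaced), len(ignored))         # 4 3
```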

The basestring type from Python 2.x no longer exists in Python 3.
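One possible stand-in for the old isinstance(x, basestring) check is a tuple of types. The helper name is_string_like is hypothetical and only serves this example.

```python
# Hypothetical helper (not from the original post) standing in for the
# Python 2 idiom isinstance(value, basestring).
def is_string_like(value):
    return isinstance(value, (str, bytes))

print(is_string_like('abc'))     # True
print(is_string_like(b'abc'))    # True
print(is_string_like(42))        # False

# The name basestring itself is gone in Python 3.
try:
    basestring
    has_basestring = True
except NameError:
    has_basestring = False
print(has_basestring)            # False
```

Whether bytes should count as "string-like" depends on the calling code; drop it from the tuple if only text is acceptable.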

II. Examples

# -*- coding: gbk -*-

def TestisStrOrUnicdeOrString():
  bs = b'Hello'
  ustr = 'abc'
  print (isinstance(bs, str))  #False
  print (isinstance(bs,bytes)) #True
  print (isinstance(ustr,str)) #True
  print (isinstance(ustr, bytes)) #False
  print (isinstance(bs,(bytes,str))) #True

def TestChinese():
  us = '中国'
  bs = b'AAA'
  bs2 = bytes('中国','gbk')

  print (us + ':' + str(type(us))) #中国:<class 'str'>
  print (bs) #b'AAA'
  print (bs2) # b'\xd6\xd0\xb9\xfa'
  print (':' + str(type(bs2))) #:<class 'bytes'>
  print (bs2.decode('gbk')) #中国

  # TypeError: Can't convert 'bytes' object to str implicitly
  #newstr = us + bs2

  print ('us == bs2' + ':' + str(us == bs2)) #us == bs2:False

  s3 = 'AAA中国'
  print (s3) # AAA中国

  s4 = bytes('AAA中国','gbk')
  print (s4) # b'AAA\xd6\xd0\xb9\xfa'

def TestPrint():
  print ('AAA' + '中国')  # AAA中国
  #print (b'AAA' + b'中国') # SyntaxError: bytes can only contain ASCII literal characters.
  #print ('AAA' + bytes('中国','gbk')) # TypeError: Can't convert 'bytes' object to str implicitly

def TestCodecs():
    import codecs

    look  = codecs.lookup("gbk")

    a = bytes("北京",'gbk')

    print (len(a), a, type(a)) #4 b'\xb1\xb1\xbe\xa9' <class 'bytes'>

    b = look.decode(a)
    print (b[1], b[0], type(b[0])) #4 北京 <class 'str'>


if __name__ == '__main__':
    TestisStrOrUnicdeOrString()
    TestChinese()
    TestPrint()
    TestCodecs()

III. Summary

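1) Python 3 assumes that source code (the .py file) is encoded in UTF-8, whereas Python 2 defaults to ASCII. A declaration such as # -*- coding: windows-1252 -*- changes the encoding Python uses to read the file. The declaration has to match the encoding the file is actually saved in: if the editor saves a file containing Chinese strings as GBK, declare # -*- coding: gbk -*-; if the file is saved as UTF-8, the default works and no declaration is needed.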

2) In Python 3, str is Unicode by default; use str.encode to convert it to bytes.

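3) The print function in Python 3 does not decode bytes. It prints the str() form of its argument, which for bytes is the literal-style representation such as b'\xd6\xd0\xb9\xfa'. To get readable text, decode the bytes with the right encoding before printing.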

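4) str and bytes cannot be mixed: concatenating them raises TypeError, and an == comparison between them is always False.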

5) codecs can still be used to convert between str and bytes.

6) A bytes literal may contain only ASCII characters, so non-ASCII bytes have to be produced by encoding a str, for example bytes('中国','gbk').

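7) The gbk codec ships with Python, so encoding and decoding do not depend on the system language. Whether the decoded Chinese text displays correctly can still depend on the console's encoding and fonts, which may be why a Chinese system or language pack appeared to be required.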
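The points above can be sketched in a few lines. This is only an illustration under the assumption that the sample text is '中国' encoded as GBK, as in the examples; the variable names are made up for the sketch.

```python
import codecs

text = '中国'

# Point 2: str.encode() and bytes.decode() are inverses for a given codec.
gbk_bytes = text.encode('gbk')
utf8_bytes = text.encode('utf-8')
print(gbk_bytes)      # b'\xd6\xd0\xb9\xfa'
print(utf8_bytes)     # b'\xe4\xb8\xad\xe5\x9b\xbd'
round_trip = gbk_bytes.decode('gbk')

# Point 3: print() shows the str() form of bytes; nothing is decoded.
shown = str(gbk_bytes)

# Point 4: == between str and bytes is False; + raises TypeError.
equal = (text == gbk_bytes)
try:
    text + gbk_bytes
    concat_failed = False
except TypeError:
    concat_failed = True

# Point 5: the codecs module offers the same conversions as functions.
via_codecs = codecs.decode(codecs.encode(text, 'gbk'), 'gbk')

# Decoding with a codec other than the one used for encoding can fail.
try:
    gbk_bytes.decode('utf-8')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(equal, concat_failed, wrong_codec_failed)   # False True True
```

The last check is the usual source of garbled output: the bytes are fine, but they are read back with a different codec than the one that produced them.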

For character and encoding conversion in Python 2.6, see: http://www.rzrgm.cn/itech/archive/2011/03/27/1996883.html

The end!

posted @ 2011-03-28 17:38 iTech
