Python Tips 26 [str + unicode + codecs]

I. Strings in Python 2.6

1) String types and their relationship (in 2.x, the default string type is str)

2) The Python built-in functions documentation describes basestring, str, and unicode as follows

basestring()
This abstract type is the superclass for str and unicode. It cannot be called or instantiated, but it can be used to test whether an object is an instance of str or unicode. isinstance(obj, basestring) is equivalent to isinstance(obj, (str, unicode)).

str([object])
Return a string containing a nicely printable representation of an object. For strings, this returns the string itself. The difference with repr(object) is that str(object) does not always attempt to return a string that is acceptable to eval(); its goal is to return a printable string. If no argument is given, returns the empty string, ''.

unicode([object[, encoding[, errors]]])
Return the Unicode string version of object using one of the following modes:

If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised. Error handling is done according to errors; this specifies the treatment of characters which are invalid in the input encoding. If errors is 'strict' (the default), a ValueError is raised on errors, while a value of 'ignore' causes errors to be silently ignored, and a value of 'replace' causes the official Unicode replacement character, U+FFFD, to be used to replace input characters which cannot be decoded. See also the codecs module.

If no optional parameters are given, unicode() will mimic the behaviour of str() except that it returns Unicode strings instead of 8-bit strings. More precisely, if object is a Unicode string or subclass it will return that Unicode string without any additional decoding applied.

For objects which provide a __unicode__() method, it will call this method without arguments to create a Unicode string. For all other objects, the 8-bit string version or representation is requested and then converted to a Unicode string using the codec for the default encoding in 'strict' mode.
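The error-handling modes described above can be tried with a short sketch. It uses the bytes .decode(encoding, errors) method rather than unicode() itself, on the assumption that the two behave the same for 8-bit input, so that the snippet also runs on Python 3; the sample bytes and variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

raw = b'AB\xffCD'  # 0xff can never appear in valid UTF-8

# errors='strict' (the default) raises UnicodeDecodeError,
# which is a subclass of ValueError.
try:
    raw.decode('utf-8', 'strict')
    strict_failed = False
except ValueError:
    strict_failed = True

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode('utf-8', 'ignore')

# 'replace' substitutes U+FFFD for the undecodable byte.
replaced = raw.decode('utf-8', 'replace')

# An unknown encoding name raises LookupError.
try:
    raw.decode('no-such-encoding')
    lookup_failed = False
except LookupError:
    lookup_failed = True

print(repr(ignored), repr(replaced), strict_failed, lookup_failed)
```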

II. print

Help for the print function in 2.6 (the print() function is essentially equivalent to the print '' statement):

print([object, ...][, sep=' '][, end='\n'][, file=sys.stdout])
Print object(s) to the stream file, separated by sep and followed by end. sep, end and file, if present, must be given as keyword arguments.

All non-keyword arguments are converted to strings like str() does and written to the stream, separated by sep and followed by end. Both sep and end must be strings; they can also be None, which means to use the default values. If no object is given, print() will just write end.

The file argument must be an object with a write(string) method; if it is not present or None, sys.stdout will be used.

Note
This function is not normally available as a builtin since the name print is recognized as the print statement. To disable the statement and use the print() function, use this future statement at the top of your module:

from __future__ import print_function
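As a small illustration of the sep, end, and file keywords described above, the sketch below prints into a hypothetical Collector class (not part of Python) whose only feature is the write(string) method that the file argument requires.

```python
from __future__ import print_function


class Collector(object):
    """Hypothetical minimal file-like object: only write() is needed."""

    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)


out = Collector()
# sep joins the objects, end is appended after the last one.
print('a', 'b', 'c', sep='-', end='!\n', file=out)
# With no objects, print() writes only end (the default newline).
print(file=out)
captured = ''.join(out.chunks)

defaults = Collector()
# None for sep and end means "use the default values".
print('x', 'y', sep=None, end=None, file=defaults)
captured_defaults = ''.join(defaults.chunks)
```

Passing file=sys.stdout (or omitting file) sends the same text to the console instead.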

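The print function supports both str and unicode. When given a unicode object, Python's print automatically encodes the text for the output stream, whereas a file object's write method performs no such conversion. For example, the following code mixes English and Chinese text yet prints correctly: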

def TestPrint():
  print 'AAA' + '中國' # AAA中國
  #print u'AAA' + u'中國' # SyntaxError: (unicode error) 'utf8' codec can't decode bytes in
  print u'AAA' + unicode('中國','gbk') # AAA中國
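Because write does no conversion for you, text has to be encoded explicitly before it goes into a byte stream. The following is a minimal sketch of that step, using in-memory io.BytesIO objects as stand-ins for real files; the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
import io

text = u'AAA\u4e2d\u56fd'  # u'AAA' followed by two Chinese characters

# The caller chooses the encoding; the stream just stores the bytes.
gbk_file = io.BytesIO()
gbk_file.write(text.encode('gbk'))

utf8_file = io.BytesIO()
utf8_file.write(text.encode('utf-8'))

gbk_bytes = gbk_file.getvalue()    # 3 ASCII bytes + 2 bytes per character
utf8_bytes = utf8_file.getvalue()  # 3 ASCII bytes + 3 bytes per character
```

The same text therefore ends up as different byte sequences depending on the encoding chosen at write time, which is why the reader of the file has to know that encoding.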

III. codecs

The decode(char_set) method converts text from another encoding to Unicode, and the encode(char_set) method converts Unicode to another encoding.
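A short sketch of this round trip, under the assumption that the input bytes are GBK: decode them to Unicode, re-encode as UTF-8, and see what happens when the wrong codec is used. Byte literals are written out explicitly so the result does not depend on how this file is saved.

```python
# -*- coding: utf-8 -*-
gbk_bytes = b'\xd6\xd0\xb9\xfa'  # two Chinese characters encoded as GBK

# other encoding -> Unicode
text = gbk_bytes.decode('gbk')

# Unicode -> other encoding
utf8_bytes = text.encode('utf-8')

# Decoding with the wrong codec fails: these bytes are not valid UTF-8.
try:
    gbk_bytes.decode('utf-8')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

# They are not ASCII either, which is the codec Python 2 falls back to
# when a str is mixed with a unicode object.
try:
    gbk_bytes.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```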

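To handle character encodings, the codecs module provides the lookup function, which takes the name of an encoding and returns references to the corresponding encoder and decoder functions and to the StreamReader and StreamWriter classes. To simplify calls to lookup, codecs also provides getencoder(encoding), getdecoder(encoding), getreader(encoding), and getwriter(encoding). Going one step further, to simplify access to the StreamReader, StreamWriter, and StreamReaderWriter for a particular encoding, codecs provides open directly: pass the encoding name through the encoding parameter and the returned file object both encodes on write and decodes on read.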
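The relationship between these helpers can be sketched as follows. The example assumes the gbk codec and uses in-memory io.BytesIO streams instead of real files; all variable names are illustrative.

```python
# -*- coding: utf-8 -*-
import codecs
import io

text = u'\u5317\u4eac'  # two Chinese characters

# lookup() returns an object bundling the encoder, decoder and stream classes.
info = codecs.lookup('gbk')
encoded, chars_consumed = info.encode(text)     # (bytes, characters consumed)
decoded, bytes_consumed = info.decode(encoded)  # (unicode, bytes consumed)

# getencoder()/getdecoder() are shortcuts to the same functions.
encode = codecs.getencoder('gbk')
decode = codecs.getdecoder('gbk')
same_encoded = encode(text)[0]
same_decoded = decode(encoded)[0]

# getwriter()/getreader() return classes that wrap a byte stream.
sink = io.BytesIO()
writer = codecs.getwriter('gbk')(sink)
writer.write(text)  # encoded to GBK on the way in

reader = codecs.getreader('gbk')(io.BytesIO(sink.getvalue()))
read_back = reader.read()  # decoded back to Unicode on the way out
```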

#-*- encoding: gb2312 -*-
import codecs, sys

# Use codecs.open to specify the encoding of the file being opened; the content is converted to unicode automatically as it is read
bfile = codecs.open("dddd.txt", 'r', "big5")
#bfile = open("dddd.txt", 'r')

ss = bfile.read()
bfile.close()
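# Print the converted result. With the built-in open function, read() returns the raw Big5 bytes instead, which typically print as garbage unless the console itself uses Big5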
print ss, type(ss)
The example above handles Big5; find a Big5-encoded file to try it with.
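If no Big5 file is at hand, one can be generated first. The sketch below writes a temporary Big5 file with codecs.open, reads it back decoded, and also reads the raw bytes for comparison; the file name and sample text are made up for illustration.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import shutil
import tempfile

text = u'\u4e2d\u6587'  # two Chinese characters
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'big5_sample.txt')  # hypothetical file name

try:
    # Writing through codecs.open encodes unicode to Big5 automatically.
    bfile = codecs.open(path, 'w', 'big5')
    bfile.write(text)
    bfile.close()

    # Reading through codecs.open decodes Big5 back to unicode.
    bfile = codecs.open(path, 'r', 'big5')
    decoded = bfile.read()
    bfile.close()

    # The built-in open in binary mode returns the undecoded bytes.
    rawfile = open(path, 'rb')
    raw = rawfile.read()
    rawfile.close()
finally:
    shutil.rmtree(workdir)
```

Here decoded equals the original text, while raw holds the four Big5 bytes that a plain read would hand back.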

IV. Examples

Code:

# -*- coding: utf-8 -*-

def TestisStrOrUnicdeOrString():
  s = 'abc'
  ustr = u'Hello'
  print isinstance(s, str)  #True
  print isinstance(s,unicode) #False
  print isinstance(ustr,str) #False
  print isinstance(ustr, unicode) #True
  print isinstance(s,basestring) #True
  print isinstance(ustr,basestring) #True

def TestChinese():
  # For the Chinese literals below, '# -*- coding: utf-8 -*-' must appear on the first or second line of this file.
  # NOTE: the errors recorded below suggest this file was actually saved as GBK
  # (0xd6 is the first byte of '中' in GBK), which is why decoding with 'gbk' works.
  s = '中國'
  # SyntaxError: (unicode error) 'utf8' codec can't decode bytes in position 0-1
  # (the bytes on disk are not valid UTF-8, contrary to the coding declaration)
  # us = u'中國'
  us2 = unicode('中國','gbk')

  print (s + ':' + str(type(s))) #中國:<type 'str'>
  # print us
  print (us2 + ':' + str(type(us2))) #中國:<type 'unicode'>

  # UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6
  #newstr = s + us2

  #UnicodeWarning: Unicode equal comparison failed to convert
  #both arguments to Unicode - interpreting them as being unequal
  #print 's == us2' + ':' + str(s == us2)

  s3 = 'AAA中國'
  print s3 # AAA中國

  s4 = unicode('AAA中國','gbk')
  print s4 # AAA中國

def TestPrint():
  print 'AAA' + '中國' # AAA中國

  #print u'AAA' + u'中國' # SyntaxError: (unicode error) 'utf8' codec can't decode bytes in
  print u'AAA' + unicode('中國','gbk') # AAA中國

def TestCodecs():
    import codecs

    look  = codecs.lookup("gbk")

    a = unicode("北京",'gbk')

    print len(a), a, type(a) #2 北京 <type 'unicode'>

    # look.encode returns a tuple: (encoded str, number of characters consumed)
    b = look.encode(a)
    print b[1], b[0], type(b[0]) #2 北京 <type 'str'>

if __name__ == '__main__':
    TestisStrOrUnicdeOrString()
    TestChinese()
    TestPrint()
    TestCodecs()

V. Summary

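1) If a Python file contains Chinese string literals, it must declare its encoding with # -*- coding: utf-8 -*- on the first or second line, which tells Python to parse the file as UTF-8; the file must also actually be saved in that encoding;

2) Use isinstance(obj, basestring), which is equivalent to isinstance(obj, (str, unicode)), to test whether an object is a string;

3) In the tests above, us = u'中國' failed and us2 = unicode('中國','gbk') was needed to decode the Chinese text into a correct unicode string. The error messages suggest the file was saved as GBK while declaring utf-8; when the declared and actual encodings match, u'中國' is valid;

4) A str containing non-ASCII bytes cannot be concatenated or compared with a unicode string, because Python 2 first tries to decode the str with the default ASCII codec. Pure-ASCII str values mix with unicode without error, as in us2 + ':' above;

5) print supports both str and unicode and outputs both correctly;

6) unicode.encode and str.decode convert between unicode and str; the encode and decode functions obtained from codecs can be used for the same purpose;

7) GBK decoding seemed to work properly only on a Chinese-language system or after installing the Chinese language pack. Since the gbk codec ships with Python itself, this most likely reflects whether the console can display the characters rather than the decoding step.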
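One way to avoid the mixing problems from point 4 is to decode every byte string as soon as it enters the program and work only with Unicode afterwards. The helper below, to_text, is hypothetical (it is not part of Python or of the original post) and assumes the incoming bytes are GBK.

```python
# -*- coding: utf-8 -*-


def to_text(value, encoding='gbk'):
    """Hypothetical helper: return value as a Unicode string.

    Byte strings are decoded with the given encoding (assumed GBK here);
    values that are already Unicode are returned unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = b'\xd6\xd0\xb9\xfa'  # GBK bytes for two Chinese characters
label = u'AAA'

# After normalizing, concatenation and comparison are both safe.
combined = to_text(label) + u':' + to_text(raw)
is_equal = to_text(raw) == u'\u4e2d\u56fd'
```

On Python 2.6 and later, bytes is an alias for str, so the same check covers the 8-bit strings discussed in this post.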

For character and encoding conversion in Python 3.1, see: http://www.rzrgm.cn/itech/archive/2011/03/28/1997878.html

References:

http://anchoretic.blog.sohu.com/82278076.html

http://www.javaeye.com/topic/560229

The end!

Posted 2011-03-27 22:00 by iTech
