Pinyin input method development wraps up for now

Corpus: wiki_zh, 1.2 GB

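Dictionary: sysdic, 74,001 lines, of which about 17,000 are single characters taken from googlepinyin. Not every GB18030 character had a regular Unicode code point (older versions of the standard mapped a few to the Private Use Area). GB18030 uses a mixed one-, two- and four-byte encoding.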
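The mixed byte widths are easy to check with Python's built-in `gb18030` codec. A minimal sketch; the three sample characters are my own picks for illustration, not entries from sysdic:

```python
# Byte widths under GB18030: ASCII takes 1 byte, common Hanzi take 2 bytes,
# and supplementary-plane characters take 4 bytes.
samples = ["A", "中", "\U00020000"]  # U+20000 is a CJK Extension B character


def gb18030_width(ch: str) -> int:
    """Return how many bytes one character occupies in GB18030."""
    return len(ch.encode("gb18030"))


for ch in samples:
    raw = ch.encode("gb18030")
    # Round-trip to confirm the bytes decode back to the same character.
    assert raw.decode("gb18030") == ch
    print(f"U+{ord(ch):04X} -> {raw.hex()} ({gb18030_width(ch)} bytes)")
```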

Words: googlepinyin + sunpinyin, merged and deduplicated. Sample entry: 喜羊羊與灰太狼之兔年頂呱呱 74083 xi'yang'yang'yu'hui'tai'lang'zhi'tu'nian'ding'gua'gua
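Merging and deduplicating two word lists can be sketched roughly as below. The input format (`word syl syl ...`, one entry per line) and the function name are assumptions for illustration; the real googlepinyin and sunpinyin files carry extra columns that would need stripping first.

```python
def merge_word_lists(*sources):
    """Merge several word lists, keeping the first reading seen per word."""
    merged = {}
    for lines in sources:
        for line in lines:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            word, syllables = fields[0], fields[1:]
            # setdefault keeps the earlier source's reading on a duplicate.
            merged.setdefault(word, syllables)
    return merged


# Hypothetical sample data, not taken from the real dictionaries.
google_words = ["輸入 shu ru", "自然 zi ran"]
sun_words = ["自然 zi ran", "流暢 liu chang"]

merged = merge_word_lists(google_words, sun_words)
for word, syllables in merged.items():
    print(word, "'".join(syllables))
```

Because a dict preserves insertion order, entries from the first source come out first, which makes the result easy to diff against the original list.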

For mmseg segmentation, only the single characters were used.
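To see what a characters-only lexicon does to segmentation, here is a sketch of plain forward maximum matching. This is deliberately simpler than the full MMSEG algorithm (which compares multi-word chunks under several tie-breaking rules), and the function name and tiny lexicons are made up for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: always take the longest match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


chars_only = {"輸", "入", "自", "然"}
with_words = chars_only | {"輸入", "自然"}

print(forward_max_match("輸入自然", chars_only))  # one token per character
print(forward_max_match("輸入自然", with_words))  # two-character words win
```

With only single characters available, nothing longer can match, so the corpus is effectively split character by character and the n-gram counts are taken over characters.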

Language model: lm_sc.t3g, 118 MB

This is a 3-gram back-off model, using -log(pr)
1 items in 0-level
10876 items in 1-level
1945235 items in 2-level # ** 0.5 ≈ 1395
12444533 items in 3-level # ** (1/3) ≈ 232; can I claim it has 12 million words?
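Since the model stores -log(pr), probabilities multiply by adding costs. A generic back-off lookup can be sketched as follows; the tables and numbers are toy values, and this is not the on-disk layout of lm_sc.t3g.

```python
import math


def nlog(p):
    """Convert a probability to a cost, -log(p)."""
    return -math.log(p)


# Toy tables; every value here is hypothetical.
UNIGRAM = {"a": nlog(0.5), "b": nlog(0.3), "c": nlog(0.2)}
BIGRAM = {("a", "b"): nlog(0.6)}
TRIGRAM = {("a", "b", "c"): nlog(0.7)}
# Back-off weights are stored as costs too, so they simply add.
BACKOFF = {("a", "b"): nlog(0.4), ("b",): nlog(0.5)}


def trigram_cost(w1, w2, w3):
    """Cost of w3 given (w1, w2), backing off to shorter histories."""
    if (w1, w2, w3) in TRIGRAM:
        return TRIGRAM[(w1, w2, w3)]
    # Unseen trigram: pay the history's back-off weight, try the bigram.
    cost = BACKOFF.get((w1, w2), 0.0)
    if (w2, w3) in BIGRAM:
        return cost + BIGRAM[(w2, w3)]
    # Unseen bigram: pay another back-off weight, fall to the unigram.
    cost += BACKOFF.get((w2,), 0.0)
    return cost + UNIGRAM[w3]


print(trigram_cost("a", "b", "c"))  # seen trigram, direct lookup
print(trigram_cost("a", "b", "a"))  # backs off twice, down to the unigram
```

A missing back-off weight is treated as cost 0 (weight 1), which is a simplification chosen for this sketch.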

Training time: under half an hour.

Result: typing feels natural and fluent. Typing shuruzilanliuchang gives 輸入自然流暢 ("typing is natural and fluent") as the first candidate.

TODO:

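Adding words to userdict was too slow because the database kept being copied. Changed it so that I back up the file myself before adding words, and nothing is copied while adding. [job done]
Remove bogus readings such as 縣 being readable as xuan. Split sysdic into zi, zi.多音 (polyphonic characters) and words; concatenating them back together works. 533 polyphonic characters. The IME is on fire, practically anything you type is in there.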
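Dropping unwanted readings and then separating single-reading from polyphonic characters could look roughly like this. The `(character, reading)` pair format, the function name and the sample data are assumptions for illustration, not the actual sysdic layout.

```python
from collections import defaultdict


def split_by_reading_count(pairs):
    """Group readings per character, then split single vs. polyphonic."""
    readings = defaultdict(list)
    for char, reading in pairs:
        if reading not in readings[char]:
            readings[char].append(reading)
    single = {c: r[0] for c, r in readings.items() if len(r) == 1}
    multi = {c: r for c, r in readings.items() if len(r) > 1}
    return single, multi


# Hypothetical sample entries.
raw = [("縣", "xian"), ("縣", "xuan"), ("行", "xing"), ("行", "hang"), ("字", "zi")]
unwanted = {("縣", "xuan")}  # readings to delete before splitting

pairs = [p for p in raw if p not in unwanted]
single, multi = split_by_reading_count(pairs)
print(single)  # characters left with exactly one reading
print(multi)   # polyphonic characters
```

Filtering before the split means a character that loses its only extra reading lands in the single-reading file automatically.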

My extremely messy program:

#!/usr/bin/env python3

d = {}
def merge_all_w(f, s):
  d[f[0]] = ' '.join(f[1:])

def get_all_g_w(f, s):
  w = f[0]
  if len(w) <= 1: return
  f[1:] = f[3:]
  print(' '.join(f))

def get_all_s_w(f, s):
  w = f[0]
  if len(w) <= 1: return
  f[1:] = f[2:]
  print(' '.join(f))

d = {}
def get_g_zi (f, s):
  w = f[0]
  if len(w) > 1: return
  d.setdefault(w, []).append(f[3])

d = {}
def sort_g_by_freq (f, s): # sort by word frequency, descending
  w = f[0]
  if len(w) <= 1: return
  freq = int(float(f[1]))
  f[1:] = f[3:]
  d.setdefault(freq, []).append(' '.join(f))

wid = 16563
def get_g_23 (f, s): # two- and three-character words among the high-frequency words
  global wid
  xx = len(f[0])
  if xx != 2 and xx != 3: return
  f[1] = str(wid); wid += 1
  f[2:] = f[3:]
  print(' '.join(f))

def get_s_zi (f, s): # single characters
  if int(f[1]) >= 100 and len(f[0]) == 1:
    f[1:] = f[2:]; print(' '.join(f))

def all_minus_sys(f, s):
  if not f[0] in st: print(s)

def sys_dic_pie(f, s):
  if int(f[1]) > 100 and len(f[0]) > 1:
    print(f[0], f[1], "'".join(f[2:]))
  else: print(s)

wid = 58005
def usr_dic_pie(f, s):
  global wid
  print(f[0], wid, "'".join(f[1:]))
  wid += 1

def do_ (cb):
  try:
    while True: s = input(); cb(s.split(), s)
  except EOFError: pass
  except Exception as e : print('ERROR:', e)

do_(usr_dic_pie)

'''
do_(sys_dic_pie)
st = set()
for s in open('/t/sysdic', 'r'): st.add(s.split()[0])
do_(all_minus_sys)
do_(merge_all_w)
for k,v in d.items(): print(f'{k} {v}')
do_(get_all_s_w)
do_(get_all_g_w)
do_(get_g_23)
do_(get_g_zi)
n = 100
for k,v in d.items(): print(f'{k} {n}', ' '.join(v)); n += 1
#噷 16562 hm
do_(sort_g_by_freq)
for k in sorted(d.keys(), reverse=True):
  print('\n'.join(d[k]))

do_(get_s_zi)
'''
# grep -v '%' 多音字 (polyphonic characters)
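A tidier shape for the same callback-per-line idea, for comparison. This is a sketch rather than a drop-in replacement: `run` and `make_usr_dic_formatter` are hypothetical names, the driver reads from any iterable of lines instead of stdin, and a closure replaces the global `wid` counter.

```python
import io


def run(callback, stream):
    """Call callback(fields, line) for every non-empty line in stream."""
    for raw in stream:
        line = raw.rstrip("\n")
        fields = line.split()
        if fields:
            callback(fields, line)


def make_usr_dic_formatter(first_id, out):
    """Build a callback that emits: word id syl'syl'... with running ids."""
    next_id = first_id

    def fmt(fields, line):
        nonlocal next_id
        out.append(f"{fields[0]} {next_id} " + "'".join(fields[1:]))
        next_id += 1

    return fmt


out = []
sample = io.StringIO("輸入 shu ru\n\n流暢 liu chang\n")  # hypothetical input
run(make_usr_dic_formatter(58005, out), sample)
print("\n".join(out))
```

Passing `sys.stdin` as the stream gives the same pipe-friendly behaviour as the original `do_` loop, while a `StringIO` makes each callback easy to test in isolation.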


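In SConstruct, cflags = '-g -Wall' is passed as CFLAGS=cflags, alongside CXXFLAGS='-std=c++11', but .cpp files are compiled with CXXFLAGS, so the -g -Wall in CFLAGS never reached them. I hacked it into: env.MergeFlags(['-pipe -O -DHAVE_CONFIG_H',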

scons -c is like make clean.

SConstruct is just a Python program, so there is no need to learn autoconf and automake.
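For reference, SCons separates the flag variables by compiler: CCFLAGS goes to both the C and the C++ compiler, CFLAGS only to the C compiler, and CXXFLAGS only to the C++ compiler. A minimal SConstruct sketch along those lines; the target and source names are hypothetical, and it is meant to be run by `scons`, which provides `Environment`:

```python
# SConstruct: a minimal sketch, not the project's real build file.
# Flags shared by C and C++ go in CCFLAGS; C++-only flags go in CXXFLAGS.
env = Environment(
    CCFLAGS=['-g', '-Wall'],
    CXXFLAGS=['-std=c++11'],
)

# 'demo' and 'demo.cpp' are placeholder names.
env.Program(target='demo', source=['demo.cpp'])
```

With this layout, .cpp files get both the debugging and warning flags and the C++ standard flag.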

I had planned to get a machine with 16 GB of RAM; now there is no need.

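On some "Computer Market" board, people are still discussing how to attach an NVMe SSD through a PCIe adapter card. Consumption really has been downgraded.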

posted @ 2025-10-30 16:49 華容道專家


