
      String similarity in R: the stringdist package

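      String similarity can be computed with the adist function in the utils package, the stringDist function in the MKmisc package, or the distance functions in the RecordLinkage package, such as jarowinkler. This article covers the stringdist and stringdistmatrix functions in the stringdist package.
      The author of the stringdist package is Mark van der Loo.
      stringdist computes the distance between the strings in a and b element by element; when one argument has fewer elements than the other, it is recycled. stringdistmatrix returns a distance matrix whose rows correspond to the elements of a and whose columns correspond to the elements of b.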

      stringdist(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0, nthread = getOption("sd_num_thread"))

      stringdistmatrix(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0, useNames = c("none", "strings", "names"), ncores = 1, cluster = NULL, nthread = getOption("sd_num_thread"))
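      Parameters:
      a, b: the character vectors to compare
      method: the distance method; the default is "osa", and other options include "jaccard", "hamming", and "jw" (Jaro-Winkler).
      useBytes: compare byte by byte rather than character by character
      weight: weights for deletion (d), insertion (i), substitution (s), and transposition (t), in that order; each must be positive and no greater than 1
      maxDist: upper limit on the distance
      q: the q-gram size, used when method is 'qgram', 'jaccard', or 'cosine'; must be non-negative
      p: penalty factor for the Jaro-Winkler distance, between 0 and 0.25; the default 0 gives the Jaro distance
      nthread: maximum number of threads
      useNames: label the rows and columns of the output with the input strings or their names
      ncores: number of cores
      cluster: a user-supplied cluster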

      Examples:

      > stringdistmatrix(c("foo","bar","boo"),c("baz","buz"))
      [,1] [,2]
      [1,] 3 3
      [2,] 1 2
      [3,] 2 2

      > # string distance matching is case sensitive:
      > stringdist("ABC","abc")
      [1] 3
      >
      > # so you may want to normalize a bit:
      > stringdist(tolower("ABC"),"abc")
      [1] 0
      >
      > # stringdist recycles the shortest argument:
      > stringdist(c('a','b','c'),c('a','c'))
      Warning message: longer object length is not a multiple of shorter object length
      [1] 0 1 1
      >
      > # different edit operations may be weighted; e.g. weighted transposition:
      > stringdist('ab','ba',weight=c(1,1,1,0.5))
      [1] 0.5
      >
      > # Non-unit weights for insertion and deletion make the distance metric asymmetric
      > stringdist('ca','abc')
      [1] 3
      > stringdist('abc','ca')
      [1] 3
      > stringdist('ca','abc',weight=c(0.5,1,1,1))
      [1] 2
      > stringdist('abc','ca',weight=c(0.5,1,1,1))
      [1] 2.5

      > # q-grams are based on the difference between occurrences of q consecutive characters
      > # in string a and string b.
      > # Since each of the characters a, b and c occurs once in both 'abc' and 'cba', the q=1 distance equals 0:
      > stringdist('abc','cba',method='qgram',q=1)
      [1] 0
      >
      > # since the first string consists of 'ab','bc' and the second
      > # of 'cb' and 'ba', the q=2 distance equals 4 (they have no q=2 grams in common):
      > stringdist('abc','cba',method='qgram',q=2)
      [1] 4
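The q-gram computation described in the comments above can be sketched in base R. The helper names qgrams and qgram_dist are hypothetical, the sketch assumes q is at least 1, and it is only an illustration of the idea, not how the stringdist package implements it.

```r
# Split a single string into its overlapping q-grams (hypothetical helper).
qgrams <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1)
  substring(s, starts, starts + q - 1)
}

# q-gram distance: sum over all q-grams of the absolute difference
# between the number of occurrences in a and in b.
qgram_dist <- function(a, b, q = 1) {
  ga <- qgrams(a, q)
  gb <- qgrams(b, q)
  grams <- union(ga, gb)
  count_a <- vapply(grams, function(g) sum(ga == g), integer(1))
  count_b <- vapply(grams, function(g) sum(gb == g), integer(1))
  sum(abs(count_a - count_b))
}

qgram_dist("abc", "cba", q = 1)  # 0: same single characters
qgram_dist("abc", "cba", q = 2)  # 4: 'ab','bc' versus 'cb','ba'
```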

      > stringdist('MARTHA','MATHRA',method='jw')
      [1] 0.08333333
      > # Note that stringdist gives a _distance_ where wikipedia gives the corresponding
      > # _similarity measure_. To get the wikipedia result:
      > 1 - stringdist('MARTHA','MATHRA',method='jw')
      [1] 0.9166667
      >
      > # The corresponding Jaro-Winkler distance can be computed by setting p=0.1
      > stringdist('MARTHA','MATHRA',method='jw',p=0.1)
      [1] 0.06666667
      > # or, as a similarity measure
      > 1 - stringdist('MARTHA','MATHRA',method='jw',p=0.1)
      [1] 0.9333333
      >
      > # This gives distance 1 since Euler and Gauss translate to different soundex codes.
      > stringdist('Euler','Gauss',method='soundex')
      [1] 1
      > # Euler and Ellery translate to the same code and have distance 0
      > stringdist('Euler','Ellery',method='soundex')
      [1] 0
      >
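The p = 0.1 result above can be checked by hand. The stringdist documentation describes the Jaro-Winkler distance as d - l*p*d, where d is the Jaro distance and l is the length of the common prefix of the two strings, capped at four characters. The sketch below applies only that adjustment to the Jaro distance already reported above (1/12, printed as 0.08333333); it does not compute the Jaro distance itself, and the helper names are hypothetical.

```r
# Length of the common prefix of two single strings, capped at max_len.
common_prefix_len <- function(a, b, max_len = 4) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  n <- min(length(ca), length(cb), max_len)
  if (n == 0) return(0)
  mismatch <- which(ca[seq_len(n)] != cb[seq_len(n)])
  if (length(mismatch) == 0) n else mismatch[1] - 1
}

# Winkler adjustment of a given Jaro distance: d - l * p * d.
winkler_adjust <- function(jaro_dist, a, b, p = 0.1) {
  l <- common_prefix_len(a, b)
  jaro_dist * (1 - l * p)
}

common_prefix_len("MARTHA", "MATHRA")             # 2 ("MA")
winkler_adjust(1 / 12, "MARTHA", "MATHRA", p = 0.1)  # 0.06666667
```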
      ————————————————

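      The adist function computes the Levenshtein edit distance. This can be converted into a similarity measure, for example 1 - (Levenshtein edit distance / length of the longer string).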
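A minimal sketch of that conversion in base R, using adist from the utils package. The function name lev_sim is hypothetical; it returns a similarity matrix with one row per element of a and one column per element of b, and it yields NaN when both strings are empty.

```r
# Similarity = 1 - (Levenshtein distance / length of the longer string).
lev_sim <- function(a, b) {
  d <- adist(a, b)                            # length(a) x length(b) distance matrix
  longer <- outer(nchar(a), nchar(b), pmax)   # longer length for each pair
  1 - d / longer
}

lev_sim("apple", c("apple", "aaple", "appled"))  # 1 x 3 matrix: 1, 0.8, 0.8333333
```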

      The levenshteinSim function in the RecordLinkage package also does this directly, and may be faster than adist.

      > library(RecordLinkage)
      > levenshteinSim("apple", "apple")
      [1] 1
      > levenshteinSim("apple", "aaple")
      [1] 0.8
      > levenshteinSim("apple", "appled")
      [1] 0.8333333
      > levenshteinSim("appl", "apple")
      [1] 0.8
      

      ETA: Interestingly, although levenshteinDist from the RecordLinkage package appears to be slightly faster than adist, levenshteinSim is slower than either. Using the rbenchmark package:

      > benchmark(levenshteinDist("applesauce", "aaplesauce"), replications=100000)
                                               test replications elapsed relative
      1 levenshteinDist("applesauce", "aaplesauce")       100000   4.012        1
        user.self sys.self user.child sys.child
      1     3.583    0.452          0         0
      > benchmark(adist("applesauce", "aaplesauce"), replications=100000)
                                     test replications elapsed relative user.self
      1 adist("applesauce", "aaplesauce")       100000   4.277        1     3.707
        sys.self user.child sys.child
      1    0.461          0         0
      > benchmark(levenshteinSim("applesauce", "aaplesauce"), replications=100000)
                                              test replications elapsed relative
      1 levenshteinSim("applesauce", "aaplesauce")       100000   7.206        1
        user.self sys.self user.child sys.child
      1      6.49    0.743          0         0
      

      This overhead comes entirely from the code of levenshteinSim, which is just a wrapper around levenshteinDist:

      > levenshteinSim
      function (str1, str2) 
      {
          return(1 - (levenshteinDist(str1, str2)/pmax(nchar(str1), 
              nchar(str2))))
      }
      

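      FYI: if you always compare two single strings rather than vectors, you can write a version that uses max instead of pmax and cut the running time by about 22% (5.608 s versus 7.206 s elapsed in the benchmarks here):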

      mylevsim = function (str1, str2) 
      {
          return(1 - (levenshteinDist(str1, str2)/max(nchar(str1), 
              nchar(str2))))
      }
      > benchmark(mylevsim("applesauce", "aaplesauce"), replications=100000)
                                        test replications elapsed relative user.self
      1 mylevsim("applesauce", "aaplesauce")       100000   5.608        1     4.987
        sys.self user.child sys.child
      1    0.627          0         0
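The max-based version is only safe for single strings because max collapses all of its inputs to one number, while pmax works element by element. A small base R illustration (the variable names are hypothetical):

```r
len_a <- nchar(c("a", "abcd"))   # lengths 1 and 4
len_b <- nchar(c("abc", "ab"))   # lengths 3 and 2

pmax(len_a, len_b)  # 3 4: the longer length for each pair
max(len_a, len_b)   # 4: one overall maximum, wrong divisor for the first pair
```

With vector inputs, dividing by the max result would normalize every pair by the longest string overall rather than by the longer string of that pair.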
      

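      In short, there is almost no performance difference between adist and levenshteinDist, although the former is preferable if you do not want to add a package dependency. How you convert the distance into a similarity measure does have some effect on performance.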

      posted @ 2022-01-19 23:47  MRO物料采購服務