關(guān)于 HashCode做key的可能性
最近在設(shè)計(jì)一個(gè)分布式的key-value系統(tǒng)的時(shí)候中,出于性能和存儲(chǔ)空間的考慮,準(zhǔn)備把string類型的key替換為它的HashCode值.
GetHashCode這個(gè)方法可能很多人都有所了解,不熟悉的可以看看這里:http://msdn.microsoft.com/zh-cn/library/system.object.gethashcode.aspx
以下信息只限于String.GetHashCode,其他的例如Object.GetHashCode根據(jù)其他對(duì)象的實(shí)現(xiàn)不同而不同:
1.對(duì)于不同的對(duì)象類型,或者同類型的不同值,返回值是可能重復(fù)的
2.String.GetHashCode的實(shí)現(xiàn)是平臺(tái)相關(guān)的,32位版本和64位版本并不一樣
3.String.GetHashCode的實(shí)現(xiàn)和.net版本有關(guān),將來(lái)可能還會(huì)改變
4.同樣的程序集,同樣的平臺(tái),同樣的字符串, 返回同樣的HashCode
5.基于默認(rèn)實(shí)現(xiàn),雖然對(duì)象是會(huì)重復(fù)的,但是由于hashcode是int32類型并且實(shí)現(xiàn)的比較良好,那么只有當(dāng)數(shù)據(jù)量達(dá)到或者接近2^32數(shù)量級(jí)的時(shí)候才需要考慮重復(fù)的情況
- 2^32次方大約在40多億
- 假設(shè)有1千萬(wàn)數(shù)據(jù),那么有一條或者多余一條數(shù)據(jù)重復(fù)的可能性約為400分之一 (這個(gè)概率對(duì)于一個(gè)要求較高的系統(tǒng)來(lái)說(shuō)太危險(xiǎn)了,而且現(xiàn)在很多系統(tǒng)的數(shù)據(jù)是遠(yuǎn)大于千萬(wàn)級(jí)別的)
6. .net4.0里面String.GetHashCode的實(shí)現(xiàn)代碼如下
// Gets a hash code for this string. If strings A and B are such that A.Equals(B), then
// they will return the same hash code.
[System.Security.SecuritySafeCritical] // auto-generated
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
public override int GetHashCode() {
unsafe {
fixed (char *src = this) {
Contract.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'");
Contract.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");
#if WIN32
int hash1 = (5381<<16) + 5381;
#else
int hash1 = 5381;
#endif
int hash2 = hash1;
#if WIN32
// 32bit machines.
int* pint = (int *)src;
int len = this.Length;
while(len > 0) {
hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
if( len <= 2) {
break;
}
hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
pint += 2;
len -= 4;
}
#else
int c;
char *s = src;
while ((c = s[0]) != 0) {
hash1 = ((hash1 << 5) + hash1) ^ c;
c = s[1];
if (c == 0)
break;
hash2 = ((hash2 << 5) + hash2) ^ c;
s += 2;
}
#endif
#if DEBUG
// We want to ensure we can change our hash function daily.
// This is perfectly fine as long as you don't persist the
// value from GetHashCode to disk or count on String A
// hashing before string B. Those are bugs in your code.
hash1 ^= ThisAssembly.DailyBuildNumber;
#endif
return hash1 + (hash2 * 1566083941);
}
}
}
考慮到GetHashCode依賴.net版本和平臺(tái),還有重復(fù)概率..還是放棄了用HashCode來(lái)代替Key的想法
最終選擇使用md5 (16 bytes)
posted on 2011-12-06 11:17 聽(tīng)說(shuō)讀寫 閱讀(3446) 評(píng)論(1) 收藏 舉報(bào)
浙公網(wǎng)安備 33010602011771號(hào)