上次我們分析了在附加屬性表這樣表結(jié)構(gòu)設(shè)計(jì)下的PostgreSQL查詢效率。由于PostgreSQL眾所周知的所謂“性能”問(wèn)題,所以有必要再用使用MyISAM引擎的MySQL再來(lái)實(shí)驗(yàn)一遍。在我們?cè)敿?xì)分析了兩種常見(jiàn)的開(kāi)源數(shù)據(jù)庫(kù)之后,話題將會(huì)進(jìn)一步引申到按行存儲(chǔ)的數(shù)據(jù)庫(kù)結(jié)構(gòu)以及索引對(duì)于查詢效率的影響。以下實(shí)驗(yàn)中的MySQL為MariaDB發(fā)行版本。還是從建表開(kāi)始:
MariaDB [veil]> show create table cc2;
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| cc2 | CREATE TABLE `cc2` (
`contact_id` int(11) NOT NULL,
`cad_id` int(11) NOT NULL,
`value` int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
MariaDB [veil]> select count(*) from cc2;
+----------+
| count(*) |
+----------+
| 5904385 |
+----------+
1 row in set (0.00 sec)
MySQL在表上緩存了count(*)的結(jié)果,所以查詢是不需要花費(fèi)時(shí)間的。再來(lái)創(chuàng)建必要的索引:
MariaDB [veil]> create index contact_id_idx on cc2(contact_id);
Query OK, 5904385 rows affected (3.11 sec)
Records: 5904385 Duplicates: 0 Warnings: 0
MariaDB [veil]> create index cad_id_value_idx on cc2(cad_id, value);
Query OK, 5904385 rows affected (8.17 sec)
Records: 5904385 Duplicates: 0 Warnings: 0
MariaDB [veil]> select count(*) from cc2 where cad_id = 101 and value = 5;
+----------+
| count(*) |
+----------+
| 998839 |
+----------+
1 row in set (0.18 sec)
我們可以看到,這速度那是剛剛的。比較PostgreSQL中的相同的查詢,速度要快上4倍。但是:
MariaDB [veil]> select count(contact_id) from cc2 where cad_id = 101 and value = 5;
+-------------------+
| count(contact_id) |
+-------------------+
| 998839 |
+-------------------+
1 row in set (0.41 sec)
這個(gè)查詢比前一個(gè)慢,是因?yàn)閏ount(*)是數(shù)返回行的rowid,而count(contact_id)是數(shù)真正的contact_id列,而這個(gè)列的值是不包含在cad_id_value_idx中的,如果創(chuàng)建更多的索引的話,速度就要更快一些了:
MariaDB [veil]> create index cad_id_value_contact_id_idx on cc2(cad_id, value, contact_id);
Query OK, 5904385 rows affected (13.37 sec)
Records: 5904385 Duplicates: 0 Warnings: 0
MariaDB [veil]> select count(contact_id) from cc2 where cad_id = 101 and value = 5;
+-------------------+
| count(contact_id) |
+-------------------+
| 998839 |
+-------------------+
1 row in set (0.21 sec)
這個(gè)在MySQL中被稱作covering index。大概PostgreSQL 9.2還沒(méi)發(fā)布的index only query也是這個(gè)意思吧,我猜的。
現(xiàn)在,讓我們來(lái)看看MySQL是否能夠在兩個(gè)條件的情況下表現(xiàn)得比PostgreSQL強(qiáng)。首先嘗試INTERSECT吧:
MariaDB [veil]> select count(*) from (select contact_id from cc as a1 where a1.cad_id = 101 and a1.value = 5 intersect select contact_id from cc as a2 where a2.cad_id = 102 and a2.value = 7) as temp;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'intersect select contact_id from cc as a2 where a2.cad_id = 102 and a2.value = 7' at line 1
我靠,MySQL居然不支持INTERSECT。由于我們知道IN + SUBQUERY肯定是更慢的,所以就只剩INNER JOIN這一種寫(xiě)法了。
MariaDB [veil]> select count(*) from cc as a1, cc as a2 where a1.contact_id = a2.contact_id and a1.cad_id = 101 and a1.value = 5 and a2.cad_id = 102 and a2.value = 7;
+----------+
| count(*) |
+----------+
| 164788 |
+----------+
1 row in set (6.56 sec)
MariaDB [veil]> select count(a1.contact_id) from cc as a1, cc as a2 where a1.contact_id = a2.contact_id and a1.cad_id = 101 and a1.value = 5 and a2.cad_id = 102 and a2.value = 7;
+----------------------+
| count(a1.contact_id) |
+----------------------+
| 164788 |
+----------------------+
1 row in set (6.67 sec)
你沒(méi)看錯(cuò),這結(jié)果就是這么慘不忍睹。它甚至比PostgreSQL用IN + SUBQUERY實(shí)現(xiàn)得還要慢。看看到底是啥狀況吧:
MariaDB [veil]> explain extended select count(a1.contact_id) from cc as a1, cc as a2 where a1.contact_id = a2.contact_id and a1.cad_id = 101 and a1.value = 5 and a2.cad_id = 102 and a2.value = 7;
+----+-------------+-------+------+------------------------------------+-------------------------+---------+--------------------------------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+------------------------------------+-------------------------+---------+--------------------------------+--------+----------+--------------------------+
| 1 | SIMPLE | a1 | ref | contact_id,cad_id_value_contact_id | cad_id_value_contact_id | 10 | const,const | 808542 | 100.00 | Using where; Using index |
| 1 | SIMPLE | a2 | ref | contact_id,cad_id_value_contact_id | cad_id_value_contact_id | 15 | const,const,veil.a1.contact_id | 1 | 100.00 | Using index |
+----+-------------+-------+------+------------------------------------+-------------------------+---------+--------------------------------+--------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)
雖然不像PostgreSQL的分析結(jié)果那般詳細(xì)。但是至少還是可以看出,索引確實(shí)是被利用上了。所以我感覺(jué),MySQL真的不擅長(zhǎng)復(fù)雜的join。這就是問(wèn)題了,join處理不好,多條件的情況就沒(méi)法支持好了。
對(duì)于我們的第一次嘗試,可以總結(jié)出以下幾點(diǎn):
- 基于磁盤(pán)的方案行不通,必須放在內(nèi)存中
- 無(wú)條件:PostgreSQL慢,MySQL慢
- 一個(gè)條件:PostgreSQL快,MySQL更快(拜covering index所賜)
- 兩個(gè)條件:PostgreSQL慢,MySQL更慢。INNER JOIN稍慢于INTERSECT。
- 三個(gè)條件:PostgreSQL快于MySQL,而且INNER JOIN要快于INTERSECT。
總體來(lái)說(shuō),要把請(qǐng)求響應(yīng)時(shí)間控制在一秒之內(nèi)那難度是相當(dāng)?shù)拇蟀 K越K悸繁仨氉兓?。那我們?cè)賮?lái)看看把行變列會(huì)不會(huì)好一些吧:
MariaDB [veil]> show create table spike;
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| spike | CREATE TABLE `spike` (
`contact_id` int(11) NOT NULL,
`a1` int(11) DEFAULT NULL,
`a2` int(11) DEFAULT NULL,
`a3` int(11) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
MariaDB [veil]> select * from spike limit 10;
+------------+------+------+------+
| contact_id | a1 | a2 | a3 |
+------------+------+------+------+
| 800001 | 4 | NULL | NULL |
| 800003 | 5 | 7 | 1 |
| 800004 | 3 | NULL | NULL |
| 800005 | 3 | NULL | NULL |
| 800006 | 5 | NULL | NULL |
| 800007 | NULL | 6 | NULL |
| 800008 | 5 | NULL | NULL |
| 800009 | 4 | 8 | NULL |
| 800011 | NULL | 7 | NULL |
| 800012 | 3 | 6 | 1 |
+------------+------+------+------+
10 rows in set (0.01 sec)
列a1來(lái)自于cad_id=101的值,a2來(lái)自于cad_id=102的值,a3來(lái)自于cad_id=6的值。
首先試試一個(gè)條件的性能吧:
MariaDB [veil]> select count(*) from spike where a1 = 5;
+----------+
| count(*) |
+----------+
| 998839 |
+----------+
1 row in set (0.26 sec)
速度不錯(cuò),即便我們還沒(méi)給a1加索引呢。這是因?yàn)镻ostgreSQL和MySQL這樣的按行存儲(chǔ)的數(shù)據(jù)庫(kù),在需要做sequential scan的時(shí)候,性能很大程度上決定于行的大小。如果我們有100行,這性能就肯定不會(huì)這么好了。不信的話可以試試:
MariaDB [veil]> alter table spike add column dummy1 TEXT;
Query OK, 4074980 rows affected (1.28 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> select count(*) from spike where a1 = 5;
+----------+
| count(*) |
+----------+
| 998839 |
+----------+
1 row in set (0.64 sec)
不出意料,僅僅只是添加了一個(gè)TEXT字段,查詢就慢了這么多。為了克服按行存儲(chǔ)的查詢速度隨行的列數(shù)增加和越來(lái)越慢的問(wèn)題,索引變得特別必要:
MariaDB [veil]> create index a1_idx on spike(a1);
Query OK, 4074980 rows affected (4.97 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> select count(*) from spike where a1 = 5;
+----------+
| count(*) |
+----------+
| 998839 |
+----------+
1 row in set (0.22 sec)
我們可以看到,查詢時(shí)間又變得正常了。因?yàn)閍1_idx是一個(gè)covering index,所以count無(wú)需真的讀取行,只需要查詢索引就可以知道有多少行了。而且MYISAM又不像PostgreSQL那樣,因?yàn)镸VCC還需要對(duì)原始行更新hint bit,所以整個(gè)查詢就和行的大小一點(diǎn)關(guān)系都沒(méi)有了。
MariaDB [veil]> alter table spike add column dummy2 TEXT;
Query OK, 4074980 rows affected (5.00 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> alter table spike add column dummy3 TEXT;
Query OK, 4074980 rows affected (5.30 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> select count(*) from spike where a1 = 5;
+----------+
| count(*) |
+----------+
| 998839 |
+----------+
1 row in set (0.23 sec)
對(duì)吧,沒(méi)騙你吧。
那么兩個(gè)條件如何?先把表結(jié)構(gòu)回到正常的狀態(tài),把索引都去掉:
MariaDB [veil]> alter table spike drop column dummy1;
Query OK, 4074980 rows affected (5.19 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> alter table spike drop column dummy2;
Query OK, 4074980 rows affected (4.98 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> alter table spike drop column dummy3;
Query OK, 4074980 rows affected (3.41 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> alter table spike drop index a1_idx;
Query OK, 4074980 rows affected (0.46 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> show create table spike;
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| spike | CREATE TABLE `spike` (
`contact_id` int(11) NOT NULL,
`a1` int(11) DEFAULT NULL,
`a2` int(11) DEFAULT NULL,
`a3` int(11) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
接下來(lái)用兩個(gè)條件做查詢:
MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7;
+----------+
| count(*) |
+----------+
| 164788 |
+----------+
1 row in set (0.30 sec)
速度不錯(cuò)嘛!給a1創(chuàng)建一個(gè)索引,應(yīng)該速度會(huì)更快吧:
MariaDB [veil]> create index a1_idx on spike(a1);
Query OK, 4074980 rows affected (3.06 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7;
+----------+
| count(*) |
+----------+
| 164788 |
+----------+
1 row in set (0.45 sec)
尼瑪居然更慢了……原因不是索引沒(méi)有被使用,而是因?yàn)橛昧怂饕愿?/p>
MariaDB [veil]> explain select count(*) from spike where a1 = 5 and a2 = 7;
+----+-------------+-------+------+---------------+--------+---------+-------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+--------+---------+-------+---------+-------------+
| 1 | SIMPLE | spike | ref | a1_idx | a1_idx | 5 | const | 1420338 | Using where |
+----+-------------+-------+------+---------------+--------+---------+-------+---------+-------------+
1 row in set (0.01 sec)
a1索引確實(shí)被使用了,但是a2并沒(méi)有在同一個(gè)索引之中。所以這不是一個(gè)covering index,查詢就不能僅僅在索引中查詢了,還要回到原始的行存儲(chǔ)中去檢查a2=7。
MariaDB [veil]> create index a2_idx on spike(a2);
Query OK, 4074980 rows affected (5.71 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7;
+----------+
| count(*) |
+----------+
| 164788 |
+----------+
1 row in set (0.40 sec)
給a2_idx創(chuàng)建了一個(gè)單獨(dú)的索引也于事無(wú)補(bǔ),因?yàn)樵趻呙鑑1_idx的時(shí)候,還是沒(méi)有a2。查詢?nèi)匀恍枰氐皆嫉男写鎯?chǔ)中去獲得a2。
MariaDB [veil]> create index a1_a2_idx on spike(a1, a2);
Query OK, 4074980 rows affected (9.47 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7;
+----------+
| count(*) |
+----------+
| 164788 |
+----------+
1 row in set (0.06 sec)
現(xiàn)在,我們才看到什么叫做速度!它是完全基于索引的。
MariaDB [veil]> alter table spike add column dummy1 TEXT;
Query OK, 4074980 rows affected (12.33 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> alter table spike add column dummy2 TEXT;
Query OK, 4074980 rows affected (12.87 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> alter table spike add column dummy3 TEXT;
Query OK, 4074980 rows affected (13.15 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7;
+----------+
| count(*) |
+----------+
| 164788 |
+----------+
1 row in set (0.06 sec)
把行的大小增大也沒(méi)有關(guān)系,因?yàn)閴焊筒粫?huì)去讀真正的行。但是一旦把covering index移除,情況就回到原來(lái)那個(gè)德行了:
MariaDB [veil]> alter table spike drop index a1_a2_idx;
Query OK, 4074980 rows affected (8.74 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7;
+----------+
| count(*) |
+----------+
| 164788 |
+----------+
1 row in set (0.62 sec)
MariaDB [veil]> alter table spike drop index a1_idx;
Query OK, 4074980 rows affected (5.31 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> alter table spike drop index a2_idx;
Query OK, 4074980 rows affected (1.93 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7;
+----------+
| count(*) |
+----------+
| 164788 |
+----------+
1 row in set (0.72 sec)
所以嘛,wide table的表設(shè)計(jì)確實(shí)會(huì)是一個(gè)問(wèn)題。一旦索引罩不住了,情況就糟糕了。
如果covering index這么棒的話,那為何不給所有的查詢都建立covering index呢?這樣問(wèn)題可不就解決了么。
真是牛逼的想法,試試吧:
MariaDB [veil]> create index a1_a2_a3_idx on spike(a1, a2, a3);
Query OK, 4074980 rows affected (7.03 sec)
Records: 4074980 Duplicates: 0 Warnings: 0
MariaDB [veil]> select count(*) from spike where a1 = 5 and a2 = 7 and a3 = 1;
+----------+
| count(*) |
+----------+
| 6748 |
+----------+
1 row in set (0.01 sec)
真的耶!covering index又立功了。似乎一旦建立了這樣的一個(gè)索引,所有的查詢問(wèn)題都解決了。
MariaDB [veil]> select count(*) from spike where a1 = 5;
+----------+
| count(*) |
+----------+
| 998839 |
+----------+
1 row in set (0.28 sec)
看見(jiàn)沒(méi),它確實(shí)管用!
MariaDB [veil]> select count(*) from spike where a3 = 1;
+----------+
| count(*) |
+----------+
| 207228 |
+----------+
1 row in set (1.06 sec)
我靠,高興太早了。栽了。原因是索引的順序是a1,a2,a3,對(duì)于a1的查詢因?yàn)楹退饕⒌捻樞蛳喾?,所以用的上。類似的a1,a2的查詢也能用得上。但是a3和建立順序不符,就無(wú)法利用上。如果要真正的達(dá)到覆蓋所有的情況,那么將是一個(gè)排列組合的結(jié)果。對(duì)于十幾列的表來(lái)說(shuō),這個(gè)數(shù)字會(huì)大得讓系統(tǒng)無(wú)法承受。
總結(jié)起來(lái)就是,無(wú)論是磁盤(pán)還是內(nèi)存,無(wú)論是附加的屬性表,還是把屬性建模成列,無(wú)論是PostgreSQL還是MySQL都沒(méi)有辦法提供Responsive Query。傳統(tǒng)的按行存儲(chǔ)的數(shù)據(jù)庫(kù)無(wú)法滿足多維數(shù)據(jù)的高速查詢需求。
浙公網(wǎng)安備 33010602011771號(hào)