查詢增強插件pgfincore - 教程
查詢增強插件
postgresql使用的是雙緩存結構,其是數(shù)據(jù)讀取的時候,可以直接通過緩存命中返回數(shù)據(jù),用于提高查詢的效率。
在PG中常見的緩存管理插件有pgfincore和pg_premarm。安裝部署都比較簡單。
- pg_prewarm:用于在數(shù)據(jù)庫啟動后主動將表或索引的數(shù)據(jù)頁加載到 shared buffers 或 OS page cache,避免冷啟動時的大量磁盤 I/O。它支持多種加載方式:直接讀入 shared buffers(buffer 模式)、利用內(nèi)核讀緩存(prefetch 模式)、或者兩者結合。典型場景是數(shù)據(jù)庫重啟后提前“熱身”,讓熱點數(shù)據(jù)盡快進入緩存。
- pgfincore:偏重于 緩存狀態(tài)的觀察與控制。它可以查詢某個對象(表、索引)當前在 OS page cache 中的駐留情況,幫助 DBA 判斷哪些數(shù)據(jù)已經(jīng)在內(nèi)存,哪些需要預加載。同時它也提供強制加載或丟棄頁面的能力,更像一個運維工具,聚焦于 Linux 內(nèi)核 page cache 的可見性與操作。
總結:pg_prewarm 更強調(diào) 主動預熱,讓未來訪問更快;而 pgfincore 更強調(diào) 緩存狀態(tài)檢測與精細化控制。前者像“暖身器”,后者像“探測器+遙控器”。
pgfincore
pgfincore 是一個低層級的擴展,它直接利用 Linux 內(nèi)核的 fincore 與 posix_fadvise 系統(tǒng)調(diào)用。主要作用是查看文件塊是否在 OS 頁緩存 (page cache) 中,以及操作這些文件塊的緩存狀態(tài)。截止目前,官方介紹已經(jīng)支持postgresq-16
功能特點
1、檢查表/索引文件的 OS 層緩存命中情況
2、將文件預加載到 Linux 頁緩存 中
3、從 OS 頁緩存中 丟棄數(shù)據(jù),模擬冷緩存場景
使用限制
- PgFincore使用需要POSIX_FADVISE支持
--使用一下指令檢測posix_fadvise模塊的支持
man 2 posix_fadvise
- PostgreSQL >= 8.3
- 無法在windows 系統(tǒng)使用
使用案例:
wget https://github.com/klando/pgfincore/archive/refs/tags/1.3.1.tar.gz
tar -zxvf 1.3.1.tar.gz
cd pgfincore-1.3.1/
make clean
make
make install
psql -c "CREATE EXTENSION pgfincore; "
在創(chuàng)建該插件后會生成以下14個函數(shù)
pgfadvise(regclass, text, integer, OUT relpath text, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT os_pages_free bigint)
pgfadvise_dontneed(regclass, OUT relpath text, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT os_pages_free bigint)
pgfadvise_loader(regclass, integer, boolean, boolean, bit varying, OUT relpath text, OUT os_page_size bigint, OUT os_pages_free bigint, OUT pages_loaded bigint, OUT pages_unloaded bigint )
pgfadvise_loader(regclass, text, integer, boolean, boolean, bit varying, OUT relpath text, OUT os_page_size bigint, OUT os_pages_free bigint, OUT pages_loaded bigint, OUT pages_unloaded bigint )
pgfadvise_normal(regclass, OUT relpath text, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT os_pages_free bigint)
pgfadvise_random(regclass, OUT relpath text, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT os_pages_free bigint)
pgfadvise_sequential(regclass, OUT relpath text, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT os_pages_free bigint)
pgfadvise_willneed(regclass, OUT relpath text, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT os_pages_free bigint)
pgfincore(regclass, boolean, OUT relpath text, OUT segment integer, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT pages_mem bigint, OUT group_mem bigint, OUT os_pages_free bigint, OUT databit bit varying, OUT pages_dirty bigint, OUT group_dirty bigint)
pgfincore(regclass, OUT relpath text, OUT segment integer, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT pages_mem bigint, OUT group_mem bigint, OUT os_pages_free bigint, OUT databit bit varying, OUT pages_dirty bigint, OUT group_dirty bigint)
pgfincore(regclass, text, boolean, OUT relpath text, OUT segment integer, OUT os_page_size bigint, OUT rel_os_pages bigint, OUT pages_mem bigint, OUT group_mem bigint, OUT os_pages_free bigint, OUT databit bit varying, OUT pages_dirty bigint, OUT group_dirty bigint)
pgfincore_drawer(bit varying, OUT drawer cstring)
pgsysconf(OUT os_page_size bigint, OUT os_pages_free bigint, OUT os_total_pages bigint)
pgsysconf_pretty(OUT os_page_size text, OUT os_pages_free text, OUT os_total_pages text)
pgsysconf 查看OS 的頁面大小、可用頁數(shù)、總頁數(shù)。
--單位是頁數(shù)
pgfincore=# SELECT * FROM pgsysconf();
os_page_size | os_pages_free | os_total_pages
--------------+---------------+----------------
4096 | 1817160 | 2038679
(1 row)
-- os_page_size = 4096 → 操作系統(tǒng)頁面大小 = 4KB
-- os_pages_free = 1817160 → 空閑頁數(shù)
-- os_total_pages = 2038679 → 總頁數(shù)
pgfincore=# SELECT * FROM pgsysconf_pretty();
os_page_size | os_pages_free | os_total_pages
--------------+---------------+----------------
4096 bytes | 7098 MB | 7964 MB
(1 row)
/*
os_page_size
直接顯示為 4096 bytes
os_pages_free
計算:1817160 × 4096 / 1024^2
= 1817160 × 4 KB / 1 MB
= 1817160 × 4 / 1024 MB
≈ 7098 MB
os_total_pages
計算:2038679 × 4096 / 1024^2
= 2038679 × 4 / 1024 MB
≈ 7964 MB
*/
pgfincore=# \! free -m
total used free shared buff/cache available
Mem: 7963 491 7098 21 373 7216
Swap: 2047 0 2047
/*
pgsysconf_pretty().os_total_pages = 7964 MB ≈ free -m 中的 total=7963
pgsysconf_pretty().os_pages_free = 7098 MB = free -m 中的 free=7098
*/
--可以看出,查出來的緩沖大小并沒
探測某個表/索引的數(shù)據(jù)頁是否駐留在 OS Page Cache。
-- 查看一個表的緩存情況
postgres=# SELECT * FROM pgfincore('t_corrupt'::regclass);
-[ RECORD 1 ]-+-------------
relpath | base/5/66366
segment | 0
os_page_size | 4096
rel_os_pages | 49076
pages_mem | 49076
group_mem | 1
os_pages_free | 1773181
databit |
pages_dirty | 0
group_dirty | 0
/*
relpath (text):該 relation 在數(shù)據(jù)目錄下對應的文件路徑(相對路徑,如 base/5/66366)
segment (integer):文件段編號(PostgreSQL 的 relation 文件會被拆成多個 segment 文件,每個通常最大 1G,如base/5/66366.1 segment會被顯示為1并一次遞增)
os_page_size (bigint):OS 頁的大小
rel_os_pages (bigint):relation 對應這個 segment 的總 OS 頁數(shù)
pages_mem (bigint):當前在 OS 緩存中的頁數(shù)(即被緩存的頁數(shù))
group_mem (bigint):在 group / 一組之下的頁數(shù)(可能用于按某種分組匯總)
os_pages_free (bigint):操作系統(tǒng)當前可用空閑頁數(shù)(作為參考)
databit (bit varying):按頁給出一個 bitmap / bit 向量,標識各頁是否在緩存中1表示在緩存在,0表示該頁為緩存。
pages_dirty (bigint):臟頁數(shù)(可能在緩存中但尚未寫回磁盤的頁)
group_dirty (bigint):按分組統(tǒng)計的臟頁數(shù)
在某些版本中,你還可以指定是否返回 databit、是否只返回某個文件 fork(main / fsm / vm)等參數(shù)。
添加true可以返回databit
SELECT * FROM pgfincore('t_corrupt'::regclass,true);
*/
-- 當表數(shù)據(jù)超過1GB 時,其展示效果
CREATE UNLOGGED TABLE t_big (id serial, data text);
-- 2. 插入足夠多數(shù)據(jù)使文件超過 1GB
-- 為了快一些,我們用較大的 text
INSERT INTO t_big (data)
SELECT repeat('x', 8192) FROM generate_series(1, 300000);
-- 檢查表大小
postgres=# SELECT pg_size_pretty(pg_relation_size('t_big'));
pg_size_pretty
----------------
3402 MB
(1 row)
postgres=# \! ls -lh $PGDATA/base/5/90956*
-rw------- 1 postgres postgres 1.0G Oct 5 21:32 /home/postgres/pg/data/base/5/90956
-rw------- 1 postgres postgres 1.0G Oct 5 21:35 /home/postgres/pg/data/base/5/90956.1
-rw------- 1 postgres postgres 1.0G Oct 5 21:38 /home/postgres/pg/data/base/5/90956.2
-rw------- 1 postgres postgres 556M Oct 5 21:40 /home/postgres/pg/data/base/5/90956.3
-rw------- 1 postgres postgres 936K Oct 5 21:40 /home/postgres/pg/data/base/5/90956_fsm
-rw------- 1 postgres postgres 0 Oct 5 21:28 /home/postgres/pg/data/base/5/90956_init
-rw------- 1 postgres postgres 8.0K Oct 5 21:29 /home/postgres/pg/data/base/5/90956_vm
postgres=# SELECT * FROM pgfincore('t_big'::regclass);
relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty
----------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
base/5/90956 | 0 | 4096 | 262144 | 262144 | 1 | 33960 | | 0 | 0
base/5/90956.1 | 1 | 4096 | 262144 | 262144 | 1 | 33960 | | 0 | 0
base/5/90956.2 | 2 | 4096 | 262144 | 262144 | 1 | 33960 | | 0 | 0
base/5/90956.3 | 3 | 4096 | 95436 | 95436 | 1 | 33960 | | 0 | 0
(4 rows)
指定查看特定 fork(比如 ‘fsm’ 或 ‘vm’)
postgres=# SELECT * FROM pgfincore('t_big'::regclass, 'fsm', true);
-[ RECORD 1 ]-+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
relpath | base/5/90956_fsm
segment | 0
os_page_size | 4096
rel_os_pages | 244
pages_mem | 242
group_mem | 1
os_pages_free | 798929
databit | 0011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
pages_dirty | 0
group_dirty | 0
postgres=# SELECT * FROM pgfincore('t_big'::regclass, 'vm', true);
-[ RECORD 1 ]-+----------------
relpath | base/5/90956_vm
segment | 0
os_page_size | 4096
rel_os_pages | 8
pages_mem | 8
group_mem | 1
os_pages_free | 798960
databit | 11111111
pages_dirty | 0
group_dirty | 0
驅逐對應的緩存表
postgres=# SELECT * FROM pgfincore('t_big');
relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty
----------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
base/5/90956 | 0 | 4096 | 10346 | 10346 | 1 | 1790376 | | 0 | 0
base/5/90956.1 | 1 | 4096 | 0 | 0 | 0 | 1790376 | | 0 | 0
base/5/90956.2 | 2 | 4096 | 0 | 0 | 0 | 1790376 | | 0 | 0
base/5/90956.3 | 3 | 4096 | 0 | 0 | 0 | 1790376 | | 0 | 0
(4 rows)
postgres=# SELECT * FROM pgfadvise_dontneed('t_big');
relpath | os_page_size | rel_os_pages | os_pages_free
----------------+--------------+--------------+---------------
base/5/90956 | 4096 | 10346 | 1800976
base/5/90956.1 | 4096 | 0 | 1800976
base/5/90956.2 | 4096 | 0 | 1800976
base/5/90956.3 | 4096 | 0 | 1800976
(4 rows)
postgres=# SELECT * FROM pgfincore('t_big');
relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty
----------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
base/5/90956 | 0 | 4096 | 10346 | 0 | 0 | 1801033 | | 0 | 0
base/5/90956.1 | 1 | 4096 | 0 | 0 | 0 | 1801033 | | 0 | 0
base/5/90956.2 | 2 | 4096 | 0 | 0 | 0 | 1801033 | | 0 | 0
base/5/90956.3 | 3 | 4096 | 0 | 0 | 0 | 1801033 | | 0 | 0
(4 rows)
postgres=# \! free -h
total used free shared buff/cache available
Mem: 7.8G 510M 6.9G 17M 418M 7.0G
Swap: 2.0G 520K 2.0G
select os_pages_free * (os_page_size/1024::numeric(26,9)/1024::numeric(26,9)) from pgfincore('t_big'); --該公式算出來的值將等于 free 的剩余內(nèi)存,單位:MB。
加載表到緩存中
這里t_big表的relation 文件由于我再插入數(shù)據(jù)的時候,在未完成的情況下進行了終止操作,所以.1、.2、.3 衍生出來段的數(shù)據(jù)都是死元組,均不可見,也不可緩存,我也未做vacuum。在后續(xù)的SQL中我也做了WHERE databit IS NOT NULL;條件進行過濾。
-- 使用 pgfadvise_willneed,預加載 t_big 的頁到 OS 緩存
postgres=# \! free -h
total used free shared buff/cache available
Mem: 7.8G 509M 6.9G 17M 422M 7.0G
Swap: 2.0G 520K 2.0G
postgres=# SELECT * FROM pgfincore('t_big');
relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty
----------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
base/5/90956 | 0 | 4096 | 10346 | 0 | 0 | 1800031 | | 0 | 0
base/5/90956.1 | 1 | 4096 | 0 | 0 | 0 | 1800031 | | 0 | 0
base/5/90956.2 | 2 | 4096 | 0 | 0 | 0 | 1800031 | | 0 | 0
base/5/90956.3 | 3 | 4096 | 0 | 0 | 0 | 1800031 | | 0 | 0
(4 rows)
postgres=# SELECT * FROM pgfadvise_willneed('t_big');
relpath | os_page_size | rel_os_pages | os_pages_free
----------------+--------------+--------------+---------------
base/5/90956 | 4096 | 10346 | 1790049
base/5/90956.1 | 4096 | 0 | 1790049
base/5/90956.2 | 4096 | 0 | 1790049
base/5/90956.3 | 4096 | 0 | 1790049
(4 rows)
postgres=# SELECT * FROM pgfincore('t_big');
relpath | segment | os_page_size | rel_os_pages | pages_mem | group_mem | os_pages_free | databit | pages_dirty | group_dirty
----------------+---------+--------------+--------------+-----------+-----------+---------------+---------+-------------+-------------
base/5/90956 | 0 | 4096 | 10346 | 10346 | 1 | 1790135 | | 0 | 0
base/5/90956.1 | 1 | 4096 | 0 | 0 | 0 | 1790135 | | 0 | 0
base/5/90956.2 | 2 | 4096 | 0 | 0 | 0 | 1790135 | | 0 | 0
base/5/90956.3 | 3 | 4096 | 0 | 0 | 0 | 1790135 | | 0 | 0
(4 rows)
#### 加載卸載的精準控制pgfadvise_loader
SELECT * FROM pgfincore('t_big');
SELECT * FROM pgfadvise_loader('t_big', 0, false, true, B'1010100');
SELECT * FROM pgfincore('t_big');
SELECT * FROM pgfadvise_loader('t_big', 0, true, false, B'1010100');
regclass:目標表的 relation
integer:sement的編號
boolean:是否加載,如果true,讀取最后一個參數(shù)的1標識位。
boolean:是否卸載,如果true,讀取最后一個參數(shù)的0標識位。
bit varying:一個 bit vector(databit)表示哪些頁是目標操作對象。0表示不緩存,1表示緩存對應的頁,按照順序其0/1便是代表其所在segment的頁。
-- 當表數(shù)據(jù)超過1GB時
-- 卸載已經(jīng)加載的頁面
SELECT pgfadvise_loader('t_big', seg.segment, false, true, ~seg.databit)
FROM pgfincore('t_big', true) as seg
WHERE databit IS NOT NULL;
--查看卸載情況
select * FROM pgfincore('t_big', true);
~ 是 PostgreSQL 對 bit / bit varying 的按位取反操作:
-- 加載已卸載的頁面
SELECT pgfadvise_loader('t_big', seg.segment, true, false, ~seg.databit)
FROM pgfincore('t_big', true) as seg
WHERE databit IS NOT NULL;
--查看加載情況
select * FROM pgfincore('t_big', true);
-- 加載所有數(shù)據(jù)
SELECT pgfadvise_loader('t_big', seg.segment,true, false, repeat('1', seg.rel_os_pages::integer)::bit varying)
FROM pgfincore('t_big', true) as seg;
-- 或者使用以下方法加載所有數(shù)據(jù)
select pgfadvise_sequential('t_big');
IO規(guī)則預熱
SELECT * FROM pgfadvise_random(‘pg_class’::regclass);
提前調(diào)用 pgfadvise_sequential() 、pgfadvise_random 可以讓 OS 做 連續(xù)頁預讀或者隨機頁讀,此方法對于不同的硬盤優(yōu)勢情況,可以靈活選擇。
-- 創(chuàng)建索引
create index idx_t_big on t_big(id);
-- 卸載所有加載頁面
SELECT pgfadvise_loader('t_big', seg.segment, false, true, ~seg.databit)
FROM pgfincore('t_big', true) as seg
WHERE databit IS NOT NULL;
-- 使用全表順序掃描時
postgres-# select * from t_big;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on t_big (cost=0.00..8173.00 rows=300000 width=108) (actual time=0.341..63.657 rows=300000 loops=1)
Buffers: shared read=5173
Planning:
Buffers: shared hit=16 read=4 dirtied=1
Planning Time: 0.177 ms
Execution Time: 70.076 ms
(6 rows)
select pgfadvise_sequential('t_big');
postgres=# explain (analyze,buffers)
select * from t_big;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on t_big (cost=0.00..8173.00 rows=300000 width=108) (actual time=0.024..23.663 rows=300000 loops=1)
Buffers: shared read=5173
Planning Time: 0.091 ms
Execution Time: 29.922 ms
(4 rows)
-- 再次對比索引掃描時其影響效果。理論上來說,當你緩存索引時,其影響效果依然是比較積極的。個人實驗中頁印證了這一點,不再累述
select pgfadvise_sequential('idx_t_big');
explain (analyze,buffers)
select * from t_big where id <103;
pgfadvise_random 使用方法與pgfadvise_sequential一樣,不再累述。再次調(diào)用pgfadvise_normal,是其返回normal的形式。
pgfincore_drawer 函數(shù)
-- pgfincore_drawer
SELECT pgfincore_drawer(databit) AS drawer FROM pgfincore('t_big',true) where databit is not null ; -- 沒看出有啥作用
pgfadvise函數(shù)
pgfadvise函數(shù)和以上函數(shù)一樣,都是調(diào)用posix_fadvise
posix_fadvise 是 POSIX 標準提供的系統(tǒng)調(diào)用函數(shù)
#include
int posix_fadvise(int fd, off_t offset, off_t len, int advice);
fd?:文件描述符,通過 open 函數(shù)獲取。
offset?:建議應用的起始位置(字節(jié)為單位)。
len?:建議的字節(jié)長度。若為 0,則覆蓋從 offset 到文件末尾的所有數(shù)據(jù)。
advice?:建議類型,常見值包括:
POSIX_FADV_DONTNEED:數(shù)據(jù)近期不會被訪問,可釋放緩存。
POSIX_FADV_WILLNEED:數(shù)據(jù)即將被訪問,可預讀到緩存。
POSIX_FADV_SEQUENTIAL:順序訪問模式,可能增大預讀窗口。
POSIX_FADV_RANDOM:隨機訪問模式,清除預讀緩存。
#define PGF_WILLNEED 10 //加載數(shù)據(jù)
#define PGF_DONTNEED 20 // 卸載數(shù)據(jù)
#define PGF_NORMAL 30 // 恢復預讀標識
#define PGF_SEQUENTIAL 40 // 順序預讀標識
#define PGF_RANDOM 50 // 隨機預讀標識

可以看出pgfadvise則是pgfadvise_normal、pgfadvise_sequential、pgfadvise_random、pgfadvise_dontneed、pgfadvise_willneed的混合用法
pgfadvise_normal函數(shù)則將其預讀訪問模式擦掉。
SELECT *
FROM pgfadvise('t_big'::regclass, 'main', 10);
relpath | os_page_size | rel_os_pages | os_pages_free
--------------+--------------+--------------+---------------
base/5/90963 | 4096 | 10346 | 1788377

浙公網(wǎng)安備 33010602011771號