Grouping Techniques in pandas
Here we cover a very common technique: grouping. It is ubiquitous in database work, where we constantly need statistics per group, so we need to know how grouping is done in pandas.
Grouping operations
In this section we look at how grouped computations work in pandas. Broadly, you can group by columns, by a dict or Series, by a function, or by an index level; we introduce each in turn.
Grouping by columns
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
sep = "---------------------------------------------------------------------------"
data = DataFrame({"key1": ['a', 'a', 'b', 'b', 'a'], "key2": ['one', 'two', 'one', 'two', 'one'], 'data1': np.random.randn(5), 'data2': np.random.randn(5)})
print(data)
data1 data2 key1 key2
0 0.733951 0.000379 a one
1 1.039029 0.852930 a two
2 0.921413 -1.644942 b one
3 0.294560 0.521525 b two
4 0.286072 -0.074574 a one
Group data1 by key1:
groups = data['data1'].groupby(data['key1'])
This gives us a SeriesGroupBy object; now let's iterate over it:
for name, group in groups:
    print(name)
    print(sep)
    print(group)
    print(sep)
a
---------------------------------------------------------------------------
0 0.733951
1 1.039029
4 0.286072
Name: data1, dtype: float64
---------------------------------------------------------------------------
b
---------------------------------------------------------------------------
2 0.921413
3 0.294560
Name: data1, dtype: float64
---------------------------------------------------------------------------
We see that groups is made up of (key, group) pairs: the key is the value the rows were grouped on, and the group is the corresponding Series (or DataFrame) slice, which keeps its original column name.
To iterate over the groups, use:
for name, group in groups
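A handy related idiom (a small sketch using the groups object above) is to materialize the pairs into a dict, so individual groups can be looked up by key:
pieces = dict(list(groups))
pieces['a']   # the data1 values belonging to group 'a'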
groups = data.groupby(data['key1'])
for name, group in groups:
    print(name)
    print(sep)
    print(group)
    print(sep)
a
---------------------------------------------------------------------------
data1 data2 key1 key2
0 0.733951 0.000379 a one
1 1.039029 0.852930 a two
4 0.286072 -0.074574 a one
---------------------------------------------------------------------------
b
---------------------------------------------------------------------------
data1 data2 key1 key2
2 0.921413 -1.644942 b one
3 0.294560 0.521525 b two
---------------------------------------------------------------------------
groupby simply groups by some value(s); this works the same way whether you start from a Series or a DataFrame.
We can call statistical functions on the grouped object.
data.groupby(data['key1']).mean()
| key1 | data1 | data2 |
|---|---|---|
| a | 0.686351 | 0.259578 |
| b | 0.607986 | -0.561709 |
Within each group the mean is computed for every column separately; non-numeric columns are dropped by default.
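Note that recent pandas releases no longer drop non-numeric columns silently; if you hit an error, one option (assuming your pandas version supports the flag) is to ask for numeric columns explicitly:
data.groupby('key1').mean(numeric_only=True)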
Exercise 1: compute the mean of each row within each group.
Hint: data.groupby(data['key1']).mean(axis=1) does not work.
When grouping by multiple columns, each group key is a tuple of the corresponding group values.
Exercise 2: group a DataFrame by multiple columns.
Next, let's look at a piece of syntactic sugar:
data.groupby([data['key1'], data['key2']])
<pandas.core.groupby.DataFrameGroupBy object at 0x000001D080230278>
It is equivalent to:
data.groupby(['key1', 'key2'])
<pandas.core.groupby.DataFrameGroupBy object at 0x000001D080230630>
Let's verify this:
groups = data.groupby([data['key1'], data['key2']])
for name, group in groups:
    print(name)
    print(sep)
    print(group)
    print(sep)
('a', 'one')
---------------------------------------------------------------------------
data1 data2 key1 key2
0 0.733951 0.000379 a one
4 0.286072 -0.074574 a one
---------------------------------------------------------------------------
('a', 'two')
---------------------------------------------------------------------------
data1 data2 key1 key2
1 1.039029 0.85293 a two
---------------------------------------------------------------------------
('b', 'one')
---------------------------------------------------------------------------
data1 data2 key1 key2
2 0.921413 -1.644942 b one
---------------------------------------------------------------------------
('b', 'two')
---------------------------------------------------------------------------
data1 data2 key1 key2
3 0.29456 0.521525 b two
---------------------------------------------------------------------------
groups = data.groupby(['key1', 'key2'])
for name, group in groups:
    print(name)
    print(sep)
    print(group)
    print(sep)
('a', 'one')
---------------------------------------------------------------------------
data1 data2 key1 key2
0 0.733951 0.000379 a one
4 0.286072 -0.074574 a one
---------------------------------------------------------------------------
('a', 'two')
---------------------------------------------------------------------------
data1 data2 key1 key2
1 1.039029 0.85293 a two
---------------------------------------------------------------------------
('b', 'one')
---------------------------------------------------------------------------
data1 data2 key1 key2
2 0.921413 -1.644942 b one
---------------------------------------------------------------------------
('b', 'two')
---------------------------------------------------------------------------
data1 data2 key1 key2
3 0.29456 0.521525 b two
---------------------------------------------------------------------------
The output is exactly the same. To summarize:
data.groupby([data['key1'], data['key2']]) is equivalent to data.groupby(['key1', 'key2']).
Going one step further:
data['data1'].groupby([data['key1'], data['key2']]) is equivalent to data.groupby(['key1', 'key2'])['data1'].
Exercise 3: verify that data['data1'].groupby([data['key1'], data['key2']]) is equivalent to data.groupby(['key1', 'key2'])['data1'].
data.groupby(['key1', 'key2'])['data1']
<pandas.core.groupby.SeriesGroupBy object at 0x000001D0FCD95D68>
data.groupby(['key1', 'key2'])[['data1']]
<pandas.core.groupby.DataFrameGroupBy object at 0x000001D080232898>
You may have noticed that these two calls return different types. Let's take a closer look:
data[['data1']] # this is a DataFrame
| | data1 |
|---|---|
| 0 | 0.733951 |
| 1 | 1.039029 |
| 2 | 0.921413 |
| 3 | 0.294560 |
| 4 | 0.286072 |
data['data1'] # this is a Series
0 0.733951
1 1.039029
2 0.921413
3 0.294560
4 0.286072
Name: data1, dtype: float64
With that, the difference between the two should be self-evident.
groups = data.groupby(['key1', 'key2'])[['data1']]
for name, group in groups:
    print(name)
    print(sep)
    print(type(group))  # each slice is a DataFrame this time
    print(sep)
('a', 'one')
---------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
---------------------------------------------------------------------------
('a', 'two')
---------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
---------------------------------------------------------------------------
('b', 'one')
---------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
---------------------------------------------------------------------------
('b', 'two')
---------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
---------------------------------------------------------------------------
The grouping itself is the same as before; the difference is that each slice is now a DataFrame.
data.groupby(['key1', 'key2'])[['data1']].mean()
| key1 | key2 | data1 |
|---|---|---|
| a | one | 0.510012 |
| a | two | 1.039029 |
| b | one | 0.921413 |
| b | two | 0.294560 |
data.groupby(['key1', 'key2'])['data1'].mean()
key1 key2
a one 0.510012
two 1.039029
b one 0.921413
two 0.294560
Name: data1, dtype: float64
The difference shows up when we aggregate:
[['data1']] yields a DataFrame, while ['data1'] yields a Series.
Grouping by a dict
Let's look at an example of grouping with a dict:
data = DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'], index=['joe', 'steve', 'wes', 'jim', 'Travis'])
data
| | a | b | c | d | e |
|---|---|---|---|---|---|
| joe | -0.089597 | 1.239307 | 2.173063 | -0.519295 | -1.783812 |
| steve | 0.539109 | 0.724553 | -0.041899 | 0.787494 | 0.394633 |
| wes | -0.055417 | 0.384068 | -0.594006 | -0.451587 | 0.722761 |
| jim | -0.056767 | 0.398863 | 2.140669 | -1.060791 | -0.953756 |
| Travis | 0.245142 | -0.468819 | -0.863372 | -0.151966 | 1.185567 |
# define a grouping dict: a, b, c --> red; d, e --> blue
mapping = {'a':'red', 'b':'red', 'c': 'red', 'd':'blue', 'e': 'blue'}
data.groupby(mapping, axis=1).mean() # compute the mean within each group of columns
| | blue | red |
|---|---|---|
| joe | -1.151554 | 1.107591 |
| steve | 0.591063 | 0.407255 |
| wes | 0.135587 | -0.088452 |
| jim | -1.007273 | 0.827589 |
| Travis | 0.516800 | -0.362350 |
Exercise 4: design your own mapping over the index and group along axis=0.
Grouping by a function
Without further ado, straight to an example:
data.groupby(len).mean()
| | a | b | c | d | e |
|---|---|---|---|---|---|
| 3 | -0.067260 | 0.674079 | 1.239909 | -0.677224 | -0.671602 |
| 5 | 0.539109 | 0.724553 | -0.041899 | 0.787494 | 0.394633 |
| 6 | 0.245142 | -0.468819 | -0.863372 | -0.151966 | 1.185567 |
Notice that both dicts and functions are applied to the index values (or to the column labels when axis=1).
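The same grouping also works if we pass a Series instead of a dict; pandas treats the Series as a mapping from labels to group names (a small sketch, reusing the mapping defined above):
map_series = Series(mapping)
data.groupby(map_series, axis=1).count()   # number of columns in each color group, per row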
Grouping by a list
This example is very simple:
data.groupby(['1', '1', '1', '2', '2']).mean()
| | a | b | c | d | e |
|---|---|---|---|---|---|
| 1 | 0.131365 | 0.782643 | 0.512386 | -0.061130 | -0.222139 |
| 2 | 0.094188 | -0.034978 | 0.638649 | -0.606378 | 0.115905 |
pandas automatically figures out whether the argument names a column or is a list of group labels.
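To make the disambiguation concrete (a small illustrative sketch): a string that matches a column name selects that column as the key, while a list whose length equals the axis is used directly as group labels.
data.groupby('a').size()                        # 'a' names a column, so its values become the keys
data.groupby(['1', '1', '1', '2', '2']).size()  # a list of row labels: rows 0-2 vs rows 3-4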
Grouping by index level
Exercise 5: learn on your own how to group by an index level.
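As a starting point, here is a minimal sketch (with a hypothetical two-level column index) of what level-based grouping looks like:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = DataFrame(np.random.randn(4, 5), columns=columns)
hier_df.groupby(level='cty', axis=1).count()   # group the columns by the 'cty' level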
Group-wise computations
Group-wise computations mainly involve three functions: agg, transform and apply.
We will look at them one by one.
agg
data = DataFrame({"key1": ['a', 'a', 'b', 'b', 'a'], "key2": ['one', 'two', 'one', 'two', 'one'], 'data1': np.random.randn(5), 'data2': np.random.randn(5)})
data
| | data1 | data2 | key1 | key2 |
|---|---|---|---|---|
| 0 | 0.441278 | -0.848457 | a | one |
| 1 | 1.843375 | -0.522482 | a | two |
| 2 | -1.435176 | -0.191682 | b | one |
| 3 | -2.700772 | -0.832993 | b | two |
| 4 | -1.430386 | -1.910834 | a | one |
data.groupby("key2").agg(np.mean)
| key2 | data1 | data2 |
|---|---|---|
| one | -0.808095 | -0.983658 |
| two | -0.428699 | -0.677738 |
Of course, this is equivalent to:
data.groupby("key2").mean()
| key2 | data1 | data2 |
|---|---|---|
| one | -0.808095 | -0.983658 |
| two | -0.428699 | -0.677738 |
How it works: the aggregation function is evaluated on every grouped slice (each column within each group) and reduces it to a single value.
We can also pass a custom function:
data.groupby("key2").agg(lambda x: x.max() - x.min())
| key2 | data1 | data2 |
|---|---|---|
| one | 1.876454 | 1.719153 |
| two | 4.544147 | 0.310511 |
The function is called on every column of every group.
data.groupby("key2").agg([np.mean, np.max,np.min])
| key2 | data1 mean | data1 amax | data1 amin | data2 mean | data2 amax | data2 amin |
|---|---|---|---|---|---|---|
| one | -0.808095 | 0.441278 | -1.435176 | -0.983658 | -0.191682 | -1.910834 |
| two | -0.428699 | 1.843375 | -2.700772 | -0.677738 | -0.522482 | -0.832993 |
data.groupby("key2").agg([("平均值:", np.mean), ("最大值",np.max), ("最小值",np.min)]).rename({"one": "第一組", "two":"第二組"})
| key2 | data1 平均值: | data1 最大值 | data1 最小值 | data2 平均值: | data2 最大值 | data2 最小值 |
|---|---|---|---|---|---|---|
| 第一組 | -0.808095 | 0.441278 | -1.435176 | -0.983658 | -0.191682 | -1.910834 |
| 第二組 | -0.428699 | 1.843375 | -2.700772 | -0.677738 | -0.522482 | -0.832993 |
# use different aggregation functions for different columns
data.groupby("key2").agg({"data1":[("平均值:", np.mean), ("最大值",np.max)], "data2":[("最小值",np.min)]}).rename({"one": "第一組", "two":"第二組"})
| key2 | data2 最小值 | data1 平均值: | data1 最大值 |
|---|---|---|---|
| 第一組 | -1.910834 | -0.808095 | 0.441278 |
| 第二組 | -0.832993 | -0.428699 | 1.843375 |
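If you are on a newer pandas (0.25 or later; this is an assumption about your version), named aggregation is an alternative to the (name, function) tuples used above:
data.groupby("key2").agg(data1_mean=("data1", "mean"),
                         data1_max=("data1", "max"),
                         data2_min=("data2", "min"))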
transform
transform is a vectorized operation: if the value it produces does not match the shape of the group slice, the value is broadcast back to that shape:
data
| | data1 | data2 | key1 | key2 |
|---|---|---|---|---|
| 0 | 0.441278 | -0.848457 | a | one |
| 1 | 1.843375 | -0.522482 | a | two |
| 2 | -1.435176 | -0.191682 | b | one |
| 3 | -2.700772 | -0.832993 | b | two |
| 4 | -1.430386 | -1.910834 | a | one |
data.groupby("key1").transform(np.mean)
| | data1 | data2 |
|---|---|---|
| 0 | 0.284756 | -1.093924 |
| 1 | 0.284756 | -1.093924 |
| 2 | -2.067974 | -0.512338 |
| 3 | -2.067974 | -0.512338 |
| 4 | 0.284756 | -1.093924 |
Look closely: rows 0, 1, 4 form one group and rows 2, 3 the other, and the group means have been broadcast back to every row.
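Passing the function by name should give the same result and is often faster, since pandas can use an optimized implementation (a small sketch):
data.groupby("key1")[["data1", "data2"]].transform("mean")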
Now suppose we want to subtract the group mean from each group.
data.groupby("key1").transform(lambda x: x - x.mean())
| | data1 | data2 |
|---|---|---|
| 0 | 0.156523 | 0.245468 |
| 1 | 1.558619 | 0.571442 |
| 2 | 0.632798 | 0.320656 |
| 3 | -0.632798 | -0.320656 |
| 4 | -1.715142 | -0.816910 |
Every column in the a and b groups has had its group mean subtracted. If you don't believe it, check:
data.groupby("key1").transform(lambda x: x - x.mean()).groupby([1, 1, 0,0, 1]).mean()
| | data1 | data2 |
|---|---|---|
| 0 | 1.110223e-16 | -5.551115e-17 |
| 1 | 7.401487e-17 | -1.110223e-16 |
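Another common use of transform is group-wise standardization; a small sketch, following the same pattern as the demeaning above:
zscores = data.groupby("key1").transform(lambda x: (x - x.mean()) / x.std())
zscores.groupby(data['key1']).std()   # each group's standard deviation should now be ~1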
apply
apply is an enhanced version of transform: transform must return something the same size as the original slice (or a value that can be broadcast to it), while apply can return anything. We have actually used apply before; recall that apply works on columns (or rows), whereas applymap works element-wise.
data = DataFrame({"key1": ['a', 'a', 'b', 'b', 'a'], "key2": ['one', 'two', 'one', 'two', 'one'], 'data1': np.random.randn(5), 'data2': np.random.randn(5)})
data
| | data1 | data2 | key1 | key2 |
|---|---|---|---|---|
| 0 | -0.312694 | 0.073574 | a | one |
| 1 | -0.902065 | -0.854249 | a | two |
| 2 | -0.440915 | 0.228551 | b | one |
| 3 | -0.406243 | -0.878505 | b | two |
| 4 | 1.812926 | -0.114598 | a | one |
What if we want the two rows with the largest data2 within each of the one and two groups?
data.groupby('key2').apply(lambda x: x.sort_values(by='data2')[-2:])
| key2 | | data1 | data2 | key1 | key2 |
|---|---|---|---|---|---|
| one | 0 | -0.312694 | 0.073574 | a | one |
| one | 2 | -0.440915 | 0.228551 | b | one |
| two | 3 | -0.406243 | -0.878505 | b | two |
| two | 1 | -0.902065 | -0.854249 | a | two |
To drop the group keys from the hierarchical index:
data.groupby('key2', group_keys=False).apply(lambda x: x.sort_values(by='data2')[-2:])
| | data1 | data2 | key1 | key2 |
|---|---|---|---|---|
| 0 | -0.312694 | 0.073574 | a | one |
| 2 | -0.440915 | 0.228551 | b | one |
| 3 | -0.406243 | -0.878505 | b | two |
| 1 | -0.902065 | -0.854249 | a | two |
To summarize: apply calls our function on each grouped slice (by row, by column, or on the slice as a whole) and then glues the results back together.
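apply can also take a regular named function plus extra arguments; here is a small sketch with a hypothetical top-n helper:
def top(df, n=2, column='data2'):
    # return the n rows with the largest values in `column`
    return df.sort_values(by=column, ascending=False)[:n]

data.groupby('key2').apply(top, n=1, column='data1')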
Using groupby to process a DataFrame with multiple processes
Here we show how the groupby technique can be used to process a DataFrame in parallel.
pip install joblib
Because of limitations on Windows, the following code was run on Linux:
import math
from joblib import Parallel, delayed
from pandas import DataFrame
import pandas as pd
import numpy as np
import time

begin = time.time()
test = DataFrame(np.random.randn(10000000, 10))
test_other = test.copy()

# split the rows into 8 chunks by taking the index modulo 8
groups = test.groupby(lambda x: x % 8)

def func(x):
    # raise every element to the 4th power
    return x.applymap(lambda y: math.pow(y, 4))

# process the 8 chunks in parallel and concatenate the partial results
pd.concat(Parallel(n_jobs=8)(delayed(func)(group) for name, group in groups))
print(time.time() - begin)

# the same computation in a single process, for comparison
begin = time.time()
test_other.applymap(lambda x: math.pow(x, 4))
print(time.time() - begin)
The timing results:
23.35878014564514
62.76386260986328
Roughly a 2.5x speed-up, which is quite decent.
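The same pattern can be wrapped into a small reusable helper; the following is only a sketch (the helper name and the modulo-based chunking are our own choices, not part of any library):
def parallel_apply(df, func, n_jobs=8):
    # split the rows into n_jobs chunks by positional index, run func on each chunk
    # in a separate process, then concatenate and restore the original row order
    chunks = df.groupby(np.arange(len(df)) % n_jobs)
    results = Parallel(n_jobs=n_jobs)(delayed(func)(chunk) for _, chunk in chunks)
    return pd.concat(results).sort_index()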
