“Group by” refers to a process involving one or more of the following steps:

  • Splitting data into groups based on some criteria
  • Applying a function to each group independently
  • Combining results into a data structure
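
The three steps above can be sketched by hand before relying on groupby itself. This is only an illustration under assumptions: the `toy` frame, its `key` and `value` columns, and the intermediate names are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data, chosen so the sums are easy to check by hand.
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                    'value': [1, 2, 3, 4]})

# 1. Split: collect the rows belonging to each distinct key.
pieces = {key: toy[toy['key'] == key] for key in toy['key'].unique()}

# 2. Apply: compute a sum for each piece independently.
sums = {key: piece['value'].sum() for key, piece in pieces.items()}

# 3. Combine: gather the per-group results into one Series.
manual = pd.Series(sums, name='value').sort_index()
manual.index.name = 'key'

# groupby performs the same three steps in one call.
grouped = toy.groupby('key')['value'].sum()
```

Both `manual` and `grouped` hold one total per key. The manual version is only meant to make the steps visible; in practice the single groupby call is the more idiomatic choice.
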
In [2]:
import numpy as np
import pandas as pd
In [9]:
df = pd.DataFrame({'M': ['foo', 'bar', 'foo', 'bar'],
                   'N': ['one', 'one', 'two', 'three'],
                   'O': np.random.randn(4),
                   'P': np.random.randn(4)})
In [10]:
df
Out[10]:
     M      N         O         P
0  foo    one -0.491250  0.611151
1  bar    one  0.428697  0.189252
2  foo    two -0.993231 -0.794367
3  bar  three  0.238803 -1.466616

Group the rows by column M, then apply the sum() function to each resulting group:

In [11]:
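# Select the numeric columns explicitly: recent pandas versions would
# otherwise also "sum" (concatenate) the strings in column N.
df.groupby('M')[['O', 'P']].sum()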
Out[11]:
            O         P
M
bar  0.667500 -1.277364
foo -1.484481 -0.183216

Grouping by multiple columns produces a hierarchical index (a MultiIndex) on the result, and we can again apply the sum() function:

In [12]:
df.groupby(['M', 'N']).sum()
Out[12]:
                  O         P
M   N
bar one    0.428697  0.189252
    three  0.238803 -1.466616
foo one   -0.491250  0.611151
    two   -0.993231 -0.794367
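
A result with a hierarchical index can be queried by level or flattened back into ordinary columns. The sketch below shows one way to do that. It assumes a hypothetical `demo` frame with the same layout as df but fixed values in place of the random ones, so the results are reproducible.

```python
import pandas as pd

# Same layout as df above, but with fixed hypothetical values
# instead of random ones.
demo = pd.DataFrame({'M': ['foo', 'bar', 'foo', 'bar'],
                     'N': ['one', 'one', 'two', 'three'],
                     'O': [1.0, 2.0, 3.0, 4.0],
                     'P': [10.0, 20.0, 30.0, 40.0]})

summed = demo.groupby(['M', 'N']).sum()

# The index has two levels, named after the grouping columns.
level_names = list(summed.index.names)

# Select a single group with a tuple of (M, N) labels.
bar_three = summed.loc[('bar', 'three')]

# Select every row under one outer label; the result is indexed by N.
foo_rows = summed.loc['foo']

# Turn the index levels back into ordinary columns.
flat = summed.reset_index()
```

Calling reset_index() is convenient when the grouped result is to be merged with other tables or exported, since M and N become regular columns again.
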