Defining an aggregation using the .groupby method
The most common use of the .groupby method is to perform an aggregation. Before moving ahead, let's quickly see what an aggregation is. An aggregation takes place when a sequence of many inputs gets summarized or combined into a single output value. For example, summing all the values of a column or finding its maximum are aggregations applied to a sequence of data. In short, an aggregation reduces a sequence to a single value.
Besides the grouping columns, most aggregations have two other components: the aggregating columns and the aggregating functions. The aggregating columns are the columns whose values will be aggregated. The aggregating functions define what aggregations take place. Aggregating functions include sum, min, max, mean, count, var (variance), std (standard deviation), and so on.
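To make these terms concrete, here is a minimal sketch on a small, made-up DataFrame (the name toy and its values are assumptions for illustration, not the flights dataset used in the recipe). It first reduces a whole column to one value, then computes one value per group:

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "DL", "DL", "DL"],
    "ARR_DELAY": [10.0, 20.0, -5.0, 0.0, 5.0],
})

# A plain aggregation: many values in, a single value out.
overall_mean = toy["ARR_DELAY"].mean()

# Grouping column: AIRLINE. Aggregating column: ARR_DELAY.
# Aggregating function: mean. The result has one value per airline.
per_airline = toy.groupby("AIRLINE")["ARR_DELAY"].mean()

print(overall_mean)
print(per_airline)
```

The grouped result has the same shape of idea as the plain aggregation, only repeated once for each distinct value of the grouping column.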
This article is an excerpt from the book Pandas 1.x Cookbook, Second Edition by Matt Harrison and Theodore Petrou. This newly updated and revised edition provides you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas.
In this recipe, we examine the flights dataset and perform the simplest aggregation involving only a single grouping column, a single aggregating column, and a single aggregating function. We will find the average arrival delay for each airline. pandas has different syntaxes to create an aggregation, and this recipe will show them.
How to do it ...
1. Read in the flights dataset:

>>> import pandas as pd
>>> import numpy as np
>>> flights = pd.read_csv('data/flights.csv')
>>> flights.head()
0      1    1        4  ...       65.0         0          0
1      1    1        4  ...      -13.0         0          0
2      1    1        4  ...       35.0         0          0
3      1    1        4  ...       -7.0         0          0
4      1    1        4  ...       39.0         0

2. Define the grouping columns (AIRLINE), aggregating columns (ARR_DELAY), and aggregating functions (mean). Place the grouping column in the .groupby method and then call the .agg method with a dictionary pairing the aggregating column with its aggregating function. Passing a dictionary returns a DataFrame:

>>> (flights
...     .groupby('AIRLINE')
...     .agg({'ARR_DELAY': 'mean'})
... )
         ARR_DELAY
AIRLINE
AA        5.542661
AS       -0.833333
B6        8.692593
DL        0.339691
EV        7.034580
...            ...
OO        7.593463
UA        7.765755
US        1.681105
VX        5.348884
WN        6.397353

3. Alternatively, you may place the aggregating column in the index operator and then pass the aggregating function as a string to .agg. This will return a Series:

>>> (flights
...     .groupby('AIRLINE')
...     ['ARR_DELAY']
...     .agg('mean')
... )
AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
        ...
OO    7.593463
UA    7.765755
US    1.681105
VX    5.348884
WN    6.397353
Name: ARR_DELAY, Length: 14, dtype: float64

4. The string names used in the previous step are a convenience that pandas offers you to refer to a particular aggregating function. You can pass any aggregating function directly to the .agg method, such as the NumPy mean function. The output is the same as the previous step:

>>> (flights
...     .groupby('AIRLINE')
...     ['ARR_DELAY']
...     .agg(np.mean)
... )
AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
        ...
OO    7.593463
UA    7.765755
US    1.681105
VX    5.348884
WN    6.397353
Name: ARR_DELAY, Length: 14, dtype: float64

5. It's possible to skip the .agg method altogether in this case and use the .mean method directly. This output is also the same as step 3:

>>> (flights
...     .groupby('AIRLINE')
...     ['ARR_DELAY']
...     .mean()
... )
AIRLINE
AA    5.542661
AS   -0.833333
B6    8.692593
DL    0.339691
EV    7.034580
        ...
OO    7.593463
UA    7.765755
US    1.681105
VX    5.348884
WN    6.397353
Name: ARR_DELAY, Length: 14, dtype: float64
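If you do not have the flights file at hand, the following sketch checks on a small, invented DataFrame (flights_toy and its values are assumptions, not the recipe's data) that the dictionary, string, and direct-method syntaxes compute the same per-airline means:

```python
import pandas as pd

# Hypothetical stand-in for the flights dataset; values are invented.
flights_toy = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "DL", "DL", "DL"],
    "ARR_DELAY": [10.0, 20.0, -5.0, 0.0, 5.0],
})

# Dictionary syntax: aggregating column mapped to its function.
as_frame = flights_toy.groupby("AIRLINE").agg({"ARR_DELAY": "mean"})

# Index-operator syntax with the function given by its string name.
as_series = flights_toy.groupby("AIRLINE")["ARR_DELAY"].agg("mean")

# Direct method call, skipping .agg.
via_method = flights_toy.groupby("AIRLINE")["ARR_DELAY"].mean()

# The numbers agree across all three spellings.
print(as_series.equals(via_method))
print(as_frame["ARR_DELAY"].tolist() == as_series.tolist())
```

Only the container differs between the spellings; which one to prefer is mostly a matter of what the next step in your chain expects.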
How it works ...
The syntax for the .groupby method is not as straightforward as other methods. Let's intercept the chain of methods in step 2 by storing the result of the .groupby method as its own variable:
>>> grouped = flights.groupby('AIRLINE')
>>> type(grouped)
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
A completely new intermediate object is first produced with its own distinct attributes and methods. No calculations take place at this stage. Pandas merely validates the grouping columns. This groupby object has an .agg method to perform aggregations. One of the ways to use this method is to pass it a dictionary mapping the aggregating column to the aggregating function, as done in step 2. If you pass in a dictionary, the result will be a DataFrame.
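The intermediate object can be inspected before any aggregation runs. The sketch below uses a small, invented DataFrame (toy and the column name NOT_A_COLUMN are assumptions for illustration) to show that the groupby object records how rows map to groups, that an unknown grouping column is rejected right away, and that the aggregated values appear only once .agg is called:

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
toy = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "DL", "DL", "DL"],
    "ARR_DELAY": [10.0, 20.0, -5.0, 0.0, 5.0],
})

grouped = toy.groupby("AIRLINE")

# The groupby object knows the groups but holds no aggregated values yet.
print(type(grouped).__name__)
print(grouped.ngroups)   # number of distinct airlines
print(grouped.groups)    # group label -> row labels belonging to that group

# The grouping column is validated when the groupby object is created.
try:
    toy.groupby("NOT_A_COLUMN")
except KeyError as err:
    print("invalid grouping column:", err)

# The aggregation itself happens here.
result = grouped.agg({"ARR_DELAY": "mean"})
print(result)
```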
The pandas library often has more than one way to perform the same operation. Step 3 shows another way to perform a groupby. Instead of identifying the aggregating column in the dictionary, place it inside the index operator as if you were selecting it as a column from a DataFrame. The function string name is then passed as a scalar to the .agg method. The result, in this case, is a Series.
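The difference in return type can be checked directly. This is a small sketch on invented data (toy is an assumption, not the flights dataset); it also shows that selecting with a list of column labels instead of a single label keeps the result a DataFrame:

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
toy = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "DL", "DL", "DL"],
    "ARR_DELAY": [10.0, 20.0, -5.0, 0.0, 5.0],
})

by_airline = toy.groupby("AIRLINE")

# Dictionary passed to .agg: the result is a DataFrame.
dict_result = by_airline.agg({"ARR_DELAY": "mean"})

# Single label in the index operator: the result is a Series.
series_result = by_airline["ARR_DELAY"].agg("mean")

# List of labels in the index operator: the result is a DataFrame again.
frame_result = by_airline[["ARR_DELAY"]].agg("mean")

print(type(dict_result).__name__)
print(type(series_result).__name__)
print(type(frame_result).__name__)
```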
You may pass any aggregating function to the .agg method. pandas lets you use string names for simplicity, but you may also pass the function object itself, as done in step 4 with the NumPy mean function. NumPy provides many functions that aggregate values.
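The same mechanism accepts your own functions, as long as they take the values of one group and return a single value. A minimal sketch, assuming invented data and a hypothetical helper named delay_range (neither comes from the recipe):

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
toy = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "DL", "DL", "DL"],
    "ARR_DELAY": [10.0, 20.0, -5.0, 0.0, 8.0],
})


def delay_range(delays):
    """Hypothetical custom aggregation: largest delay minus smallest delay."""
    # `delays` is a Series holding the ARR_DELAY values of one group.
    return delays.max() - delays.min()


# Pass the function object itself, not a string and not a call.
ranges = toy.groupby("AIRLINE")["ARR_DELAY"].agg(delay_range)
print(ranges)
```

Note that delay_range is passed without parentheses; pandas calls it once per group.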
Step 5 shows one last syntax flavor. When you are only applying a single aggregating function as in this example, you can often call it directly as a method on the groupby object itself without .agg. Not all aggregation functions have a method equivalent, but most do.
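As a rough illustration of when the shortcut applies, the sketch below (on invented data; toy is an assumption) compares the direct method with its .agg equivalent, and then passes a list of string names to .agg, which is one case where the single-method shortcut is not enough:

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
toy = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "DL", "DL", "DL"],
    "ARR_DELAY": [10.0, 20.0, -5.0, 0.0, 5.0],
})

delays = toy.groupby("AIRLINE")["ARR_DELAY"]

# One aggregating function: the method and .agg give the same Series.
shortcut = delays.max()
via_agg = delays.agg("max")
print(shortcut.equals(via_agg))

# Several aggregating functions at once: pass a list of names to .agg.
# The result is a DataFrame with one column per function.
summary = delays.agg(["mean", "max"])
print(summary)
```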
For practical, easy-to-implement recipes that offer quick solutions to common data problems using pandas, please refer to the book Pandas 1.x Cookbook, Second Edition by Matt Harrison and Theodore Petrou.
About the authors
Matt Harrison has been using Python since 2000. He runs MetaSnake, which provides corporate training for Python and Data Science. He is the author of Machine Learning Pocket Reference, the bestselling Illustrated Guide to Python 3, and Learning the Pandas Library, among other books.
Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.