Categorical data in a DataFrame:

In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                    "raw_grade": ['a', 'b', 'c', 'd', 'e']})

Convert the raw grades to a categorical data type:

In [3]:
df["grade"] = df["raw_grade"].astype("category")
In [4]:
df["grade"]
Out[4]:
0    a
1    b
2    c
3    d
4    e
Name: grade, dtype: category
Categories (5, object): [a, b, c, d, e]
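
As a minimal sketch of what the conversion stores, the example below uses a small hypothetical Series (not the DataFrame above) with repeated values. A categorical keeps one list of distinct categories plus an integer code per row; both are exposed through the `.cat` accessor.

```python
import pandas as pd

# Hypothetical data with repeated grades, for illustration only.
raw = pd.Series(["a", "b", "b", "a", "e"])
grade = raw.astype("category")

# Distinct values, inferred and sorted from the data.
categories = list(grade.cat.categories)

# One integer per row, pointing into the category list.
codes = grade.cat.codes.tolist()

print(categories)  # ['a', 'b', 'e']
print(codes)       # [0, 1, 1, 0, 2]
```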

Rename the categories to more meaningful names:

In [5]:
df["grade"] = df["grade"].cat.rename_categories(["very bad", "very good", "better", "good", "bad"])
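
Renaming can also be written with an explicit old-to-new mapping, which avoids relying on the position of each category. The sketch below uses a hypothetical Series and hypothetical labels; `rename_categories` returns a new Series and leaves the original untouched.

```python
import pandas as pd

# Hypothetical data, for illustration only.
s = pd.Series(["a", "b", "a", "e"], dtype="category")

# Map each existing category to its new label.
mapping = {"a": "very good", "b": "good", "e": "very bad"}
renamed = s.cat.rename_categories(mapping)

print(renamed.tolist())              # ['very good', 'good', 'very good', 'very bad']
print(list(renamed.cat.categories))  # ['very good', 'good', 'very bad']
```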

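Set the categories and their order with set_categories, which also adds any categories that are missing
(methods under Series.cat return a new Series by default). Here the list matches the existing categories,
so the values are unchanged.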

In [6]:
df["grade"] = df["grade"].cat.set_categories(["very bad","very good","better","good","bad"])
In [7]:
df["grade"]
Out[7]:
0     very bad
1    very good
2       better
3         good
4          bad
Name: grade, dtype: category
Categories (5, object): [very bad, very good, better, good, bad]
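
To see `set_categories` actually reorder and add categories, consider a hypothetical Series in which only three of five grade levels occur. This is a sketch with made-up labels, not the data above.

```python
import pandas as pd

# Hypothetical data: only three of the five levels are present.
s = pd.Series(["good", "very bad", "good", "very good"], dtype="category")

# Inferred categories come out in lexical order.
before = list(s.cat.categories)

# Replace the category list: new order, plus two levels with no rows yet.
levels = ["very bad", "bad", "medium", "good", "very good"]
s2 = s.cat.set_categories(levels)
after = list(s2.cat.categories)

print(before)  # ['good', 'very bad', 'very good']
print(after)   # ['very bad', 'bad', 'medium', 'good', 'very good']
```

The row values are the same before and after; only the category list changes. A value that is absent from the new list would become NaN.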

Sorting follows the order of the categories, not lexical order:

In [8]:
df.sort_values(by="grade")
Out[8]:
   id raw_grade      grade
0   1         a   very bad
1   2         b  very good
2   3         c     better
3   4         d       good
4   5         e        bad
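
The difference from lexical sorting is easier to see when the rows start out of order. The sketch below sorts the same hypothetical values twice, once as plain strings and once as a categorical with an explicit level order.

```python
import pandas as pd

# Hypothetical levels and data, for illustration only.
levels = ["very bad", "bad", "medium", "good", "very good"]
raw = pd.Series(["good", "very bad", "very good", "bad"])

# Plain strings sort alphabetically.
lexical = raw.sort_values().tolist()

# A categorical sorts by the position of each value in `levels`.
cat = raw.astype(pd.CategoricalDtype(categories=levels))
by_category = cat.sort_values().tolist()

print(lexical)      # ['bad', 'good', 'very bad', 'very good']
print(by_category)  # ['very bad', 'bad', 'good', 'very good']
```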

Grouping by a categorical column with observed=False also shows empty categories (here every category
has one row, so none is empty):

In [9]:
df.groupby("grade", observed=False).size()
Out[9]:
grade
very bad     1
very good    1
better       1
good         1
bad          1
dtype: int64
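
Since every category above has a row, the empty-category behavior is not visible in that output. The following sketch uses a hypothetical DataFrame (`df_demo`) where two levels never occur, and compares `observed=False` with `observed=True`.

```python
import pandas as pd

# Hypothetical levels and data: "bad", "medium" and "very good" have no rows.
levels = ["very bad", "bad", "medium", "good", "very good"]
grades = pd.Series(["good", "very bad", "good"],
                   dtype=pd.CategoricalDtype(categories=levels))
df_demo = pd.DataFrame({"id": [1, 2, 3], "grade": grades})

# observed=False keeps every category, with a count of 0 for empty ones.
all_levels = df_demo.groupby("grade", observed=False).size()

# observed=True keeps only the categories that appear in the data.
seen_only = df_demo.groupby("grade", observed=True).size()

print(all_levels.tolist())  # [1, 0, 0, 2, 0]
print(len(seen_only))       # 2
```

Passing `observed` explicitly keeps the result independent of the pandas version's default.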