Matching / broadcasting behavior

DataFrame has the methods add(), sub(), mul(), div() and related functions radd(), rsub(), … for carrying
out binary operations. For broadcasting behavior, Series input is of primary interest. Using these functions,
you can match on either the index or the columns via the axis keyword:

In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.DataFrame({
    'one': pd.Series(np.random.randn(2), index=['a', 'b']),
    'two': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'three': pd.Series(np.random.randn(4), index=['b', 'c', 'd', 'f'])})
In [3]:
df
Out[3]:
        one       two     three
a  1.218453 -0.350691       NaN
b -0.542001 -0.419797 -0.201188
c       NaN -0.285277 -0.299671
d       NaN       NaN -0.909407
f       NaN       NaN  0.118755
In [4]:
row = df.iloc[1]
In [5]:
column = df['two']
In [6]:
df.sub(row, axis='columns')
Out[6]:
        one       two     three
a  1.760454  0.069106       NaN
b  0.000000  0.000000  0.000000
c       NaN  0.134520 -0.098483
d       NaN       NaN -0.708219
f       NaN       NaN  0.319943
In [7]:
df.sub(row, axis=1)
Out[7]:
        one       two     three
a  1.760454  0.069106       NaN
b  0.000000  0.000000  0.000000
c       NaN  0.134520 -0.098483
d       NaN       NaN -0.708219
f       NaN       NaN  0.319943
In [8]:
df.sub(column, axis='index')
Out[8]:
        one  two     three
a  1.569144  0.0       NaN
b -0.122204  0.0  0.218609
c       NaN  0.0 -0.014394
d       NaN  NaN       NaN
f       NaN  NaN       NaN
In [9]:
df.sub(column, axis=0)
Out[9]:
        one  two     three
a  1.569144  0.0       NaN
b -0.122204  0.0  0.218609
c       NaN  0.0 -0.014394
d       NaN  NaN       NaN
f       NaN  NaN       NaN
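
Because the frame above is filled with random numbers, the effect of the axis keyword can be hard to verify by eye. The following is a minimal sketch using small hand-picked values (the data is hypothetical and not part of the original example) that shows the two matching modes side by side:

```python
import pandas as pd

# Hypothetical data, chosen so the results are easy to check by hand.
df = pd.DataFrame({"one": [1.0, 2.0, 3.0],
                   "two": [10.0, 20.0, 30.0]},
                  index=["a", "b", "c"])

row = df.iloc[0]      # Series labelled by column names: one=1, two=10
column = df["one"]    # Series labelled by row names: a=1, b=2, c=3

# Match the Series labels against the column labels:
# `row` is subtracted from every row of the frame.
by_columns = df.sub(row, axis="columns")

# Match the Series labels against the row labels:
# `column` is subtracted from every column of the frame.
by_index = df.sub(column, axis="index")

print(by_columns)
print(by_index)
```

With axis="columns" the result matches the plain operator form df - row, which also aligns a Series on the column labels.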

Furthermore you can align a level of a MultiIndexed DataFrame with a Series.

In [10]:
dfmi = df.copy()
In [11]:
dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'),
                                        (1, 'c'), (2, 'a'),
                                        (2, 'f')],
                                       names=['first', 'second'])
In [12]:
dfmi.sub(column, axis=0, level='second')
Out[12]:
                   one  two     three
first second
1     a       1.569144  0.0       NaN
      b      -0.122204  0.0  0.218609
      c            NaN  0.0 -0.014394
2     a            NaN  NaN -0.558716
      f            NaN  NaN       NaN
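
To make the level alignment easier to follow, here is a small sketch with hand-picked numbers. The frame, the column name x and the offsets Series are assumptions made for illustration only. Each value of offsets is matched against the 'second' level and reused for every value of the 'first' level:

```python
import pandas as pd

# Hypothetical two-level index: 'first' in {1, 2}, 'second' in {'a', 'b'}.
index = pd.MultiIndex.from_tuples(
    [(1, "a"), (1, "b"), (2, "a"), (2, "b")],
    names=["first", "second"],
)
dfmi = pd.DataFrame({"x": [10.0, 20.0, 30.0, 40.0]}, index=index)

# A flat Series labelled only by the values of the 'second' level.
offsets = pd.Series({"a": 1.0, "b": 2.0})

# Align `offsets` with the 'second' level of the row index and subtract.
result = dfmi.sub(offsets, axis=0, level="second")
print(result)
```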

Series and Index also support the divmod() builtin. This function performs floor division and the modulo operation at
the same time, returning a two-tuple of the same type as the left-hand side. For example:

In [13]:
s = pd.Series(np.arange(10))
In [14]:
s
Out[14]:
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32
In [15]:
div, rem = divmod(s, 3)
In [16]:
div
Out[16]:
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    2
9    3
dtype: int32
In [17]:
rem
Out[17]:
0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
9    0
dtype: int32
In [18]:
idx = pd.Index(np.arange(8))
In [19]:
idx
Out[19]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64')
In [20]:
div, rem = divmod(idx, 3)
In [21]:
div
Out[21]:
Int64Index([0, 0, 0, 1, 1, 1, 2, 2], dtype='int64')
In [22]:
rem
Out[22]:
Int64Index([0, 1, 2, 0, 1, 2, 0, 1], dtype='int64')

We can also do elementwise divmod():

In [23]:
div, rem = divmod(s, [1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
In [24]:
div
Out[24]:
0    0
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
dtype: int32
In [25]:
rem
Out[25]:
0    0
1    0
2    0
3    1
4    1
5    2
6    2
7    3
8    3
9    4
dtype: int32
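
As a quick sanity check, the two results of divmod() should agree with the // and % operators applied separately, and the original values should be recoverable from them. A minimal sketch (the name restored is introduced here for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))

div, rem = divmod(s, 3)

# divmod gives the same results as floor division and modulo done separately.
print(div.equals(s // 3))
print(rem.equals(s % 3))

# The original values can be rebuilt from quotient and remainder.
restored = div * 3 + rem
print(restored.equals(s))
```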

Missing data / operations with fill values

In Series and DataFrame, the arithmetic functions have the option of inputting a fill_value, namely a value to
substitute when at most one of the values at a location is missing. For example, when adding two DataFrame objects,
you may wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will
be NaN (you can later replace NaN with some other value using fillna if you wish).

In [26]:
df
Out[26]:
        one       two     three
a  1.218453 -0.350691       NaN
b -0.542001 -0.419797 -0.201188
c       NaN -0.285277 -0.299671
d       NaN       NaN -0.909407
f       NaN       NaN  0.118755
In [27]:
df2 = pd.DataFrame(np.random.randint(low=8, high=10, size=(5, 5)),
                   columns=['a', 'b', 'c', 'd', 'f'])
In [28]:
df2
Out[28]:
   a  b  c  d  f
0  8  8  9  9  8
1  8  8  8  9  8
2  8  8  9  8  8
3  8  8  8  8  9
4  8  8  9  9  9
In [29]:
df = pd.DataFrame(np.random.randint(low=6, high=8, size=(5, 5)),
                  columns=['a', 'b', 'c', 'd', 'f'])
In [30]:
df
Out[30]:
   a  b  c  d  f
0  7  7  7  6  6
1  7  7  6  6  7
2  7  6  7  7  6
3  7  7  6  7  7
4  6  6  7  6  6
In [31]:
df + df2
Out[31]:
    a   b   c   d   f
0  15  15  16  15  14
1  15  15  14  15  15
2  15  14  16  15  14
3  15  15  14  15  16
4  14  14  16  15  15
In [32]:
df.add(df2, fill_value=0)
Out[32]:
    a   b   c   d   f
0  15  15  16  15  14
1  15  15  14  15  15
2  15  14  16  15  14
3  15  15  14  15  16
4  14  14  16  15  15
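
The integer frames above contain no missing values, so + and add(fill_value=0) give identical results there. The difference only shows up when NaNs are present. Here is a minimal sketch with hypothetical frames left and right (names and values chosen for illustration):

```python
import numpy as np
import pandas as pd

# Row 0: both present, row 1: only `right` present, row 2: both missing.
left = pd.DataFrame({"x": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"x": [10.0, 20.0, np.nan]})

# Plain addition: any NaN operand makes the result NaN.
plain = left + right

# With fill_value=0, a NaN on one side is treated as 0;
# the result is NaN only where both sides are missing.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```

In this sketch, row 1 is NaN under plain addition but 20.0 with fill_value=0, while row 2 stays NaN in both cases.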

Flexible comparisons

Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge whose behavior is analogous
to the binary arithmetic operations described above:

In [33]:
df.gt(df2)
Out[33]:
       a      b      c      d      f
0  False  False  False  False  False
1  False  False  False  False  False
2  False  False  False  False  False
3  False  False  False  False  False
4  False  False  False  False  False
In [34]:
df2.ne(df)
Out[34]:
      a     b     c     d     f
0  True  True  True  True  True
1  True  True  True  True  True
2  True  True  True  True  True
3  True  True  True  True  True
4  True  True  True  True  True

These operations produce a pandas object of the same type as the left-hand-side input that is of dtype bool.
These boolean objects can be used in indexing operations.
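
As an illustration of using such a boolean result for indexing, the sketch below builds a mask with lt() and uses it to select rows. The frame and the names mask and selected are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 3], "b": [4, 2, 6]})

# Boolean Series: True where column 'a' is less than column 'b'.
mask = df["a"].lt(df["b"])   # same result as df["a"] < df["b"]

# Keep only the rows where the mask is True.
selected = df[mask]

print(mask)
print(selected)
```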

Boolean reductions

You can apply the reductions: empty, any(), all(), and bool() to provide a way to summarize a boolean result.

In [35]:
(df > 0).all()
Out[35]:
a    True
b    True
c    True
d    True
f    True
dtype: bool
In [36]:
(df > 0).any()
Out[36]:
a    True
b    True
c    True
d    True
f    True
dtype: bool

You can reduce to a final boolean value.

In [37]:
(df > 0).any().any()
Out[37]:
True
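
Since every value in the frame above is positive, all of these reductions return True. A small sketch with mixed signs (hypothetical data) shows how all() behaves per column, per row, and when collapsed to a single value:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, -1], "b": [2, 3]})
positive = df > 0

per_column = positive.all()          # one boolean per column
per_row = positive.all(axis=1)       # one boolean per row
overall = positive.all().all()       # a single boolean for the whole frame

print(per_column)
print(per_row)
print(overall)
```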

You can test if a pandas object is empty, via the empty property.

In [38]:
df.empty
Out[38]:
False
In [39]:
pd.DataFrame(columns=list('ABC')).empty
Out[39]:
True

To evaluate single-element pandas objects in a boolean context, use the method bool():

In [40]:
pd.Series([True]).bool()
Out[40]:
True
In [41]:
pd.Series([False]).bool()
Out[41]:
False
In [42]:
pd.DataFrame([[True]]).bool()
Out[42]:
True
In [43]:
pd.DataFrame([[False]]).bool()
Out[43]:
False

Comparing if objects are equivalent

Often you may find that there is more than one way to compute the same result. As a simple example, consider df + df and df * 2. To test that these two computations produce the same result, given the tools shown above, you might imagine using (df + df == df * 2).all(). But in fact, this expression can be False when df contains missing values:

In [44]:
df + df == df * 2
Out[44]:
      a     b     c     d     f
0  True  True  True  True  True
1  True  True  True  True  True
2  True  True  True  True  True
3  True  True  True  True  True
4  True  True  True  True  True
In [45]:
(df + df == df * 2).all()
Out[45]:
a    True
b    True
c    True
d    True
f    True
dtype: bool

Notice that the boolean DataFrame df + df == df * 2 can contain False values when df has missing data (the
integer-valued df used here has none, so every entry above is True). This is because NaNs
do not compare as equals:

In [46]:
np.nan == np.nan
Out[46]:
False

So, NDFrames (such as Series and DataFrames) have an equals() method for testing equality, with NaNs in corresponding
locations treated as equal.

In [47]:
(df + df).equals(df * 2)
Out[47]:
True

Note that the Series or DataFrame index needs to be in the same order for equality to be True:

In [48]:
df1 = pd.DataFrame({'col': ['boo', 0, np.nan]})
In [49]:
df2 = pd.DataFrame({'col': [np.nan, 0, 'boo']}, index=[2, 1, 0])
In [50]:
df1.equals(df2)
Out[50]:
False
In [51]:
df1.equals(df2.sort_index())
Out[51]:
True
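
To see the NaN behavior directly, the following sketch uses a small hypothetical frame df_nan that does contain a missing value. The elementwise comparison reports False at the NaN position, while equals() treats the two results as identical:

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({"x": [1.0, np.nan]})

# Elementwise comparison: NaN == NaN is False, so the second entry is False.
elementwise = (df_nan + df_nan) == (df_nan * 2)
print(elementwise)

# equals() treats NaNs in corresponding locations as equal.
same = (df_nan + df_nan).equals(df_nan * 2)
print(same)
```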

Comparing array-like objects

You can conveniently perform element-wise comparisons when comparing a pandas data structure with a scalar value:

In [52]:
pd.Series(['boo', 'far', 'baz']) == 'boo'
Out[52]:
0     True
1    False
2    False
dtype: bool
In [53]:
pd.Index(['boo', 'far', 'baz']) == 'boo'
Out[53]:
array([ True, False, False])

Pandas also handles element-wise comparisons between different array-like objects of the same length:

In [54]:
pd.Series(['boo', 'far', 'aaz']) == pd.Index(['boo', 'far', 'qux'])
Out[54]:
0     True
1     True
2    False
dtype: bool
In [55]:
pd.Series(['boo', 'far', 'aaz']) == np.array(['boo', 'far', 'qux'])
Out[55]:
0     True
1     True
2    False
dtype: bool

Trying to compare Index or Series objects of different lengths will raise a ValueError:

In [ ]:
pd.Series(['boo', 'far', 'aaz']) == pd.Series(['boo', 'far'])
ValueError: Series lengths must match to compare
In [ ]:
pd.Series(['boo', 'far', 'aaz']) == pd.Series(['boo'])
ValueError: Series lengths must match to compare

Note that this is different from the NumPy behavior where a comparison can be broadcast:

In [ ]:
np.array([1, 2, 3, 4]) == np.array([3])
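
The contrast can be reproduced in one short script. This is a sketch only: the exact wording of the pandas error message varies between versions, so the code catches the exception type and does not rely on the message text.

```python
import numpy as np
import pandas as pd

# NumPy broadcasts the length-1 array against the length-4 array.
broadcast = np.array([1, 2, 3, 4]) == np.array([3])
print(broadcast)

# pandas refuses to compare Series of different lengths.
try:
    pd.Series([1, 2, 3, 4]) == pd.Series([3])
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```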

Combining overlapping data sets

A problem occasionally arising is the combination of two similar data sets where values in one are preferred
over the other. An example would be two data series representing a particular economic indicator where
one is considered to be of “higher quality”. However, the lower quality series might extend further back in history
or have more complete data coverage. As such, we would like to combine two DataFrame objects where missing values
in one DataFrame are conditionally filled with like-labeled values from the other DataFrame. The function implementing
this operation is combine_first(), which we illustrate:

In [57]:
df1 = pd.DataFrame({'A': [1., np.nan, 4., np.nan],
                    'B': [np.nan, 2., 3., 6.]})
In [58]:
df2 = pd.DataFrame({'A': [1., 2., 4., np.nan, 3.],
                    'B': [np.nan, 3., 4., 8.,5.]})
In [59]:
df1
Out[59]:
     A    B
0  1.0  NaN
1  NaN  2.0
2  4.0  3.0
3  NaN  6.0
In [60]:
df2
Out[60]:
     A    B
0  1.0  NaN
1  2.0  3.0
2  4.0  4.0
3  NaN  8.0
4  3.0  5.0
In [61]:
df1.combine_first(df2)
Out[61]:
     A    B
0  1.0  NaN
1  2.0  2.0
2  4.0  3.0
3  NaN  6.0
4  3.0  5.0

General DataFrame combine

The combine_first() method above calls the more general DataFrame.combine(). This method takes another
DataFrame and a combiner function, aligns the input DataFrame and then passes the combiner function pairs of
Series (i.e., columns whose names are the same).

So, for instance, to reproduce combine_first() as above:

In [62]:
def combiner(a, b):
    return np.where(pd.isna(a), b, a)
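
The combiner is then passed to DataFrame.combine() together with the other frame. The sketch below repeats the df1 and df2 definitions from the combine_first() example so that it runs on its own; the name combined is introduced here for illustration:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1., np.nan, 4., np.nan],
                    'B': [np.nan, 2., 3., 6.]})
df2 = pd.DataFrame({'A': [1., 2., 4., np.nan, 3.],
                    'B': [np.nan, 3., 4., 8., 5.]})


def combiner(a, b):
    # Take the value from `a` unless it is missing, in which case use `b`.
    return np.where(pd.isna(a), b, a)


# combine() aligns the two frames and calls combiner once per column pair.
combined = df1.combine(df2, combiner)
print(combined)
```

For this data the result should contain the same values as df1.combine_first(df2) shown earlier.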