In [3]:
import numpy as np
import pandas as pd
In [4]:
index = pd.date_range('1/1/2019', periods=6)
In [5]:
df = pd.DataFrame(np.random.randn(6, 4), index=index,
                  columns=['P', 'Q', 'R','S'])
In [6]:
df.head(2)
                   P         Q         R         S
2019-01-01 -0.224690  0.214687  0.549003  1.210826
2019-01-02  0.908311  0.297399  0.906352  1.899176
In [7]:
df.columns = [x.lower() for x in df.columns]
In [8]:
df
                   p         q         r         s
2019-01-01 -0.224690  0.214687  0.549003  1.210826
2019-01-02  0.908311  0.297399  0.906352  1.899176
2019-01-03  0.985992  0.929809  0.480651 -1.168464
2019-01-04 -0.380889 -0.315317 -1.078494  0.267148
2019-01-05 -0.845768 -1.134656 -0.925330 -2.668816
2019-01-06 -1.174685  0.767023 -1.120812  2.209424

Pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the actual data
and do the actual computation. For many types, the underlying array is a numpy.ndarray. However, pandas and third-party
libraries may extend NumPy’s type system to add support for custom arrays.

To get the actual data inside an Index or Series, use the .array property:

In [10]:
import numpy as np
import pandas as pd
In [11]:
s = pd.Series(np.random.randn(6), index=['a', 'b', 'c', 'd', 'e','f'])
In [12]:
s.array
[ 0.33828889307955035,  -0.6398233192505693,   0.3983874045716683,
   0.9673670376630227, -0.15334853250655173, -0.23822270531779657]
Length: 6, dtype: float64
In [13]:
s.index.array
['a', 'b', 'c', 'd', 'e', 'f']
Length: 6, dtype: object

The .array property will always return an ExtensionArray. The exact details of what an ExtensionArray is and why pandas
uses them are beyond the scope of this introduction.

If you know you need a NumPy array, use to_numpy() or numpy.asarray().

In [14]:
s.to_numpy()
array([ 0.33828889, -0.63982332,  0.3983874 ,  0.96736704, -0.15334853,
       -0.23822271])
In [15]:
np.asarray(s)
array([ 0.33828889, -0.63982332,  0.3983874 ,  0.96736704, -0.15334853,
       -0.23822271])

When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and coercing values.
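
As a minimal sketch of such coercion, the following uses a Series backed by the nullable Int64 extension array. The data and variable names are illustrative and not part of the session above. The dtype and na_value arguments of to_numpy() request a float64 result with the missing value represented as NaN, so the result has to be a newly converted array.

```python
import numpy as np
import pandas as pd

# Illustrative Series backed by an ExtensionArray (nullable integers).
nullable = pd.Series([1, 2, None], dtype="Int64")

# .array exposes the extension array itself, without conversion.
print(type(nullable.array).__name__)

# to_numpy() must coerce here: request float64 and map the missing value to NaN.
as_float = nullable.to_numpy(dtype="float64", na_value=np.nan)
print(as_float.dtype)
print(as_float)
```

Without the na_value argument, the missing entry has no float64 representation to fall back on, which is why the conversion is spelled out explicitly in this sketch.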

to_numpy() gives some control over the dtype of the resulting numpy.ndarray. For example, consider
datetimes with timezones. NumPy doesn’t have a dtype to represent timezone-aware datetimes, so there are
two possibly useful representations:

  1. An object-dtype numpy.ndarray with Timestamp objects, each with the correct tz
  2. A datetime64[ns]-dtype numpy.ndarray, where the values have been converted to UTC
    and the timezone discarded

Timezones may be preserved with dtype=object:

In [19]:
ser = pd.Series(pd.date_range('2019', periods=4, tz="CET"))
In [20]:
ser.to_numpy(dtype=object)
array([Timestamp('2019-01-01 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2019-01-02 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2019-01-03 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2019-01-04 00:00:00+0100', tz='CET', freq='D')],
      dtype=object)

Or thrown away with dtype='datetime64[ns]':

In [21]:
ser.to_numpy(dtype='datetime64[ns]')
array(['2018-12-31T23:00:00.000000000', '2019-01-01T23:00:00.000000000',
       '2019-01-02T23:00:00.000000000', '2019-01-03T23:00:00.000000000'],
      dtype='datetime64[ns]')
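
The two representations describe the same instants; only the timezone label differs. The sketch below checks this by rebuilding a CET-aware Series from the UTC array. The names as_objects, as_utc, and restored are illustrative, and the round trip through tz_localize and tz_convert is one possible way to reattach the timezone, not a step taken from the session above.

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.date_range("2019", periods=4, tz="CET"))

# Representation 1: object array of Timestamps, timezone kept on each element.
as_objects = ser.to_numpy(dtype=object)
print(as_objects[0].tz)

# Representation 2: datetime64[ns] array, values in UTC, timezone discarded.
as_utc = ser.to_numpy(dtype="datetime64[ns]")
print(as_utc.dtype)

# Reattach the timezone: label the naive values as UTC, then convert to CET.
restored = pd.Series(as_utc).dt.tz_localize("UTC").dt.tz_convert("CET")
print(bool((restored == ser).all()))
```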

Getting the “raw data” inside a DataFrame is possibly a bit more complex. When your DataFrame only has a single
data type for all the columns, DataFrame.to_numpy() will return the underlying data:

In [22]:
df.to_numpy()
array([[-0.22469034,  0.21468709,  0.54900255,  1.21082613],
       [ 0.90831079,  0.29739917,  0.90635223,  1.89917583],
       [ 0.98599168,  0.92980921,  0.48065086, -1.16846406],
       [-0.38088897, -0.31531683, -1.07849442,  0.26714793],
       [-0.84576838, -1.13465577, -0.92533   , -2.66881593],
       [-1.17468468,  0.76702328, -1.12081156,  2.20942395]])

If a DataFrame contains homogeneously-typed data, the ndarray can actually be modified in-place, and the changes
will be reflected in the data structure. For heterogeneous data (e.g. some of the DataFrame’s columns are not all
the same dtype), this will not be the case. The values attribute itself, unlike the axis labels, cannot be assigned to.

Note: When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate
all of the data involved. For example, if strings are involved, the result will be of object dtype. If there are
only floats and integers, the resulting array will be of float dtype.
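
The dtype selection described in the note can be checked with two small frames. This is a sketch with made-up data; the names df_numeric and df_mixed are illustrative.

```python
import pandas as pd

# Integers and floats only: the common dtype is a float dtype.
df_numeric = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
numeric_arr = df_numeric.to_numpy()
print(numeric_arr.dtype)

# Integers and strings: only object dtype can hold both.
df_mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
mixed_arr = df_mixed.to_numpy()
print(mixed_arr.dtype)
```

In the first case the integer column is upcast to float, so the returned array cannot be a view on the original integer data.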

In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a Series or DataFrame.
You’ll still find references to these in old code bases and online. Going forward, we recommend avoiding .values and
using .array or .to_numpy(). .values has the following drawbacks:

  1. When your Series contains an extension type, it’s unclear whether Series.values returns a NumPy array or
    the extension array. Series.array will always return an ExtensionArray, and will never copy data. Series.to_numpy()
    will always return a NumPy array, potentially at the cost of copying / coercing values.

  2. When your DataFrame contains a mixture of data types, DataFrame.values may involve copying data and coercing
    values to a common dtype, a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer
    that the returned NumPy array may not be a view on the same data in the DataFrame.
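
The first drawback can be illustrated with two extension-typed Series. This is a sketch with made-up data, and the result types noted in the comments are what the pandas versions I am aware of return for .values; they are an assumption about your installed version rather than a guarantee.

```python
import numpy as np
import pandas as pd

cat = pd.Series(["a", "b", "a"], dtype="category")
tz = pd.Series(pd.date_range("2019", periods=2, tz="CET"))

# .values: the container type depends on the extension type involved.
print(type(cat.values).__name__)  # an extension array (Categorical)
print(type(tz.values).__name__)   # a NumPy ndarray

# .array and .to_numpy() give a predictable container for both Series.
for name, ser in (("cat", cat), ("tz", tz)):
    is_ext = isinstance(ser.array, pd.api.extensions.ExtensionArray)
    is_np = isinstance(ser.to_numpy(), np.ndarray)
    print(name, is_ext, is_np)
```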