I have a dataframe with a few different types in it. For example:
df = pd.DataFrame({'A': ['A', 'B', 'C', 'D'],
                   'B': np.random.randint(10, size=4),
                   'C': np.random.randint(10, size=4),
                   'D': np.random.rand(4),
                   'E': np.random.rand(4)})
The dtypes are
>>> df.dtypes
A object
B int32
C int32
D float64
E float64
dtype: object
I want to be able to extract the values of B and C as a numpy array of dtype np.int32, directly from the third row of df. Seems straightforward:
>>> df.iloc[2][['B', 'C']].to_numpy()
array([9, 9], dtype=object)
This is consistent with the fact that the Series is of type object:
>>> df.iloc[2]
A C
B 9
C 9
D 0.211487
E 0.857848
Name: 2, dtype: object
So maybe I shouldn't get the row first:
>>> df.loc[df.index[2], ['B', 'C']].to_numpy()
array([9, 9], dtype=object)
Still no luck. Of course I can always post-process and do
df.loc[df.index[2], ['B', 'C']].to_numpy().astype(np.int32)
However, is there a way to extract a set of same-dtype columns into a numpy array with their native dtype, using just indexing?
CodePudding user response:
V1
The answer was of course to go in the opposite direction from iloc: extract the columns with a consistent dtype first, so that the row is a contiguous block:
>>> df[['B', 'C']].iloc[2].to_numpy()
array([9, 9])
Which tells me that I shouldn't be using pandas directly except to load my data to begin with.
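A reproducible sketch of that column-first approach (the seeded rng and the generator API are my additions, not from the answer):

```python
import numpy as np
import pandas as pd

# Seeded version of the frame from the question
rng = np.random.default_rng(0)
df = pd.DataFrame({'A': list('ABCD'),
                   'B': rng.integers(10, size=4),
                   'C': rng.integers(10, size=4),
                   'D': rng.random(4),
                   'E': rng.random(4)})

# Select the homogeneous columns first, then the row:
row = df[['B', 'C']].iloc[2].to_numpy()
print(row.dtype)  # an integer dtype, not object

# Row first gives an object array, as in the question:
print(df.iloc[2][['B', 'C']].to_numpy().dtype)  # object
```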
V2
It turns out that pd.DataFrame.to_numpy and pd.Series.to_numpy have a dtype argument that you can use to do the conversion. That means the loc/iloc approaches can work too, although this still requires an additional conversion and a-priori knowledge of the dtype:
>>> df.loc[df.index[2], ['B', 'C']].to_numpy(dtype=np.int32)
array([9, 9])
and
>>> df.iloc[2][['B', 'C']].to_numpy(dtype=np.int32)
array([9, 9])
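If the dtype shouldn't be hard-coded, one option (my suggestion, not part of the answer) is to derive it from the selected columns themselves with np.result_type:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [5, 5, 0, 9], 'C': [0, 0, 5, 9]})

cols = ['B', 'C']
# Promote the column dtypes to a common one instead of hard-coding it
dtype = np.result_type(*df[cols].dtypes)
row = df.iloc[2][cols].to_numpy(dtype=dtype)
print(row, row.dtype)
```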
CodePudding user response:
As an addendum to Mad's answer:
In [107]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 4 non-null object
1 B 4 non-null int64
2 C 4 non-null int64
3 D 4 non-null float64
4 E 4 non-null float64
dtypes: float64(2), int64(2), object(1)
memory usage: 288.0 bytes
I stumbled upon _mgr, which apparently manages how the data is actually stored. It looks like it groups columns of like dtype together, storing the data as (#col, #row) arrays:
In [108]: df._mgr
Out[108]:
BlockManager
Items: Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
Axis 1: RangeIndex(start=0, stop=4, step=1)
FloatBlock: slice(3, 5, 1), 2 x 4, dtype: float64
IntBlock: slice(1, 3, 1), 2 x 4, dtype: int64
ObjectBlock: slice(0, 1, 1), 1 x 4, dtype: object
Selecting the 2 int columns:
In [109]: df[['B','C']]._mgr
Out[109]:
BlockManager
Items: Index(['B', 'C'], dtype='object')
Axis 1: RangeIndex(start=0, stop=4, step=1)
IntBlock: slice(0, 2, 1), 2 x 4, dtype: int64
and hence we get an int dtype array without further arguments:
In [110]: df[['B','C']].values
Out[110]:
array([[5, 0],
[5, 0],
[0, 5],
[9, 9]])
For single-block cases (e.g. all int columns), values is (or at least can be) a view of the frame's data. But that doesn't appear to be the case here.
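A quick way to probe that view-vs-copy question is np.shares_memory (my choice of test, not from the answer):

```python
import numpy as np
import pandas as pd

# Single-block frame: all columns share one int block
df_int = pd.DataFrame({'B': [5, 5, 0, 9], 'C': [0, 0, 5, 9]})
# .values may be a view of the block here (version-dependent):
print(np.shares_memory(df_int.values, df_int['B'].to_numpy()))

# Mixed-dtype frame: data lives in separate blocks, so .values
# must allocate a new (upcast) array and cannot share memory:
df_mixed = pd.DataFrame({'B': [1, 2], 'D': [0.5, 1.5]})
print(np.shares_memory(df_mixed.values, df_mixed['B'].to_numpy()))  # False
```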
For a single row:
In [116]: df.iloc[2]._mgr
Out[116]:
SingleBlockManager
Items: Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
ObjectBlock: 5 dtype: object
The row selection is a Series, so it can't hold the mixed dtypes of a dataframe. But a "multirow" selection is a frame:
In [128]: df.iloc[2][['B','C']].values
Out[128]: array([0, 5], dtype=object)
In [129]: df.iloc[[2]][['B','C']].values
Out[129]: array([[0, 5]])
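Putting the two selections side by side, the rule of thumb is: keep the selection 2-D (a frame) until the very end, so pandas can hand back the homogeneous block's native dtype:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('ABCD'),
                   'B': [5, 5, 0, 9],
                   'C': [0, 0, 5, 9],
                   'D': np.linspace(0, 1, 4)})

# 1-D row first -> object Series -> object array
print(df.iloc[2][['B', 'C']].values.dtype)    # object

# 2-D selection -> single int block -> native integer dtype
print(df.iloc[[2]][['B', 'C']].values.dtype)  # an integer dtype
```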