Access segment of row with correct dtype


I have a dataframe with a few different types in it. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['A', 'B', 'C', 'D'],
                   'B': np.random.randint(10, size=4),
                   'C': np.random.randint(10, size=4),
                   'D': np.random.rand(4),
                   'E': np.random.rand(4)})

The dtypes are

>>> df.dtypes
A     object
B      int32
C      int32
D    float64
E    float64
dtype: object

I want to be able to extract the values of B and C in a numpy array of dtype np.int32 directly from the third row of df. Seems straightforward:

>>> df.iloc[2][['B', 'C']].to_numpy()
array([9, 9], dtype=object)

This is consistent with the fact that the Series is of type object:

>>> df.iloc[2]
A           C
B           9
C           9
D    0.211487
E    0.857848
Name: 2, dtype: object

So maybe I shouldn't get the row first:

>>> df.loc[df.index[2], ['B', 'C']].to_numpy()
array([9, 9], dtype=object)

Still no luck. Of course I can always post-process and do

df.loc[df.index[2], ['B', 'C']].to_numpy().astype(np.int32)

However, is there a way to extract a set of columns of the same dtype with their native dtype into a numpy array using just indexing?

CodePudding user response:

V1

The answer was of course going in the opposite direction from iloc: extracting columns with a consistent dtype first, so that the row could be a contiguous block:

>>> df[['B', 'C']].iloc[2].to_numpy()
array([9, 9])

Which tells me that I shouldn't be using pandas directly except to load my data to begin with.
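
For what it's worth, checking the dtypes along this route confirms that nothing gets upcast to object on the way (the integer width just reflects the platform, int32 here):

>>> df[['B', 'C']].dtypes
B    int32
C    int32
dtype: object
>>> df[['B', 'C']].iloc[2].to_numpy().dtype
dtype('int32')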

V2

It turns out that pd.DataFrame.to_numpy and pd.Series.to_numpy have a dtype argument that you can use to do the conversion. That means that the loc/iloc approaches can work too, although this still requires an additional conversion and a priori knowledge of the dtype:

>>> df.loc[df.index[2], ['B', 'C']].to_numpy(dtype=np.int32)
array([9, 9])

and

>>> df.iloc[2][['B', 'C']].to_numpy(dtype=np.int32)
array([9, 9])
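
The same dtype argument also works when converting the whole two-column selection at once, so the cast does not have to be repeated per row; for example, with the same df:

>>> df[['B', 'C']].to_numpy(dtype=np.int32)[2]
array([9, 9])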

CodePudding user response:

As an addendum to Mad's answer:

In [107]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       4 non-null      object 
 1   B       4 non-null      int64  
 2   C       4 non-null      int64  
 3   D       4 non-null      float64
 4   E       4 non-null      float64
dtypes: float64(2), int64(2), object(1)
memory usage: 288.0  bytes

I stumbled upon the _mgr attribute, which apparently manages how the data is actually stored. It looks like it tries to group columns of like dtype together, storing the data as (#col, #row) arrays:

In [108]: df._mgr
Out[108]: 
BlockManager
Items: Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
Axis 1: RangeIndex(start=0, stop=4, step=1)
FloatBlock: slice(3, 5, 1), 2 x 4, dtype: float64
IntBlock: slice(1, 3, 1), 2 x 4, dtype: int64
ObjectBlock: slice(0, 1, 1), 1 x 4, dtype: object

Selecting the 2 int columns:

In [109]: df[['B','C']]._mgr
Out[109]: 
BlockManager
Items: Index(['B', 'C'], dtype='object')
Axis 1: RangeIndex(start=0, stop=4, step=1)
IntBlock: slice(0, 2, 1), 2 x 4, dtype: int64

and hence we get an int-dtype array without further arguments:

In [110]: df[['B','C']].values
Out[110]: 
array([[5, 0],
       [5, 0],
       [0, 5],
       [9, 9]])
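
A related convenience, if you would rather not list the column names at all, is to pick the columns up by dtype; a small sketch using select_dtypes (np.integer is used so that it matches whichever of int32/int64 the platform produced):

# All integer columns as one like-typed frame, without naming B and C.
ints = df.select_dtypes(include=[np.integer])
print(ints.to_numpy()[2])      # third row, native integer dtype preserved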

For single-block cases (e.g. all int columns) the values array is (or at least can be) a view of the frame's data. But that doesn't appear to be the case here.
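
One rough way to check that, assuming the internal block layout shown above (keeping in mind that _mgr is an implementation detail and that copy-on-write in newer pandas changes the behaviour), is to test whether the extracted array shares memory with the underlying block:

sub = df[['B', 'C']]                 # all-int selection -> a single IntBlock
arr = sub.values                     # 2-D int array
blk = sub._mgr.blocks[0].values      # the block's own (2 x 4) storage

# True when .values is handed out as a view of the block, False when it copied
print(np.shares_memory(arr, blk))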

For a single row:

In [116]: df.iloc[2]._mgr
Out[116]: 
SingleBlockManager
Items: Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
ObjectBlock: 5 dtype: object

The row selection is a Series, so it can't hold the mixed dtypes of a dataframe.

But a "multirow" selection is a frame:

In [128]: df.iloc[2][['B','C']].values
Out[128]: array([0, 5], dtype=object)
In [129]: df.iloc[[2]][['B','C']].values
Out[129]: array([[0, 5]])
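
So if a 2-D result is acceptable, keeping both the row and the column selection on the frame preserves the native dtype, and the row can be flattened afterwards; for example, with the same df:

row2d = df[['B', 'C']].iloc[[2]].to_numpy()   # array([[0, 5]]), int dtype kept
row1d = row2d[0]                              # array([0, 5]), still int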