dtype Int64 doesn't return view of underlying data?

I have two DataFrames of shape (5, 5): one of dtype int64 and the other of dtype Int64 (pd.Int64Dtype).

import numpy as np
import pandas as pd

np.random.seed(2021)
data = np.arange(25).reshape((5, 5))
one  = pd.DataFrame(data, dtype='int64')
two  = pd.DataFrame(data.copy(), dtype='Int64') # Notice the capital 'I'
r, c = np.random.randint(0, 5, (2, 5)) # 5 random (row, column) positions

The problem occurs when I try to change the underlying data.

one.to_numpy()[r, c] = 99 # Changes the underlying data
print(one)
    0   1   2   3   4
0   0  99   2   3   4
1   5   6   7   8  99
2  10  11  12  13  14
3  15  99  17  18  19
4  99  99  22  23  24

two.to_numpy()[r, c] = 99 # Doesn't change the underlying data
print(two)
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24

I understand that DataFrame.to_numpy doesn't necessarily return a view.

DataFrame.to_numpy():

copy: bool, default False

Whether to ensure that the returned value is not a view on another array. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensure that a copy is made, even if not strictly necessary.

How can I change the values at the given positions (r, c) in the DataFrame in a vectorized way? I already have a solution that loops with .iloc. For what it's worth, my pandas version is 1.3.1.

CodePudding user response：

That's correct: a DataFrame with dtype Int64 can't be modified through the array returned by to_numpy. Its columns are stored as 5 separate ExtensionBlocks rather than as a single numeric block, so to_numpy has to assemble a new array and cannot return a modifiable reference to the underlying data.

You can observe this by accessing the blocks from the manager (note this is just for observation purposes):

print('One Blocks')
for blk in one._mgr.blocks:
    print(blk)

print('Two Blocks')
for blk in two._mgr.blocks:
    print(blk)

Output:

One Blocks
NumericBlock: slice(0, 5, 1), 5 x 5, dtype: int64
Two Blocks
ExtensionBlock: slice(0, 1, 1), 1 x 5, dtype: Int64
ExtensionBlock: slice(1, 2, 1), 1 x 5, dtype: Int64
ExtensionBlock: slice(2, 3, 1), 1 x 5, dtype: Int64
ExtensionBlock: slice(3, 4, 1), 1 x 5, dtype: Int64
ExtensionBlock: slice(4, 5, 1), 1 x 5, dtype: Int64

Notice that the DataFrame two stores its columns as separate blocks, so converting it to an array calls _interleave, and as the comment in the pandas source notes, "The underlying data was copied within _interleave".

This is true for every DataFrame that contains more than one block, which means that something as simple as:

df = pd.DataFrame({'A': [1, 2], 'B': ['a', 'b']})
df.to_numpy()[0, 0] = 5  # No Change
print(df)

   A  B
0  1  a
1  2  b

also cannot be modified in this way.

For reference, the blocks of df:

# df._mgr.blocks

NumericBlock: slice(0, 1, 1), 1 x 2, dtype: int64
ObjectBlock: slice(1, 2, 1), 1 x 2, dtype: object
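
To check for a view without relying on the private _mgr attribute, one option is to ask NumPy whether two calls to to_numpy share memory. This is only a sketch, and the names one_is_view and two_is_view are made up for illustration. Newer pandas versions with Copy-on-Write enabled may hand back a read-only view, so the check says whether memory is shared, not whether writing through the array is allowed.

```python
import numpy as np
import pandas as pd

data = np.arange(25).reshape((5, 5))
one = pd.DataFrame(data, dtype="int64")         # single NumPy-backed block
two = pd.DataFrame(data.copy(), dtype="Int64")  # one extension block per column

# Two arrays that view the same block share memory; two fresh copies do not.
one_is_view = np.shares_memory(one.to_numpy(), one.to_numpy())
two_is_view = np.shares_memory(two.to_numpy(), two.to_numpy())

print("int64 frame shares memory across calls:", one_is_view)
print("Int64 frame shares memory across calls:", two_is_view)
```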

With this in mind, one option is to modify the copy produced by to_numpy and reconstruct the DataFrame from it:

a = two.to_numpy()  # Store New Array
a[r, c] = 99  # Update The Values
# Reconstruct the DataFrame
two = pd.DataFrame(a, index=two.index, columns=two.columns, dtype='Int64')

Alternatively, astype can be given the original dtypes so that each column maps back to the appropriate dtype (helpful when the DataFrame has multiple dtypes):

two = pd.DataFrame(a, index=two.index, columns=two.columns).astype(two.dtypes)

Output:

print(two)

    0   1   2   3   4
0   0  99   2   3   4
1   5   6   7   8  99
2  10  11  12  13  14
3  15  99  17  18  19
4  99  99  22  23  24


print(two.dtypes)
0    Int64
1    Int64
2    Int64
3    Int64
4    Int64
dtype: object
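
The astype variant is mostly useful when the columns have different dtypes. As a rough sketch, here is the same round trip on the small mixed-dtype df from earlier (the name rebuilt is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": ["a", "b"]})

a = df.to_numpy()  # mixed dtypes are interleaved into a new object array
a[0, 0] = 5        # edit the copy; df itself is untouched

# Rebuild with the original labels, then restore each column's dtype
rebuilt = pd.DataFrame(a, index=df.index, columns=df.columns).astype(df.dtypes)

print(rebuilt)
print(rebuilt.dtypes)
```

Without the astype call, both columns would stay object, because that is the dtype of the intermediate array.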

For a single replacement value like this, however, building a 2D boolean mask with numpy is likely the better approach:

# Build Boolean Mask (default False)
result = np.zeros(two.shape, dtype='bool')
result[r, c] = True  # Set True Locations
two = two.mask(result, 99)  # DataFrame.mask to replace values

Or the inverse mask with DataFrame.where:

# Build Boolean Mask (default True)
result = np.ones(two.shape, dtype='bool')
result[r, c] = False  # Set False Locations
two = two.where(result, 99)  # DataFrame.where to replace values

Both produce:

print(two)
    0   1   2   3   4
0   0  99   2   3   4
1   5   6   7   8  99
2  10  11  12  13  14
3  15  99  17  18  19
4  99  99  22  23  24


print(two.dtypes)
0    Int64
1    Int64
2    Int64
3    Int64
4    Int64
dtype: object

The benefit of the mask and where approaches is that no dtype information is lost.
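
If this comes up often, the mask approach can be wrapped in a small function. The sketch below uses a hypothetical helper named set_positions (not a pandas API) and hard-codes the positions that appear in the output above instead of drawing them at random:

```python
import numpy as np
import pandas as pd


def set_positions(df, rows, cols, value):
    """Hypothetical helper: return a copy of df with value at each (rows[i], cols[i])."""
    hit = np.zeros(df.shape, dtype=bool)  # boolean mask, False everywhere
    hit[rows, cols] = True                # mark the positions to replace
    return df.mask(hit, value)            # DataFrame.mask returns a new frame


two = pd.DataFrame(np.arange(25).reshape((5, 5)), dtype="Int64")
rows = [0, 1, 3, 4, 4]
cols = [1, 4, 1, 0, 1]

updated = set_positions(two, rows, cols, 99)
print(updated)
print(updated.dtypes)
```

The function returns a new DataFrame and leaves its input unchanged, which mirrors the two = two.mask(...) reassignment used above.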