Why is the result of pandas' melt Fortran-contiguous and not C-contiguous?

Time:03-14

I ran into some pandas melt behavior that undermines my mental model of that function and I wonder if somebody could explain why this is sane/logical/desirable behavior.

The following snippet melts a dataframe and then converts the result into a numpy array. Since I'm melting all columns, I would have expected the result to be similar to what np.ndarray.ravel() would do, i.e., create a 1D view into the data and add a column with the respective column names (var names). However, to my surprise, melt actually makes a copy of the data and reorders it as F-contiguous. Why is F-contiguity a good idea here?

import numpy as np
import pandas as pd

expected_flat = np.arange(100*3)
expected_full = expected_flat.reshape(100, 3)

# expected_full is view into flat array
assert expected_full.base is expected_flat
assert expected_flat.flags["C_CONTIGUOUS"]

test_df = pd.DataFrame(
    expected_flat.reshape(100, 3),
    columns=["a", "b", "c"],
)

# test_df, too, is a view into flat array
reconstructed = test_df.to_numpy()
assert reconstructed.base is expected_flat

flatten_melt = test_df.melt(var_name="col", value_name="foobar")
flatten_melt_numpy = flatten_melt.foobar.to_numpy()

# flatten_melt is NOT a view, and its values are reordered
assert flatten_melt_numpy.base is not expected_flat
assert not np.allclose(flatten_melt_numpy, expected_flat)

# the confusing part is that the values are now in F-contiguous (column-major) order
reconstructed_melt = flatten_melt_numpy.reshape(100, 3, order="F")
assert np.allclose(reconstructed_melt, expected_full)
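The observation can be condensed into a comparison against the two ravel orders. This is only a minimal sketch on the same 100 x 3 data; the names `col_major`, `row_major`, `same_as_f` and `same_as_c` are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.arange(100 * 3).reshape(100, 3)
df = pd.DataFrame(values, columns=["a", "b", "c"])

# The melted value column as a plain 1D array.
melted = df.melt(var_name="col", value_name="foobar")["foobar"].to_numpy()

# Column-major flattening: all of "a", then all of "b", then all of "c".
col_major = values.ravel(order="F")
# Row-major flattening: what np.ndarray.ravel() does by default.
row_major = values.ravel()

same_as_f = np.array_equal(melted, col_major)
same_as_c = np.array_equal(melted, row_major)
print(same_as_f, same_as_c)
```

So the melted values line up with `ravel(order="F")` rather than with the default row-major `ravel()`.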

Answer:

Construct a frame from a pair of "series":

In [322]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
In [323]: df
Out[323]: 
   a  b
0  1  4
1  2  5
2  3  6
In [324]: arr = df.to_numpy()
In [325]: arr
Out[325]: 
array([[1, 4],
       [2, 5],
       [3, 6]])
In [326]: arr.flags
Out[326]: 
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  ...
In [327]: arr.strides
Out[327]: (8, 24)

The resulting array is F_CONTIGUOUS.
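The stride values follow from the shape and the 8-byte item size of int64. As a rough sketch (the helper `expected_strides` is hypothetical, not part of numpy), the strides of a contiguous array can be computed by hand and compared with what numpy reports:

```python
import numpy as np

def expected_strides(shape, itemsize, order="C"):
    """Hypothetical helper: byte strides of a contiguous array."""
    strides = []
    step = itemsize
    # In C order the last axis varies fastest, in F order the first axis does.
    dims = reversed(shape) if order == "C" else shape
    for n in dims:
        strides.append(step)
        step *= n
    return tuple(reversed(strides)) if order == "C" else tuple(strides)

c_arr = np.arange(1, 7, dtype=np.int64).reshape(3, 2)  # row-major
f_arr = np.asfortranarray(c_arr)                       # column-major copy

print(c_arr.strides, expected_strides(c_arr.shape, 8, "C"))
print(f_arr.strides, expected_strides(f_arr.shape, 8, "F"))
```

For a (3, 2) array, moving down a column costs 8 bytes in order 'F' (the column is stored consecutively), while in order 'C' it costs a full row of 16 bytes.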

If I make a frame from a 2d array, the values keep the layout of the input, in this case order 'C':

In [328]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [329]: df1
Out[329]: 
   a  b
0  1  2
1  3  4
2  5  6
In [330]: df1.to_numpy().strides
Out[330]: (16, 8)

Creating it with order 'F' gives the same result as in the first case:

In [332]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2, order="F"), columns=[
     ...: "a", "b"])
In [333]: df1
Out[333]: 
   a  b
0  1  4
1  2  5
2  3  6
In [334]: df1.to_numpy().strides
Out[334]: (8, 24)

melt

Going back to the frame created from an order 'C' array:

In [335]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [336]: df2 = df1.melt()
In [337]: df2
Out[337]: 
  variable  value
0        a      1
1        a      3
2        a      5
3        b      2
4        b      4
5        b      6

Notice how the value column is a vertical concatenation of the 'a' and 'b' columns. This is what the method's examples show. I don't use pivot enough to know if this is a natural interpretation of that or not.
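To make the "vertical concatenation" reading concrete, here is a small sketch that stacks the two columns by hand and compares the result with melt. The names `stacked`, `melted` and `values_match` are just for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])

# Stack column "a" on top of column "b", discarding the old index.
stacked = pd.concat([df1["a"], df1["b"]], ignore_index=True)

melted = df1.melt()
values_match = stacked.tolist() == melted["value"].tolist()
labels = melted["variable"].tolist()

print(stacked.tolist())
print(labels)
print(values_match)
```

Each input column stays together as one run of rows in the output, which is why the values come out column by column.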

Converting the melted frame to an array gives order 'F' strides:

In [338]: df2.to_numpy()
Out[338]: 
array([['a', 1],
       ['a', 3],
       ['a', 5],
       ['b', 2],
       ['b', 4],
       ['b', 6]], dtype=object)
In [339]: _.strides
Out[339]: (8, 48)

In df1 both columns are int dtype, and can be stored as a 2d array:

In [340]: df1.dtypes
Out[340]: 
a    int64
b    int64
dtype: object

The df2 columns have different dtypes, object (string) and int, so they are stored as separate arrays. to_numpy constructs an object dtype array from them, but it is order 'F':

In [341]: df2.dtypes
Out[341]: 
variable    object
value        int64
dtype: object
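A small sketch of what that means in practice, using a made-up three-row frame with the same two column types. The memory layout of the combined array is an implementation detail that may differ between pandas versions, so the flags are only printed here; `np.ascontiguousarray` is one way to get an order 'C' copy if downstream code needs it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the melted frame: one string column, one int column.
df2 = pd.DataFrame({"variable": ["a", "a", "b"], "value": [1, 3, 2]})

# The whole frame has no common numeric dtype, so it becomes an object array.
whole = df2.to_numpy()
# A single column keeps its own dtype.
value_only = df2["value"].to_numpy()

print(whole.dtype, value_only.dtype)
print(whole.flags["C_CONTIGUOUS"], whole.flags["F_CONTIGUOUS"])

# Force a row-major array when the layout matters.
c_ordered = np.ascontiguousarray(whole)
print(c_ordered.flags["C_CONTIGUOUS"])
```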

We get a hint of this storage from:

In [352]: df1._mgr
Out[352]: 
BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(0, 2, 1), 2 x 3, dtype: int64
In [353]: df2._mgr
Out[353]: 
BlockManager
Items: Index(['variable', 'value'], dtype='object')
Axis 1: RangeIndex(start=0, stop=6, step=1)
ObjectBlock: slice(0, 1, 1), 1 x 6, dtype: object
NumericBlock: slice(1, 2, 1), 1 x 6, dtype: int64

How a dataframe stores its values is a complex subject, and I have not read a comprehensive description. I've only gathered bits and pieces from experimenting like this.
