When one of my DataFrame columns is a nested list, how should I transform it into a multi-dimensional NumPy array?

I have the following data frame.

import numpy as np
import pandas as pd

test = {
    "a": [[[1, 2], [3, 4]], [[1, 2], [3, 4]]],
    "b": [[[1, 2], [3, 6]], [[1, 2], [3, 4]]],
}

df = pd.DataFrame(test)
df

	a	b
0	[[1,2],[3,4]]	[[1,2],[3,6]]
1	[[1,2],[3,4]]	[[1,2],[3,4]]

For example, I want to transform the first column into a numpy array with shape (2,2,2). If I use the following code, I get an array with shape (2,) instead of (2,2,2):

df['a'].apply(np.asarray).values

How can I get the array with shape (2,2,2)?

CodePudding user response：

Ah, stupid question. The following code works:

np.array(list(df['a']))

Does anyone have a better solution? Thanks!

CodePudding user response：

When creating dataframes that contain lists or arrays in the columns, it's a good idea to have a clear sense of what's actually stored.

In [545]: df
Out[545]: 
                  a                 b
0  [[1, 2], [3, 4]]  [[1, 2], [3, 6]]
1  [[1, 2], [3, 4]]  [[1, 2], [3, 4]]

A frame is a 2d object; a single column, a Series, is 1d.

to_numpy returns an array (np.array(df) and df.values do the same):

In [546]: df.to_numpy()
Out[546]: 
array([[list([[1, 2], [3, 4]]), list([[1, 2], [3, 6]])],
       [list([[1, 2], [3, 4]]), list([[1, 2], [3, 4]])]], dtype=object)

It's 2d, but the object dtype means its elements are references to lists rather than numbers. df.info() also reports both columns as object dtype.

In [547]: df['a'].to_numpy()
Out[547]: array([list([[1, 2], [3, 4]]), list([[1, 2], [3, 4]])], dtype=object)

to_numpy of a single column is 1d, again with object dtype.
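
As a quick sketch of the point above, the shapes and dtypes can be checked directly. The variable names (frame_arr, col_arr, applied) are just illustrative and not from the question; the frame is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [[[1, 2], [3, 4]], [[1, 2], [3, 4]]],
    "b": [[[1, 2], [3, 6]], [[1, 2], [3, 4]]],
})

# Whole frame: 2d array whose elements are Python lists
frame_arr = df.to_numpy()
print(frame_arr.shape, frame_arr.dtype)

# One column: 1d array whose elements are Python lists
col_arr = df["a"].to_numpy()
print(col_arr.shape, col_arr.dtype, type(col_arr[0]))

# The attempt from the question: each element becomes a (2, 2) array,
# but the outer container is still a 1d object array
applied = df["a"].apply(np.asarray).to_numpy()
print(applied.shape, applied.dtype, applied[0].shape)
```

So apply(np.asarray) converts each cell, but nothing joins the cells into one numeric array.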

In [548]: df['a'].to_list()
Out[548]: [[[1, 2], [3, 4]], [[1, 2], [3, 4]]]

This is a pure (nested) list. As with a hand-written nested list, it can be turned into an array with:

In [550]: np.array(df['a'].to_list())
Out[550]: 
array([[[1, 2],
        [3, 4]],

       [[1, 2],
        [3, 4]]])
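
If every column holds equally shaped nested lists, the same conversion could be applied to each column in turn. This is only a sketch: the name arrays is made up here, and it assumes none of the lists are ragged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [[[1, 2], [3, 4]], [[1, 2], [3, 4]]],
    "b": [[[1, 2], [3, 6]], [[1, 2], [3, 4]]],
})

# Convert each column's nested lists into one numeric array
# (assumes every cell in a column has the same shape)
arrays = {name: np.array(col.to_list()) for name, col in df.items()}

for name, arr in arrays.items():
    print(name, arr.shape, arr.dtype)
```

Each value in arrays should then have shape (2, 2, 2) with an integer dtype instead of object.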

For the object-dtype array version, you need np.stack to combine the elements along a new leading axis:

In [551]: np.stack(df['a'].to_numpy())
Out[551]: 
array([[[1, 2],
        [3, 4]],

       [[1, 2],
        [3, 4]]])

A different concatenation method, np.vstack, joins the elements along the existing first axis instead, giving shape (4, 2):

In [552]: np.vstack(df['a'].to_numpy())
Out[552]: 
array([[1, 2],
       [3, 4],
       [1, 2],
       [3, 4]])
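
To make the difference between the two concrete, here is a sketch comparing the resulting shapes. The variable names are illustrative, and it assumes every cell in the column has the same shape.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [[[1, 2], [3, 4]], [[1, 2], [3, 4]]],
})

obj_arr = df["a"].to_numpy()    # 1d object array of nested lists

stacked = np.stack(obj_arr)     # joins along a new axis 0
vstacked = np.vstack(obj_arr)   # joins along the existing axis 0

print(stacked.shape)
print(vstacked.shape)

# vstack here matches concatenating the (2, 2) blocks row-wise
concatenated = np.concatenate(df["a"].to_list(), axis=0)
print(np.array_equal(vstacked, concatenated))

# Regrouping the rows of the vstack result recovers the stacked array
print(np.array_equal(vstacked.reshape(stacked.shape), stacked))
```

np.stack keeps one block per row of the frame, which is usually what you want when the goal is a (2,2,2) array.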