generate dummy data

np.random.seed(42)
df = pd.DataFrame({'subject': ['A'] * 10   ['B'] * 10,
                   'trial': list(range(5)) * 4,
                   'value1': np.random.randint(0, 100, 20),
                   'value2': np.random.randint(0, 100, 20)
                   })
df = df.set_index(['subject', 'trial']).sort_index()
print(df)


               value1  value2
subject trial                
A       0          51       1
        0          20      75
        1          92      63
        1          82      57
        2          14      59
        2          86      21
        3          71      20
        3          74      88
        4          60      32
        4          74      48
B       0          87      90
        0          52      79
        1          99      58
        1           1      14
        2          23      41
        2          87      61
        3           2      91
        3          29      61
        4          21      59
        4          37      46

Notice: Each subject / trial combination has multiple rows.

I want to create a array with the rows as nested dimensions.

My (as I find ugly) data transformation via list

tmp=list()
for idx in df.index.unique():
   tmp.append(df.loc[idx].to_numpy())
goal = np.array(tmp)
print(goal)


[[[51  1]
  [20 75]]

...

 [[21 59]
  [37 46]]]

Can you show me a native pandas / numpy way to do it (without the list crutch)?

CodePudding user response：

To be able to generate a non-ragged numpy array, the number of duplicates must be equal for all values. Thus you don't have to loop over them. Just find out the number and reshape

n = len(df)/(~df.index.duplicated()).sum()

assert n.is_integer()

out = df.to_numpy().reshape(-1, df.shape[1], int(n))

Output:

array([[[51,  1],
        [20, 75]],

       [[92, 63],
        [82, 57]],

       [[14, 59],
        [86, 21]],

       [[71, 20],
        [74, 88]],

       [[60, 32],
        [74, 48]],

       [[87, 90],
        [52, 79]],

       [[99, 58],
        [ 1, 14]],

       [[23, 41],
        [87, 61]],

       [[ 2, 91],
        [29, 61]],

       [[21, 59],
        [37, 46]]])

CodePudding user response：

You can use stack:

<code>df.stack().values
</code>
Output:
<code>array([[ 0, 25],
       [16, 11],
       [49, 87],
       [38, 77],
       [67,  6],
       [27, 27],
       [40,  0],
       [22, 81],
       [83, 89],
       [36, 55],
       [41,  1],
       [13, 74],
       [88, 61],
       [85, 73],
       [55, 66],
       [44, 82],
       [20, 30],
       [82, 69],
       [37, 71],
       [30, 16],
       [81, 96],
       [ 0, 56],
       [ 5, 99],
       [73, 86]], dtype=int64)
</code>