Nested structured array to pandas dataframe with new column names

Time:12-23

How can I convert/explode a nested numpy structured array into a pandas dataframe, while keeping the headers from the nested arrays?

Using Python 3.8.3, numpy 1.18.5, pandas 1.3.4.

Example structured array

I am given a nested numpy structured array that looks like this; I am rebuilding it here only to provide an MRE.

import numpy as np
import numpy.lib.recfunctions as rfn
import pandas as pd

arr1 = np.array([4, 5, 4, 5])
arr2 = np.array([0, 0, -1, -1])
arr3 = np.array([0.51, 0.89, 0.59, 0.94])
arr4 = np.array(
    [[0.52, 0.80, 0.62, 1.1], [0.41, 0.71, 0.46, 0.77], [0.68, 1.12, 0.78, 1.19]]
).T
arr5 = np.repeat(np.array([0.6, 0.2, 0.2]), 4).reshape(3, 4).T
arrs = (arr1, arr2, arr3, arr4, arr5)

dtypes = [
    ("state", "f8"),
    ("variability", "f8"),
    ("target", "f8"),
    ("measured", [("mean", "f8"), ("low", "f8"), ("hi", "f8")]),
    ("var", [("mid", "f8"), ("low", "f8"), ("hi", "f8")]),
]

example = np.column_stack(arrs)
example = rfn.unstructured_to_structured(example, dtype=np.dtype(dtypes))

Inspect example array

print(example)
print(example.dtype.names)
[(4.,  0., 0.51, (0.52, 0.41, 0.68), (0.6, 0.2, 0.2))
 (5.,  0., 0.89, (0.8 , 0.71, 1.12), (0.6, 0.2, 0.2))
 (4., -1., 0.59, (0.62, 0.46, 0.78), (0.6, 0.2, 0.2))
 (5., -1., 0.94, (1.1 , 0.77, 1.19), (0.6, 0.2, 0.2))]
('state', 'variability', 'target', 'measured', 'var')
print(example["measured"].dtype.names)

('mean', 'low', 'hi')

print(example["var"].dtype.names)

('mid', 'low', 'hi')

Desired pandas dataframe

state  variability  target  measured_mean  measured_low  measured_hi  var_mid  var_low  var_hi
    4            0    0.51           0.52          0.41         0.68      0.6      0.2     0.2
    5            0    0.89            0.8          0.71         1.12      0.6      0.2     0.2
    4           -1    0.59           0.62          0.46         0.78      0.6      0.2     0.2
    5           -1    0.94            1.1          0.77         1.19      0.6      0.2     0.2

Attempts

test = pd.DataFrame(example)
print(test)
   state  variability  target            measured              var
0    4.0          0.0    0.51  (0.52, 0.41, 0.68)  (0.6, 0.2, 0.2)
1    5.0          0.0    0.89   (0.8, 0.71, 1.12)  (0.6, 0.2, 0.2)
2    4.0         -1.0    0.59  (0.62, 0.46, 0.78)  (0.6, 0.2, 0.2)
3    5.0         -1.0    0.94   (1.1, 0.77, 1.19)  (0.6, 0.2, 0.2)

How do I unpack the measured and var columns so that the new column names combine the outer and nested field names of the structured array, as shown above?

CodePudding user response:

Hmm, I did not find an elegant solution, but this brute-force approach seems to work:

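# Start from the naive frame, where each nested field is a column of tuples
df1 = pd.DataFrame(example)

lodf = []         # one expanded frame per nested field
cols_to_use = []  # names of the plain (non-nested) fields
for col_name in example.dtype.names:
    sub_names = example[col_name].dtype.names
    if sub_names:
        # Expand the tuples into columns named <field>_<subfield>
        df0 = pd.DataFrame(
            list(df1[col_name].values),
            columns=[f"{col_name}_{c0}" for c0 in sub_names],
        )
        lodf.append(df0)
    else:
        cols_to_use.append(col_name)

# Plain columns first, then the expanded nested ones, side by side
result = pd.concat([df1[cols_to_use]] + lodf, axis=1)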
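A possible alternative is to skip the intermediate frame of tuples and read each nested field directly by name from the structured array. The following is only a minimal sketch, assuming a single level of nesting; `flatten_structured`, its `sep` parameter, and the small `small` array are hypothetical names made up for illustration, not part of numpy or pandas.

```python
import numpy as np
import pandas as pd


def flatten_structured(arr, sep="_"):
    """Build a DataFrame with one column per leaf field (one nesting level only)."""
    columns = {}
    for name in arr.dtype.names:
        field = arr[name]
        if field.dtype.names:
            # Nested structured field: pull out each sub-field by name
            for sub in field.dtype.names:
                columns[f"{name}{sep}{sub}"] = field[sub]
        else:
            # Plain field: keep it under its own name
            columns[name] = field
    return pd.DataFrame(columns)


# Tiny stand-in for the array in the question (hypothetical data)
dtype = np.dtype([("state", "f8"), ("measured", [("mean", "f8"), ("low", "f8")])])
small = np.array([(4.0, (0.52, 0.41)), (5.0, (0.80, 0.71))], dtype=dtype)

flat = flatten_structured(small)
print(flat)
```

Because the dict is filled in dtype order, the columns come out in the order the fields are declared, which may differ from the loop above when a nested field precedes a plain one. Calling `flatten_structured(example)` on the array from the question should give the nine columns listed under the desired dataframe.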