How can I convert/explode a nested numpy structured array into a pandas dataframe, while keeping the headers from the nested arrays?
Using Python 3.8.3, numpy 1.18.5, pandas 1.3.4.
Example structured array: I am given a nested numpy structured array that looks like this, and I am just rebuilding it here for an MRE.
import numpy as np
import numpy.lib.recfunctions as rfn
import pandas as pd
arr1 = np.array([4, 5, 4, 5])
arr2 = np.array([0, 0, -1, -1])
arr3 = np.array([0.51, 0.89, 0.59, 0.94])
arr4 = np.array(
[[0.52, 0.80, 0.62, 1.1], [0.41, 0.71, 0.46, 0.77], [0.68, 1.12, 0.78, 1.19]]
).T
arr5 = np.repeat(np.array([0.6, 0.2, 0.2]), 4).reshape(3, 4).T
arrs = (arr1, arr2, arr3, arr4, arr5)
dtypes = [
("state", "f8"),
("variability", "f8"),
("target", "f8"),
("measured", [("mean", "f8"), ("low", "f8"), ("hi", "f8")]),
("var", [("mid", "f8"), ("low", "f8"), ("hi", "f8")]),
]
example = np.column_stack(arrs)
example = rfn.unstructured_to_structured(example, dtype=np.dtype(dtypes))
Inspect example array
print(example)
print(example.dtype.names)
[(4., 0., 0.51, (0.52, 0.41, 0.68), (0.6, 0.2, 0.2))
(5., 0., 0.89, (0.8 , 0.71, 1.12), (0.6, 0.2, 0.2))
(4., -1., 0.59, (0.62, 0.46, 0.78), (0.6, 0.2, 0.2))
(5., -1., 0.94, (1.1 , 0.77, 1.19), (0.6, 0.2, 0.2))]
('state', 'variability', 'target', 'measured', 'var')
print(example["measured"].dtype.names)
('mean', 'low', 'hi')
print(example["var"].dtype.names)
('mid', 'low', 'hi')
Desired pandas dataframe
state | variability | target | measured_mean | measured_low | measured_hi | var_mid | var_low | var_hi |
---|---|---|---|---|---|---|---|---|
4 | 0 | 0.51 | 0.52 | 0.41 | 0.68 | 0.6 | 0.2 | 0.2 |
5 | 0 | 0.89 | 0.8 | 0.71 | 1.12 | 0.6 | 0.2 | 0.2 |
4 | -1 | 0.59 | 0.62 | 0.46 | 0.78 | 0.6 | 0.2 | 0.2 |
5 | -1 | 0.94 | 1.1 | 0.77 | 1.19 | 0.6 | 0.2 | 0.2 |
Attempts
test = pd.DataFrame(example)
print(test)
state variability target measured var
0 4.0 0.0 0.51 (0.52, 0.41, 0.68) (0.6, 0.2, 0.2)
1 5.0 0.0 0.89 (0.8, 0.71, 1.12) (0.6, 0.2, 0.2)
2 4.0 -1.0 0.59 (0.62, 0.46, 0.78) (0.6, 0.2, 0.2)
3 5.0 -1.0 0.94 (1.1, 0.77, 1.19) (0.6, 0.2, 0.2)
How do I unpack the measured and var columns and prefix each nested field name with its parent column name, as shown above, based on the rec array?
Answer:
Hmm, did not find an elegant solution, but this brute force seems to work:
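df1 = pd.DataFrame(example)
lodf = []  # one DataFrame per nested field
cols_to_use = []  # names of the plain (non-nested) fields
for col_name in example.dtype.names:
    if example[col_name].dtype.names:
        # Nested field: expand its tuples into columns named "<parent>_<child>"
        df0 = pd.DataFrame(
            list(df1[col_name].values),
            columns=[f"{col_name}_{c0}" for c0 in example[col_name].dtype.names],
        )
        lodf.append(df0)
    else:
        cols_to_use.append(col_name)
# Concatenate the plain columns with the expanded nested ones
df = pd.concat([df1[cols_to_use]] + lodf, axis=1)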
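If the nesting could go deeper than one level, a recursive variant of the same idea may be easier to reuse. This is only a sketch: flatten_fields is a hypothetical helper name, not part of numpy or pandas, and the small array below is a made-up two-row stand-in for the example array in the question.

```python
import numpy as np
import pandas as pd


def flatten_fields(arr, prefix=""):
    """Yield (column_name, 1-D array) pairs, descending into nested fields.

    Hypothetical helper for illustration; not a numpy or pandas function.
    """
    for name in arr.dtype.names:
        col = arr[name]
        if col.dtype.names:
            # Nested structured field: recurse and prefix with the parent name
            yield from flatten_fields(col, prefix=f"{prefix}{name}_")
        else:
            yield f"{prefix}{name}", col


# Small stand-in array with one nested field (assumed data, for illustration)
dtype = np.dtype([("state", "f8"), ("measured", [("mean", "f8"), ("low", "f8")])])
small = np.array([(4.0, (0.52, 0.41)), (5.0, (0.80, 0.71))], dtype=dtype)

# dict preserves insertion order, so columns follow the dtype's field order
df = pd.DataFrame(dict(flatten_fields(small)))
print(df.columns.tolist())
```

Calling pd.DataFrame(dict(flatten_fields(example))) on the array from the question should produce the measured_* and var_* column names the same way.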