I have a pandas DataFrame whose columns contain NumPy arrays (one 2x3 array per cell).
import pandas as pd
import numpy as np
r = lambda x: (np.random.randint(0,2,(2,3),dtype=np.uint8))
riters = (['PartA','PartB'],['T1','T2'],['V1','V2'],['A1','A2'])
ctupls = [('data','stageA'),('data','stageB'),('data','stageC'),('Stage','')]
rindex = pd.MultiIndex.from_product(riters, names=['ID','Temp','VCC','Array'])
cindex = pd.MultiIndex.from_tuples(ctupls, names=[None,'READ'])
dvals = [[r(0) for i in range(len(cindex))] for j in range(len(rindex))]
TST = pd.DataFrame(data=dvals, index=rindex, columns=cindex)
This gives:
TST
Out[187]:
data ... Stage
READ stageA ...
ID Temp VCC Array ...
PartA T1 V1 A1 [[0, 1, 1], [1, 1, 0]] ... [[0, 1, 1], [0, 0, 0]]
A2 [[0, 0, 1], [0, 0, 1]] ... [[0, 0, 0], [0, 0, 0]]
V2 A1 [[1, 1, 0], [0, 1, 0]] ... [[1, 1, 0], [1, 0, 0]]
A2 [[1, 1, 1], [0, 0, 1]] ... [[0, 0, 0], [0, 1, 0]]
T2 V1 A1 [[0, 1, 0], [1, 1, 1]] ... [[1, 0, 0], [1, 1, 1]]
A2 [[1, 1, 0], [0, 1, 1]] ... [[0, 0, 0], [0, 1, 0]]
V2 A1 [[1, 0, 1], [0, 0, 0]] ... [[0, 1, 0], [0, 1, 1]]
A2 [[1, 1, 0], [1, 1, 1]] ... [[0, 0, 0], [1, 0, 0]]
PartB T1 V1 A1 [[1, 1, 0], [0, 1, 0]] ... [[0, 0, 1], [1, 0, 0]]
A2 [[1, 1, 0], [0, 1, 0]] ... [[0, 1, 1], [0, 0, 0]]
V2 A1 [[1, 0, 1], [1, 0, 1]] ... [[0, 1, 1], [0, 0, 1]]
A2 [[1, 0, 0], [1, 1, 0]] ... [[0, 0, 1], [1, 0, 1]]
T2 V1 A1 [[1, 0, 1], [0, 1, 0]] ... [[0, 1, 1], [0, 1, 1]]
A2 [[0, 0, 0], [1, 1, 1]] ... [[1, 0, 1], [1, 1, 0]]
V2 A1 [[1, 0, 0], [0, 0, 1]] ... [[0, 1, 0], [0, 1, 1]]
A2 [[1, 1, 0], [1, 0, 0]] ... [[1, 1, 1], [1, 1, 0]]
[16 rows x 4 columns]
I want to perform mathematical operations on the NumPy arrays, grouped in different ways. As a simple first test I tried to plot a histogram of all the array values in column ('data','stageA'), but got an error, presumably because each cell of the DataFrame holds a NumPy array rather than a single numeric value:
TST.hist(column=('data','stageA'))
Traceback (most recent call last):
File "C:\Users\foo\AppData\Local\Temp/ipykernel_28612/8429155.py", line 1, in <module>
TST.hist(column=('data','stageA'))
File "C:\Users\foo\Anaconda3\lib\site-packages\pandas\plotting\_core.py", line 226, in hist_frame
return plot_backend.hist_frame(
File "C:\Users\foo\Anaconda3\lib\site-packages\pandas\plotting\_matplotlib\hist.py", line 443, in hist_frame
raise ValueError(
ValueError: hist method requires numerical or datetime columns, nothing to plot.
What is the workaround for this? I want to keep the arrays as Numpy arrays so I can perform Numpy operations on them.
CodePudding user response:
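Your DataFrame is quite complex: every cell holds a whole array, so the columns have object dtype and DataFrame.hist finds no numeric column to plot. I suggest you switch to NumPy to handle the data, with something like:
# Join the 2x3 arrays from every row of the column into one array
temp = np.concatenate(list(TST['data', 'stageA'].to_numpy()))
# The values are only 0 and 1, so two bins are enough (np.histogram flattens its input)
np.histogram(temp, bins=2)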
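To spell the idea out, here is a minimal sketch. It assumes a frame built the same way as in the question, only with a smaller index to keep it short; the seeded generator and the names `col`, `flat` and `counts` are illustrative and not part of the original code. It first shows that the column has object dtype, then flattens every cell into one 1-D vector and counts the values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible (assumption)

# Small stand-in for the frame in the question: each cell holds a 2x3 uint8 array.
rindex = pd.MultiIndex.from_product([['PartA', 'PartB'], ['T1', 'T2']], names=['ID', 'Temp'])
cindex = pd.MultiIndex.from_tuples([('data', 'stageA'), ('data', 'stageB')], names=[None, 'READ'])
cells = [[rng.integers(0, 2, (2, 3), dtype=np.uint8) for _ in cindex] for _ in rindex]
TST = pd.DataFrame(data=cells, index=rindex, columns=cindex)

# The column dtype is object, not numeric, which is why DataFrame.hist has nothing to plot.
print(TST[('data', 'stageA')].dtype)

# .to_numpy() returns a 1-D object array whose elements are the 2x3 arrays.
col = TST[('data', 'stageA')].to_numpy()

# Flatten each cell and join everything into one 1-D vector of plain numbers.
flat = np.concatenate([a.ravel() for a in col])

# The values are only 0 or 1, so bincount gives the histogram counts directly.
counts = np.bincount(flat, minlength=2)
print(counts)
```

If a plot is wanted, something like pd.Series(flat).hist(bins=2) should then work, since that Series holds plain numbers rather than arrays (this needs matplotlib installed).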
CodePudding user response:
You can always recover the underlying NumPy arrays from a DataFrame with .values. So you can either use normal DataFrames and extract their NumPy arrays when desired, or operate with NumPy using .values after querying the DataFrame for a specific "dataset".
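As a rough sketch of that second route, assuming a frame built like the one in the question (a smaller index is used here, and the names `stage_a`, `part_a`, `cube` and `per_id` are made up for illustration): query the frame, pull out the arrays, stack them into one regular NumPy array, and do the math there. Note that the pandas documentation now recommends .to_numpy() over .values; both return the same object array in this case.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible (assumption)

# Small stand-in for the frame in the question: each cell holds a 2x3 uint8 array.
rindex = pd.MultiIndex.from_product([['PartA', 'PartB'], ['T1', 'T2']], names=['ID', 'Temp'])
cindex = pd.MultiIndex.from_tuples([('data', 'stageA'), ('data', 'stageB')], names=[None, 'READ'])
cells = [[rng.integers(0, 2, (2, 3), dtype=np.uint8) for _ in cindex] for _ in rindex]
TST = pd.DataFrame(data=cells, index=rindex, columns=cindex)

stage_a = TST[('data', 'stageA')]

# Query one "dataset" (all rows with ID == 'PartA') and extract its arrays.
part_a = stage_a.xs('PartA', level='ID').values   # 1-D object array of 2x3 arrays

# Stack the cells into one regular array of shape (n_rows, 2, 3).
cube = np.stack(list(stage_a))
print(cube.shape)

# Grouped math: element-wise sum of the arrays within each ID group.
per_id = {key: np.stack(list(grp)).sum(axis=0)
          for key, grp in stage_a.groupby(level='ID')}
print(per_id['PartA'])
```

The dictionary keeps each group result as a real NumPy array, so any further NumPy operation (mean, comparison, bitwise ops) can be applied to it directly.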