How to work with NumPy arrays inside pandas DataFrames

I have a pandas DataFrame whose cells contain NumPy arrays.

import pandas as pd
import numpy as np

# Random 2x3 array of 0s and 1s for each cell (the argument x is unused)
r = lambda x: np.random.randint(0, 2, (2, 3), dtype=np.uint8)

# Row index levels and two-level column labels
riters = (['PartA','PartB'],['T1','T2'],['V1','V2'],['A1','A2'])
ctupls = [('data','stageA'),('data','stageB'),('data','stageC'),('Stage','')]

rindex = pd.MultiIndex.from_product(riters, names=['ID','Temp','VCC','Array'])
cindex = pd.MultiIndex.from_tuples(ctupls, names=[None,'READ'])

# One array per cell: 16 rows x 4 columns
dvals = [[r(0) for i in range(len(cindex))] for j in range(len(rindex))]

TST = pd.DataFrame(data=dvals, index=rindex, columns=cindex)

This gives:

TST
Out[187]: 
                                        data  ...                   Stage
READ                                  stageA  ...                        
ID    Temp VCC Array                          ...                        
PartA T1   V1  A1     [[0, 1, 1], [1, 1, 0]]  ...  [[0, 1, 1], [0, 0, 0]]
               A2     [[0, 0, 1], [0, 0, 1]]  ...  [[0, 0, 0], [0, 0, 0]]
           V2  A1     [[1, 1, 0], [0, 1, 0]]  ...  [[1, 1, 0], [1, 0, 0]]
               A2     [[1, 1, 1], [0, 0, 1]]  ...  [[0, 0, 0], [0, 1, 0]]
      T2   V1  A1     [[0, 1, 0], [1, 1, 1]]  ...  [[1, 0, 0], [1, 1, 1]]
               A2     [[1, 1, 0], [0, 1, 1]]  ...  [[0, 0, 0], [0, 1, 0]]
           V2  A1     [[1, 0, 1], [0, 0, 0]]  ...  [[0, 1, 0], [0, 1, 1]]
               A2     [[1, 1, 0], [1, 1, 1]]  ...  [[0, 0, 0], [1, 0, 0]]
PartB T1   V1  A1     [[1, 1, 0], [0, 1, 0]]  ...  [[0, 0, 1], [1, 0, 0]]
               A2     [[1, 1, 0], [0, 1, 0]]  ...  [[0, 1, 1], [0, 0, 0]]
           V2  A1     [[1, 0, 1], [1, 0, 1]]  ...  [[0, 1, 1], [0, 0, 1]]
               A2     [[1, 0, 0], [1, 1, 0]]  ...  [[0, 0, 1], [1, 0, 1]]
      T2   V1  A1     [[1, 0, 1], [0, 1, 0]]  ...  [[0, 1, 1], [0, 1, 1]]
               A2     [[0, 0, 0], [1, 1, 1]]  ...  [[1, 0, 1], [1, 1, 0]]
           V2  A1     [[1, 0, 0], [0, 0, 1]]  ...  [[0, 1, 0], [0, 1, 1]]
               A2     [[1, 1, 0], [1, 0, 0]]  ...  [[1, 1, 1], [1, 1, 0]]

[16 rows x 4 columns]

I want to perform mathematical operations on the NumPy arrays, grouped in different ways. As a simple first test I tried to plot a histogram of all the array values in column ('data','stageA'), but got an error, presumably because the column holds NumPy arrays rather than plain numeric values:

TST.hist(column=('data','stageA'))
Traceback (most recent call last):

  File "C:\Users\foo\AppData\Local\Temp/ipykernel_28612/8429155.py", line 1, in <module>
    TST.hist(column=('data','stageA'))

  File "C:\Users\foo\Anaconda3\lib\site-packages\pandas\plotting\_core.py", line 226, in hist_frame
    return plot_backend.hist_frame(

  File "C:\Users\foo\Anaconda3\lib\site-packages\pandas\plotting\_matplotlib\hist.py", line 443, in hist_frame
    raise ValueError(

ValueError: hist method requires numerical or datetime columns, nothing to plot.

What is the workaround for this? I want to keep the arrays as Numpy arrays so I can perform Numpy operations on them.

CodePudding user response：

Your DataFrame is quite complex. I suggest pulling the column out into NumPy and handling the data there, with something like:

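# Join the 2x3 arrays from every row of the column into a single array
temp = np.concatenate(TST['data', 'stageA'].tolist())
# np.histogram flattens its input and returns (counts, bin_edges)
np.histogram(temp, bins=2)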
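
To illustrate why hist complains and what the flattening does, here is a minimal self-contained sketch. It does not use the original TST: `df`, `cells`, `flat` and the seeded generator are made-up stand-ins, and the single column "stageA" only mimics one column of the question's frame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical stand-in for one column of TST: 16 cells, each a 2x3 uint8 array
cells = [rng.integers(0, 2, size=(2, 3), dtype=np.uint8) for _ in range(16)]
df = pd.DataFrame({"stageA": pd.Series(cells)})

# pandas stores the arrays as generic Python objects, so the column dtype is
# object rather than numeric, which is why hist finds nothing to plot
print(df["stageA"].dtype)

# Flatten every cell and join the pieces into one 1-D numeric array
flat = np.concatenate([a.ravel() for a in df["stageA"]])

# Two bins over values that are only 0 or 1: one count per value
counts, edges = np.histogram(flat, bins=2)
print(counts, edges)
```

If a plot is wanted rather than counts, the flattened array can presumably be handed to any plotting function that accepts a 1-D numeric array.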

CodePudding user response：

You can always recover the underlying NumPy data from a DataFrame with .values (or the equivalent .to_numpy()). So you can either keep ordinary DataFrames and extract their arrays when needed, or query the DataFrame for a specific "dataset" first and then operate on the result with NumPy via .values.
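
A rough sketch of that second approach, assuming a column shaped like the one in the question. One caveat: on a column of arrays, .values returns a 1-D object array whose elements are the individual 2x3 arrays, so the sketch stacks them into a single 3-D array before doing any maths. The names `col`, `stacked` and `per_part` are invented for illustration, and the random data only stands in for the real measurements.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Same row index layout as in the question
rindex = pd.MultiIndex.from_product(
    (['PartA', 'PartB'], ['T1', 'T2'], ['V1', 'V2'], ['A1', 'A2']),
    names=['ID', 'Temp', 'VCC', 'Array'],
)

# Hypothetical stand-in for TST['data', 'stageA']: one 2x3 array per row
col = pd.Series(
    [rng.integers(0, 2, size=(2, 3), dtype=np.uint8) for _ in range(len(rindex))],
    index=rindex,
    name='stageA',
)

# .values gives an object array of arrays; stack it into shape (16, 2, 3)
stacked = np.stack(col.tolist())
print(stacked.shape)

# Element-wise mean over all rows, still a 2x3 array
print(stacked.mean(axis=0))

# Grouped operation: element-wise sum of the arrays for each value of 'ID'
per_part = {
    key: np.stack(grp.tolist()).sum(axis=0)
    for key, grp in col.groupby(level='ID')
}
print(per_part['PartA'])
```

The DataFrame (here a Series) is used only for labelling and selection, and the arithmetic happens on ordinary NumPy arrays. Grouping by 'Temp' or 'VCC' instead should only require changing the level name.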