Home > Mobile >  Grouped Column Operations in Python using Pandas
Grouped Column Operations in Python using Pandas

Time:09-17

I have a data frame consisting of a .csv import that contains n number of trials. Trials are arranged by column with a header (wavelength1 for trial 1, wavelength2 for trial 2 etc.) We're tracking the absorption of a solution over time during a chemical reaction. You can see a SS of the excel file in the link. Trials are grouped in to threes (with g of sugar being the IDV and the absorbance in nm being the DV). For each trial:

  1. I need to determine what the maximum and minimum values are. This can of course be done using max() and min() but when we are sampling every 0.25 seconds, the data can be noisy, meaning that I have to smooth it out. I have already built a function to do that. We're also probably just going to sample every one second as it's much smoother anyway.
  2. Each group of three trials needs to be plotted on the same graph for comparison. n number of trials will create n/3 graphs.

I'm coming from an intermediate background in MATLAB. This is not something I was ever able to figure out in there, either.

What have I done so far?

I have attempted to make a list out of the header for each trial, and then use use a for loop to move through the data using the df.column_name command:

data = pd.read_csv('data.csv')
col_name = data.columns.values
print(col_name)
for i in col_name:
    print(data.col_name[i])

The code works up to the 4th line, where it returns the error: AttributeError: 'DataFrame' object has no attribute 'col_name'. Here is where I would like to make a series or set (whatever it's called here) with all of the values from the wavelength1 trial to plot/manipulate/etc. It's worth noting that I have gotten the multiple plots and multiple lines to work manually: but I want to automate it as that's ofc the point of coding. Here's one out of four graphs of the 'manual' version:

import pandas as pd
import matplotlib.pyplot as plt
#import matplotlib as matplotlib
data = pd.read_csv('data.csv')

plt.close("all")

n_rows = 2
n_columns = 2

#initialize figure
figure_size = (30,15)
font_size = 13
f, ([plt1, plt2], [plt3, plt4]) = plt.subplots(n_rows,n_columns, figsize = figure_size)

#plot first three runs 
x=data.time1
y=data.wavelength1
plt1.plot(x,y, label='Trial 1')
x=data.time2
y=data.wavelength2
plt1.plot(x,y,label='Trial 2')
plt1.set_title('0.3g Glucose', fontweight="bold", size=font_size)
x=data.time3
y=data.wavelength3
plt1.plot(x,y,label='Trial 3')
plt1.set_ylabel('Wavelength (nm)', fontsize = font_size)
plt1.set_xlabel('Time (s)', fontsize = font_size)
plt1.legend(fontsize=font_size)

My first thought was just to do:

for i in range (0,num_col):
    plot(time,data.wavelength(i))

But this does not work. I'm sure it's something quite simple but it is escaping me.

Example data:

enter image description here

Final script

(Included for easy copy/paste.)

import pandas as pd
import matplotlib.pyplot as plt
import itertools


def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return itertools.zip_longest(fillvalue=fillvalue, *args)


def get_trials(df, column_group_names):
    """Get columns from dataframe, dropping missing values."""
    column_group = df[list(column_group_names)]
    column_group = column_group.dropna(how='all')
    return column_group


def subplot_axes_iterator(n_rows, n_columns):
    for i in range(n_rows):
        for j in range(n_columns):
            yield i, j



df = pd.read_csv("data.csv")

# Get max and min values
print("ANALYSIS OF MIN AND MAX VALUES")
analysis_df = pd.DataFrame()
analysis_df["min"] = df.min()
analysis_df["min_idx"] = df.idxmin()
analysis_df["max"] = df.max()
analysis_df["max_idx"] = df.idxmax()
print(analysis_df)


# Plot data
n_rows = 2
n_columns = 3
figure_size = (15, 10)
font_size = 13
fig, axes = plt.subplots(n_rows, n_columns, figsize=figure_size)

col_iterator = grouper(3, df.columns)
axes_iterator = subplot_axes_iterator(n_rows, n_columns)
plot_names = [
    "Group 1",
    "Group 2",
    "Group 3",
    "Group 4",
    "Group 5",
]

for column_group_names, axes_position, plot_name in \
        zip(col_iterator, axes_iterator, plot_names):
    print(f"plotting {column_group_names} at {axes_position}")
    column_group = get_trials(df, column_group_names)
    column_group.plot(ax=axes[axes_position])
    axes[axes_position].set_title(plot_name, fontweight="bold", size=font_size)
    axes[axes_position].set_xlabel("Time (s)", fontsize=font_size)
    axes[axes_position].set_ylabel("Wavelength (nm)", fontsize=font_size)
plt.tight_layout()
plt.show()
  • Related