Home > Software engineering >  Group pandas columns by word in common in column name?
Group pandas columns by word in common in column name?

Time:05-05

I have a data set like this:

seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
B   7         2       9       2       1       9       2         1       1 
C   NaN       4       4       2       4       NaN     2         6       8
D   5         NaN     NaN     2       5       9       NaN       1       1 

I want to get a data frame that:

(1) calculates the mean of all the columns with T01 in them
(2) gets the mean per S-number except for T01 (i.e. get the mean of T02 and T03, for each S field)
(3) get the mean of the list of numbers returned from step 2 (i.e. step 2 will return a list of means, one for each S-number, i then want the mean of that list).

So the output for above would be:

   T0_means    mean_of_other_means
B  3.6         3.83
C  1.3         4.33
D  2.3         2.6

(i just in my head changed the NaNs to 0 for averaging).

I'm getting stuck at the first step, I wrote:

import sys
import pandas as pd

df = pd.read_csv('fat_norm_extracted.csv',sep=',')
list_cols_to_keep = ['S01-T01','S02-T01','S03-T01']
df = df.loc[df['column_name'].isin(list_of_cols_to_keep)]
print(df)

And the error is:

Traceback (most recent call last):
  File "calculate_averages.py", line 6, in <module>
    df = df.loc[df['column_name'].isin(list_of_cols_to_keep)]
  File "/home/slowat/.conda/envs/embedding_nlp/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/slowat/.conda/envs/embedding_nlp/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'column_name'

I know what the error means, that column name is being taken as a string, but not how to fix it. Could someone show me a way around this?

CodePudding user response:

The line

df = df.loc[df['column_name'].isin(list_of_cols_to_keep)]

Is filtering the rows of df where the values of the column named column_name are in the list of value of list_of_cols_to_keep.

If you want to select the columns, you can do :

df = df.loc[:, list_of_cols_to_keep]

Where : is for all rows.

Otherwise you can also use :

df = df.filter(list_of_cols_to_keep)

CodePudding user response:

You can use str.contains to flag the column names that includes T01 via boolean mask msk. Then filter the columns using loc and find mean across columns for T01_means. For mean_of_other_means, you can use msk by using groupby.cumsum on it to create groups; then use groupby.mean across columns to find group means; then use mean yet again to find the mean of means:

df = df.set_index('seq').fillna(0)
msk = df.columns.str.contains('T01')
df['T0_means'] = df.loc[:, msk].mean(axis=1)
df['mean_of_other_means'] = df.drop(columns='T0_means').loc[:, ~msk].groupby(msk.cumsum()[~msk], axis=1).mean().mean(axis=1)
df = df.reset_index()

Output:

  seq  S01-T01  S01-T02  S01-T03  S02-T01  S02-T02  S02-T03  S03-T01  S03-T02  S03-T03  T0_means  mean_of_other_means
0   B      7.0      2.0      9.0        2        1      9.0      2.0        1        1  3.666667             3.833333
1   C      0.0      4.0      4.0        2        4      0.0      2.0        6        8  1.333333             4.333333
2   D      5.0      0.0      0.0        2        5      9.0      0.0        1        1  2.333333             2.666667
  • Related