Group pandas columns by word in common in column name?-CodePudding

I have a data set like this:

seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
B   7         2       9       2       1       9       2         1       1 
C   NaN       4       4       2       4       NaN     2         6       8
D   5         NaN     NaN     2       5       9       NaN       1       1

I want to get a data frame that:

(1) calculates the mean of all the columns with T01 in them
(2) gets the mean per S-number except for T01 (i.e. get the mean of T02 and T03, for each S field)
(3) get the mean of the list of numbers returned from step 2 (i.e. step 2 will return a list of means, one for each S-number, i then want the mean of that list).

So the output for above would be:

   T0_means    mean_of_other_means
B  3.6         3.83
C  1.3         4.33
D  2.3         2.6

(i just in my head changed the NaNs to 0 for averaging).

I'm getting stuck at the first step, I wrote:

import sys
import pandas as pd

df = pd.read_csv('fat_norm_extracted.csv',sep=',')
list_cols_to_keep = ['S01-T01','S02-T01','S03-T01']
df = df.loc[df['column_name'].isin(list_of_cols_to_keep)]
print(df)

And the error is:

Traceback (most recent call last):
  File "calculate_averages.py", line 6, in <module>
    df = df.loc[df['column_name'].isin(list_of_cols_to_keep)]
  File "/home/slowat/.conda/envs/embedding_nlp/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/slowat/.conda/envs/embedding_nlp/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'column_name'

I know what the error means, that column name is being taken as a string, but not how to fix it. Could someone show me a way around this?

CodePudding user response：

The line

df = df.loc[df['column_name'].isin(list_of_cols_to_keep)]

Is filtering the rows of df where the values of the column named column_name are in the list of value of list_of_cols_to_keep.

If you want to select the columns, you can do :

df = df.loc[:, list_of_cols_to_keep]

Where : is for all rows.

Otherwise you can also use :

df = df.filter(list_of_cols_to_keep)

CodePudding user response：

You can use str.contains to flag the column names that includes T01 via boolean mask msk. Then filter the columns using loc and find mean across columns for T01_means. For mean_of_other_means, you can use msk by using groupby.cumsum on it to create groups; then use groupby.mean across columns to find group means; then use mean yet again to find the mean of means:

df = df.set_index('seq').fillna(0)
msk = df.columns.str.contains('T01')
df['T0_means'] = df.loc[:, msk].mean(axis=1)
df['mean_of_other_means'] = df.drop(columns='T0_means').loc[:, ~msk].groupby(msk.cumsum()[~msk], axis=1).mean().mean(axis=1)
df = df.reset_index()

Output:

  seq  S01-T01  S01-T02  S01-T03  S02-T01  S02-T02  S02-T03  S03-T01  S03-T02  S03-T03  T0_means  mean_of_other_means
0   B      7.0      2.0      9.0        2        1      9.0      2.0        1        1  3.666667             3.833333
1   C      0.0      4.0      4.0        2        4      0.0      2.0        6        8  1.333333             4.333333
2   D      5.0      0.0      0.0        2        5      9.0      0.0        1        1  2.333333             2.666667