I have a data set like this:
seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
B 7 2 9 2 1 9 2 1 1
C NaN 4 4 2 4 NaN 2 6 8
D 5 NaN NaN 2 5 9 NaN 1 1
I want to get a data frame that:
(1) calculates the mean of all the columns with T01 in them
(2) gets the mean per S-number except for T01 (i.e. get the mean of T02 and T03, for each S field)
(3) get the mean of the list of numbers returned from step 2 (i.e. step 2 will return a list of means, one for each S-number, i then want the mean of that list).
So the output for above would be:
T0_means mean_of_other_means
B 3.6 3.83
C 1.3 4.33
D 2.3 2.6
(i just in my head changed the NaNs to 0 for averaging).
I'm getting stuck at the first step, I wrote:
import sys
import pandas as pd
df = pd.read_csv('fat_norm_extracted.csv',sep=',')
list_cols_to_keep = ['S01-T01','S02-T01','S03-T01']
df = df.loc[df['column_name'].isin(list_of_cols_to_keep)]
print(df)
And the error is:
Traceback (most recent call last):
File "calculate_averages.py", line 6, in <module>
df = df.loc[df['column_name'].isin(list_of_cols_to_keep)]
File "/home/slowat/.conda/envs/embedding_nlp/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/slowat/.conda/envs/embedding_nlp/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 'column_name'
I know what the error means, that column name is being taken as a string, but not how to fix it. Could someone show me a way around this?
CodePudding user response:
The line
df = df.loc[df['column_name'].isin(list_of_cols_to_keep)]
Is filtering the rows of df
where the values of the column named column_name
are in the list of value of list_of_cols_to_keep
.
If you want to select the columns, you can do :
df = df.loc[:, list_of_cols_to_keep]
Where :
is for all rows.
Otherwise you can also use :
df = df.filter(list_of_cols_to_keep)
CodePudding user response:
You can use str.contains
to flag the column names that includes T01
via boolean mask msk
. Then filter the columns using loc
and find mean
across columns for T01_means
. For mean_of_other_means
, you can use msk
by using groupby.cumsum
on it to create groups; then use groupby.mean
across columns to find group means; then use mean
yet again to find the mean of means:
df = df.set_index('seq').fillna(0)
msk = df.columns.str.contains('T01')
df['T0_means'] = df.loc[:, msk].mean(axis=1)
df['mean_of_other_means'] = df.drop(columns='T0_means').loc[:, ~msk].groupby(msk.cumsum()[~msk], axis=1).mean().mean(axis=1)
df = df.reset_index()
Output:
seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03 T0_means mean_of_other_means
0 B 7.0 2.0 9.0 2 1 9.0 2.0 1 1 3.666667 3.833333
1 C 0.0 4.0 4.0 2 4 0.0 2.0 6 8 1.333333 4.333333
2 D 5.0 0.0 0.0 2 5 9.0 0.0 1 1 2.333333 2.666667