Given this df
from io import StringIO
import pandas as pd
data = StringIO('''gene_variant gene val1 val2 val3
b1 b 1 1 1
b2 b 2 1 1
b3 b 3 1 1
c2 c 1 1 1
t1 t 1 1 1
t2 t 2 2 2
t4 t 2 3 2
t5 t 1 4 3
d2 d 1 1 2
d4 d 1 1 1''')
df = pd.read_csv(data, sep='\t')
How do I get the gene_variant for each gene where; the gene_variant corresponds to the max value for val1 if the max value is not duplicated, and if it is duplicated, the gene_variant corresponds to the max value for val2 if the max value for val2 is not duplicated, or then just max for val3? I.e., any tiebreakers are decided by the next column until the third option.
I've been trying solutions based on:
df.groupby('gene').agg(max)
and:
df.groupby('gene').rank('max')
But I can't get there without dropping out into iteration...
The correct answer would be:
b3 3
c2 1
t5 4
d2 2
Thanks in advance!
CodePudding user response:
If need maximum only for groups with no duplicated values is possible use:
df1 = df.groupby('gene').transform('nunique')
#if maximum is count from all columns if not duplicated values
max1 = df.where(df1.eq(df1.pop('gene_variant'), axis=0)).max(axis=1)
#if max is count by order - first val1, then val2 ...
max1 = df.where(df1.eq(df1.pop('gene_variant'), axis=0)).bfill(axis=1).iloc[:, 0]
df = df.assign(max1 = max1)
df = df.loc[df.groupby('gene', sort=False)['max1'].idxmax(), ['gene_variant','max1']]
print (df)
gene_variant max1
2 b3 3.0
3 c2 1.0
7 t5 4.0
8 d2 2.0
CodePudding user response:
You could use .sort_values()
to get the maximum values. If you pass it multiple columns, it will treat tiebrakers correctly.
In [9]: df.sort_values(["val1", "val2", "val3"])
Out[9]:
gene_variant gene val1 val2 val3
0 b1 b 1 1 1
3 c2 c 1 1 1
4 t1 t 1 1 1
9 d4 d 1 1 1
8 d2 d 1 1 2
7 t5 t 1 4 3
1 b2 b 2 1 1
5 t2 t 2 2 2
6 t4 t 2 3 2
2 b3 b 3 1 1
Now, in order to do this for each gene you can groupby('gene')
and apply a custom function.
In [11]: df.groupby("gene").apply(
...: lambda _df: _df.sort_values(["val1", "val2", "val3"], ascending=False)
...: .head(1)
...: .squeeze()
...: )
Out[11]:
gene_variant gene val1 val2 val3
gene
b b3 b 3 1 1
c c2 c 1 1 1
d d2 d 1 1 2
t t4 t 2 3 2
However, this is not telling you which val
it was that won the tiebraker.