I have a data frame produced by an application that overlaps mutations with genes. Large mutations can overlap more than one gene, so the data frame has one row per mutation and gene pair, like this:
mutation1 1gene_affected # mut1 only affected one gene
mutation2 1gene_affected # mut2 has affected 2 genes
mutation2 2gene_affected
mutation3 NO_gene_affected # there is also this. This can be filtered previously.
How can I count the following:
number of mutations that affect 1 gene,
number of mutations that affect 2 genes,
number of mutations that affect 3 genes,
number of mutations that affect 4 genes,
number of mutations that affect 5 genes,
number of mutations that affect > 5 but <10,
number of mutations that affect >10 but <20,
number of mutations that affect >30 genes,
I would like to save these values in variables and call a function I already created that saves statistics data in a file.
CodePudding user response:
Suppose the columns of your data frame are ["mutation", "gene"]. Calling value_counts on the mutation column gives the number of occurrences of each mutation, which is the number of genes it affects. A comparison method such as eq or ge is then enough to select the counts you need. For instance, to find all mutations affecting exactly X genes:
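# Number of rows (genes) per mutation; the result is indexed by mutation name
counts = df["mutation"].value_counts()
# Filter the counts themselves: the mask is not aligned with df's row index
print(counts[counts.eq(X)])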
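As a rough, self-contained sketch of the idea above: the sample frame, the column names "mutation" and "gene", and the variable names are assumptions made up for illustration. Rows marked NO_gene_affected are dropped first, since otherwise they would be counted as one affected gene.

```python
import pandas as pd

# Hypothetical sample data mirroring the layout shown in the question.
df = pd.DataFrame({
    "mutation": ["mutation1", "mutation2", "mutation2", "mutation3"],
    "gene": ["1gene_affected", "1gene_affected",
             "2gene_affected", "NO_gene_affected"],
})

# Drop rows where no gene was hit so they are not counted as one gene.
hits = df[df["gene"] != "NO_gene_affected"]

# One row per (mutation, gene) pair, so rows per mutation = genes per mutation.
genes_per_mutation = hits["mutation"].value_counts()

# Count how often each gene count occurs: number of mutations per gene count.
mutations_per_gene_count = genes_per_mutation.value_counts().sort_index()

# Names of the mutations that affect exactly X genes.
X = 2
mutations_with_X_genes = genes_per_mutation[genes_per_mutation.eq(X)].index.tolist()
```

Here mutations_per_gene_count is indexed by the number of genes, so a single value can be read with something like mutations_per_gene_count.get(1, 0) and stored in a variable.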
CodePudding user response:
Clean your second column, then use pd.cut:
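import numpy as np
import pandas as pd

# Turn 'NO_gene_affected' into '0gene_affected', then extract the leading number
count = df['mutation'].str.replace('NO_', '0') \
                      .str.extract(r'^(\d+)', expand=False).astype(int)
# With right=False every bin is left-closed, so '5 genes' actually covers [5, 10)
lbls = ['No gene', '1 gene', '2 genes', '3 genes', '4 genes', '5 genes',
        'between 10 and 20', 'between 20 and 30', 'more than 30 genes']
bins = [-np.inf, 1, 2, 3, 4, 5, 10, 20, 30, np.inf]
df['group'] = pd.cut(count, bins=bins, labels=lbls, right=False)
out = df.value_counts('group', sort=False)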
Output:
>>> out
group
No gene               1
1 gene                2
2 genes               1
3 genes               0
4 genes               0
5 genes               0
between 10 and 20     0
between 20 and 30     0
more than 30 genes    0
dtype: int64
Setup:
>>> df
        name          mutation
0  mutation1    1gene_affected
1  mutation2    1gene_affected
2  mutation2    2gene_affected
3  mutation3  NO_gene_affected
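If the goal is the number of mutations per gene-count range rather than the number of rows, one possible variation is to count genes per mutation first and only then bin with pd.cut. The sketch below runs under stated assumptions: the sample data, the column names "name" and "gene", and the bin edges (1, 2, 3, 4, 5, 6-9, 10-19, 20-29, 30+) are illustrative guesses, because the ranges in the question leave gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "m1" overlaps one gene, "m2" two genes, "m3" seven genes.
df = pd.DataFrame({
    "name": ["m1"] + ["m2"] * 2 + ["m3"] * 7,
    "gene": ["geneA", "geneB", "geneC"] + [f"gene{i}" for i in range(7)],
})

# Distinct genes per mutation (rows without a gene are assumed already filtered).
genes_per_mutation = df.groupby("name")["gene"].nunique()

# Left-closed bins: [1,2), [2,3), ..., [6,10), [10,20), [20,30), [30,inf).
bins = [1, 2, 3, 4, 5, 6, 10, 20, 30, np.inf]
labels = ["1", "2", "3", "4", "5", "6-9", "10-19", "20-29", "30+"]
groups = pd.cut(genes_per_mutation, bins=bins, labels=labels, right=False)

# Mutations per bin, keeping empty bins, as a plain dict of label -> count.
stats = groups.value_counts(sort=False).to_dict()
```

The resulting stats dict holds one value per range, so individual entries such as stats["6-9"] can be assigned to variables or the whole dict can be handed to an existing statistics-saving function.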