Iterate over two columns and count how many values in one column match with exact values in the seco

I have a data frame produced by an application that overlaps mutations with genes. Large mutations can sometimes overlap more than one gene, so the data frame is structured like this:

mutation1        1gene_affected # mut1 only affected one gene
mutation2        1gene_affected # mut2 has affected 2 genes
mutation2        2gene_affected
mutation3        NO_gene_affected # there is also this. This can be filtered previously.

How can I count the following:

number of mutations that affect 1 gene
number of mutations that affect 2 genes
number of mutations that affect 3 genes
number of mutations that affect 4 genes
number of mutations that affect 5 genes
number of mutations that affect > 5 but <10
number of mutations that affect >10 but <20
number of mutations that affect >30 genes

I would like to save these values in variables and call a function I already created that saves statistics data in a file.

CodePudding user response：

Suppose the columns of your dataframe are ["mutation", "gene"]. Calling value_counts on the mutation column gives the number of occurrences of each mutation, which is the number of rows (genes) listed for it. A comparison method such as eq or ge is then enough. For instance, to find all mutations affecting exactly X genes:

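# Map each row's mutation to its occurrence count so the mask aligns with df's index
counts = df["mutation"].value_counts()
mask_eq_X = df["mutation"].map(counts).eq(X)
print(df[mask_eq_X])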
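To get the totals the question asks for (how many mutations hit 1 gene, 2 genes, and so on), a minimal sketch along the same lines might look like this. The column names "mutation" and "gene" are assumed as above, the sample rows are copied from the question, and the variable names are only illustrative:

```python
import pandas as pd

# Sample rows from the question; the column names are an assumption
df = pd.DataFrame({
    "mutation": ["mutation1", "mutation2", "mutation2", "mutation3"],
    "gene": ["1gene_affected", "1gene_affected", "2gene_affected", "NO_gene_affected"],
})

# Drop the rows that mark "no gene affected" before counting
hits = df[df["gene"] != "NO_gene_affected"]

# Number of distinct genes listed for each mutation
genes_per_mutation = hits.groupby("mutation")["gene"].nunique()

# Number of mutations for each gene count (index: genes affected, value: mutations)
mutations_per_count = genes_per_mutation.value_counts().sort_index()

# Individual totals stored in plain variables
n_one_gene = int(genes_per_mutation.eq(1).sum())
n_two_genes = int(genes_per_mutation.eq(2).sum())
n_more_than_five = int(genes_per_mutation.gt(5).sum())
print(n_one_gene, n_two_genes, n_more_than_five)  # prints: 1 1 0
```

The plain integers can then be passed to whatever function writes your statistics file.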

CodePudding user response：

Clean your second column then use pd.cut:

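import numpy as np
import pandas as pd

# Turn 'NO_gene_affected' into '0gene_affected', then extract the leading number
count = df['mutation'].str.replace('NO_', '0') \
                      .str.extract(r'^(\d+)', expand=False).astype(int)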

# With right=False each bin is [left, right), so the sixth bin covers 5 to 9
lbls = ['No gene', '1 gene', '2 genes', '3 genes', '4 genes', '5 to 9 genes',
        'between 10 and 20', 'between 20 and 30', 'more than 30 genes']
bins = [-np.inf, 1, 2, 3, 4, 5, 10, 20, 30, np.inf]

df['group'] = pd.cut(count, bins=bins, labels=lbls, right=False)

out = df.value_counts('group', sort=False)

Output:

>>> out
group
No gene               1
1 gene                2
2 genes               1
3 genes               0
4 genes               0
5 to 9 genes          0
between 10 and 20     0
between 20 and 30     0
more than 30 genes    0
dtype: int64
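
If it is unclear which values land in which bin, a quick check such as the one below can help. It is only a sketch with made-up values. With right=False every bin includes its left edge and excludes its right edge, so 5 and 9 share the bin [5, 10) while 10 starts the next one:

```python
import numpy as np
import pandas as pd

# Made-up gene counts chosen to sit on or near the bin edges
values = pd.Series([0, 1, 5, 9, 10, 29, 30])
bins = [-np.inf, 1, 2, 3, 4, 5, 10, 20, 30, np.inf]

# Without labels, pd.cut returns the half-open intervals themselves
intervals = pd.cut(values, bins=bins, right=False)
print(pd.DataFrame({"value": values, "bin": intervals}))
```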

Setup:

>>> df
        name          mutation
0  mutation1    1gene_affected
1  mutation2    1gene_affected
2  mutation2    2gene_affected
3  mutation3  NO_gene_affected
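
Note that the grouping above is done per row. If you need the number of mutations per gene count instead, one possible variant is to count the rows of each mutation first and bin that result. This is a sketch under assumptions: the bin edges and labels below are my own reading of the ranges in the question (which does not say where exactly 10, 20 or 30 belong), and the name stats is only illustrative:

```python
import numpy as np
import pandas as pd

# Same frame as in the setup above
df = pd.DataFrame({
    "name": ["mutation1", "mutation2", "mutation2", "mutation3"],
    "mutation": ["1gene_affected", "1gene_affected", "2gene_affected", "NO_gene_affected"],
})

# Rows flagged NO_gene_affected contribute 0 genes; every other row contributes 1
affected = ~df["mutation"].str.startswith("NO_")
genes_per_mutation = affected.groupby(df["name"]).sum()

# Assumed bin edges, half-open on the right: [0,1), [1,2), ..., [6,10), [10,20), ...
bins = [-np.inf, 1, 2, 3, 4, 5, 6, 10, 20, 30, np.inf]
labels = ["0", "1", "2", "3", "4", "5", "6-9", "10-19", "20-29", "30+"]
groups = pd.cut(genes_per_mutation, bins=bins, labels=labels, right=False)

# One entry per bin, including the empty ones, in bin order
stats = groups.value_counts(sort=False).to_dict()
print(stats)
```

Each value in stats can then be read out by its label, for example stats["1"], and handed to your own function that saves the statistics.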