Counting occurrences with conditions in a pandas data frame-CodePudding

I have a data frame that is created by an application and saves info with the following structure: Each row is one mutation that affects genes and transcripts (these are the same gene but different configuration)

data = {'ID': ['mut1', 'mut1', 'mut1', 'mut1', 'mut2'],
       'transcript_affected': ["00001", "00002", "00003", "00001", "00001"],
       'gene_affected' : ['DIABLO','DIABLO','DIABLO','PLNH3','BRCA1']
       }
df = pd.DataFrame(data)

df
ID      transcript_affected          gene_affected
mut1       00001                       DIABLO    
mut1       00002                       DIABLO
mut1       00003                       DIABLO
mut1       00001                       PLNH3
mut2       00001                       BRCA1

# Mut 1 affects 2 genes (DIABLO and PLNH3), mut2 affects 1 gene

From this I have two questions:

How many mutations affect 1 gene.
How likely (in %) more than one transcript is affected when one gene is affected. I mean, proportion of genes with more than one transcript. For this example would be 33% (there is 3 genes affected but only one affected more than 1)

Some ideas I think it could be done

To know how many mutations affect more than 1 gene I think this would be a start

df.groupby('ID')['transcript_affected'].count()
ID
mut1    4
mut2    1

And then I could count how many IDs its value is more than 1

For the second question

df.groupby('gene_affected')['transcript_affected'].count()
gene_affected
BRCA1     1
DIABLO    3
PLNH3     1
Name: transcript_affected, dtype: int64

Then I could count somehow (I don't) how many were more than 1 (>=2).

CodePudding user response：

please try this:

Answer 1:

df_1 = df.groupby('ID')['gene_affected'].count().reset_index()
answer_1 = df_1[df_1['gene_affected']>1].shape[0]

Answer 2:

df_2 = df.groupby('gene_affected')['transcript_affected'].count().reset_index()
answer_2 = (df_2[df_2['transcript_affected']>1].shape[0]/df_2.shape[0]) * 100