I have a data frame that is created by an application and saves info with the following structure: Each row is one mutation that affects genes and transcripts (these are the same gene but different configuration)
data = {'ID': ['mut1', 'mut1', 'mut1', 'mut1', 'mut2'],
'transcript_affected': ["00001", "00002", "00003", "00001", "00001"],
'gene_affected' : ['DIABLO','DIABLO','DIABLO','PLNH3','BRCA1']
}
df = pd.DataFrame(data)
df
ID transcript_affected gene_affected
mut1 00001 DIABLO
mut1 00002 DIABLO
mut1 00003 DIABLO
mut1 00001 PLNH3
mut2 00001 BRCA1
# Mut 1 affects 2 genes (DIABLO and PLNH3), mut2 affects 1 gene
From this I have two questions:
How many mutations affect 1 gene.
How likely (in %) more than one transcript is affected when one gene is affected. I mean, proportion of genes with more than one transcript. For this example would be 33% (there is 3 genes affected but only one affected more than 1)
Some ideas I think it could be done
To know how many mutations affect more than 1 gene I think this would be a start
df.groupby('ID')['transcript_affected'].count()
ID
mut1 4
mut2 1
And then I could count how many IDs its value is more than 1
For the second question
df.groupby('gene_affected')['transcript_affected'].count()
gene_affected
BRCA1 1
DIABLO 3
PLNH3 1
Name: transcript_affected, dtype: int64
Then I could count somehow (I don't) how many were more than 1 (>=2).
CodePudding user response:
please try this:
Answer 1
:
df_1 = df.groupby('ID')['gene_affected'].count().reset_index()
answer_1 = df_1[df_1['gene_affected']>1].shape[0]
Answer 2
:
df_2 = df.groupby('gene_affected')['transcript_affected'].count().reset_index()
answer_2 = (df_2[df_2['transcript_affected']>1].shape[0]/df_2.shape[0]) * 100