I have a dataframe such as :
Names1 Gene_name Status
SP1 GENE1 0
SP1 GENE1 1
SP1 GENE1 1
SP1 GENE1 2
SP1 GENE1 2
SP1 GENE2 0
SP3 GENE2 0
SP1 GENE2 1
SP2 GENE2 2
SP4 GENE3 1
SP4 GENE3 2
SP5 GENE3 0
SP5 GENE3 0
Then I would like to fill a new dataframe where each Gene_name
is a column, and each Names
is a row :
Names GENE1 GENE2 GENE3
SP1
SP2
SP3
SP4
SP5
and fill cells Values
depending on the Satus
for each Names
groups
- if only 0 > value = 0
- if only 1 > value = 1
- if both 0 & 1 > value = 0-1
- if both 0 & 2 > value = 0-2
- if both 1 & 2 > value = 1-2
- if both 0 & 1 & 2 > value = 0-1-2
So for example GENE1
in SP1
both present a 0,1 and 2 status, so I fill 0-1-2
within the cell:
Names GENE1 GENE2 GENE3
SP1 0-1-2
SP2
SP3
SP4
SP5
then, SP2,SP3,SP4 and SP5
do not have value for the GENE1
, so I put NA
:
Names GENE1 GENE2 GENE3
SP1 0-1-2
SP2 NA
SP3 NA
SP4 NA
SP5 NA
Then for the GENE2:
GENE2
in SP1
both present a 0 and 1 status, so I fill 0-1
within the cell:
Names GENE1 GENE2 GENE3
SP1 0-1-2 0-1
SP2 NA
SP3 NA
SP4 NA
SP5 NA
GENE2
in SP2
present only a value 2 status, so I fill 2
within the cell:
Names GENE1 GENE2 GENE3
SP1 0-1-2 0-1
SP2 NA 2
SP3 NA
SP4 NA
SP5 NA
GENE2
in SP3
present only a value 0 status, so I fill 0
within the cell:
Names GENE1 GENE2 GENE3
SP1 0-1-2 0-1
SP2 NA 2
SP3 NA 0
SP4 NA
SP5 NA
and the other Names have no GENE2
values, so I put NA
:
Names GENE1 GENE2 GENE3
SP1 0-1-2 0-1
SP2 NA 2
SP3 NA 0
SP4 NA NA
SP5 NA NA
and so on... At the end I should get a full dataframe such as :
Names GENE1 GENE2 GENE3
SP1 0-1-2 0-1 NA
SP2 NA 2 NA
SP3 NA 0 NA
SP4 NA NA 0-2
SP5 NA NA 0
Does someone have an idea please ?
Here is the dict format of the dataframe if it can helps :
{'Names1': {0: 'SP1', 1: 'SP1', 2: 'SP1', 3: 'SP1', 4: 'SP1', 5: 'SP1', 6: 'SP3', 7: 'SP1', 8: 'SP2', 9: 'SP4', 10: 'SP4', 11: 'SP5', 12: 'SP5'}, 'Gene_name': {0: 'GENE1', 1: 'GENE1', 2: 'GENE1', 3: 'GENE1', 4: 'GENE1', 5: 'GENE2', 6: 'GENE2', 7: 'GENE2', 8: 'GENE2', 9: 'GENE3', 10: 'GENE3', 11: 'GENE3', 12: 'GENE3'}, 'Status': {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 0, 6: 0, 7: 1, 8: 2, 9: 1, 10: 2, 11: 0, 12: 0}}
CodePudding user response:
Code
g = df.groupby(['Names1', 'Gene_name'])
g['Status'].agg(lambda x: '-'.join(x.astype('str').sort_values().unique())).unstack()
output
Gene_name GENE1 GENE2 GENE3
Names1
SP1 0-1-2 0-1 NaN
SP2 NaN 2 NaN
SP3 NaN 0 NaN
SP4 NaN NaN 1-2
SP5 NaN NaN 0
make desired output
(g['Status'].agg(lambda x: '-'.join(x.astype('str').sort_values().unique()))
.unstack().rename_axis(index='Name', columns=''))
result:
GENE1 GENE2 GENE3
Name
SP1 0-1-2 0-1 NaN
SP2 NaN 2 NaN
SP3 NaN 0 NaN
SP4 NaN NaN 1-2
SP5 NaN NaN 0
CodePudding user response:
The above solution would be neater, but just wanted to put out an alternative solution to the same:
import numpy as np
names = df['Names1'].unique()
genes = df['Gene_name'].unique()
result_df = pd.DataFrame({'Names': names})
for gene in genes:
values = []
for name in names:
result = '-'.join(map(str, count_df.loc[(count_df['Names1'] == name) & (count_df['Gene_name'] == gene), ['Status']]['Status'].to_numpy()))
if result == '':
values.append(np.nan)
else:
values.append(result)
result_df[gene] = values
result_df
Output
GENE1 GENE2 GENE3
Names
SP1 0-1-2 0-1 NaN
SP2 NaN 2 NaN
SP3 NaN 0 NaN
SP4 NaN NaN 1-2
SP5 NaN NaN 0
CodePudding user response:
with using pivot table the solutiont can looks like this:
df.pivot_table('Status','Names1','Gene_name',
aggfunc=lambda x: '-'.join(x.astype(str).unique())).rename_axis(columns=None)
>>>
'''
GENE1 GENE2 GENE3
Names1
SP1 0-1-2 0-1 NaN
SP2 NaN 2 NaN
SP3 NaN 0 NaN
SP4 NaN NaN 1-2
SP5 NaN NaN 0