Transform and fill a dataframe depending on occurence of values within the dataframe-CodePudding

I have a dataframe such as :

Names1 Gene_name Status
SP1    GENE1     0
SP1    GENE1     1
SP1    GENE1     1
SP1    GENE1     2
SP1    GENE1     2
SP1    GENE2     0
SP3    GENE2     0
SP1    GENE2     1
SP2    GENE2     2
SP4    GENE3     1
SP4    GENE3     2
SP5    GENE3     0    
SP5    GENE3     0

Then I would like to fill a new dataframe where each Gene_name is a column, and each Names is a row :

Names GENE1  GENE2 GENE3
SP1   
SP2   
SP3   
SP4   
SP5

and fill cells Values depending on the Satus for each Names groups

if only 0 > value = 0
if only 1 > value = 1
if both 0 & 1 > value = 0-1
if both 0 & 2 > value = 0-2
if both 1 & 2 > value = 1-2
if both 0 & 1 & 2 > value = 0-1-2

So for example GENE1 in SP1 both present a 0,1 and 2 status, so I fill 0-1-2 within the cell:

Names GENE1  GENE2 GENE3
SP1   0-1-2
SP2   
SP3   
SP4   
SP5

then, SP2,SP3,SP4 and SP5 do not have value for the GENE1, so I put NA :

Names GENE1  GENE2 GENE3
SP1   0-1-2
SP2   NA
SP3   NA
SP4   NA
SP5   NA

Then for the GENE2:

GENE2 in SP1 both present a 0 and 1 status, so I fill 0-1 within the cell:

Names GENE1  GENE2 GENE3
SP1   0-1-2  0-1
SP2   NA
SP3   NA
SP4   NA
SP5   NA

GENE2 in SP2 present only a value 2 status, so I fill 2 within the cell:

Names GENE1  GENE2 GENE3
SP1   0-1-2  0-1
SP2   NA     2
SP3   NA
SP4   NA
SP5   NA

GENE2 in SP3 present only a value 0 status, so I fill 0 within the cell:

Names GENE1  GENE2 GENE3
SP1   0-1-2  0-1
SP2   NA     2
SP3   NA     0
SP4   NA
SP5   NA

and the other Names have no GENE2 values, so I put NA:

Names GENE1  GENE2 GENE3
SP1   0-1-2  0-1
SP2   NA     2
SP3   NA     0
SP4   NA     NA
SP5   NA     NA

and so on... At the end I should get a full dataframe such as :

Names GENE1  GENE2 GENE3
SP1   0-1-2  0-1   NA 
SP2   NA     2     NA
SP3   NA     0     NA
SP4   NA     NA    0-2
SP5   NA     NA    0

Does someone have an idea please ?

Here is the dict format of the dataframe if it can helps :

{'Names1': {0: 'SP1', 1: 'SP1', 2: 'SP1', 3: 'SP1', 4: 'SP1', 5: 'SP1', 6: 'SP3', 7: 'SP1', 8: 'SP2', 9: 'SP4', 10: 'SP4', 11: 'SP5', 12: 'SP5'}, 'Gene_name': {0: 'GENE1', 1: 'GENE1', 2: 'GENE1', 3: 'GENE1', 4: 'GENE1', 5: 'GENE2', 6: 'GENE2', 7: 'GENE2', 8: 'GENE2', 9: 'GENE3', 10: 'GENE3', 11: 'GENE3', 12: 'GENE3'}, 'Status': {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 0, 6: 0, 7: 1, 8: 2, 9: 1, 10: 2, 11: 0, 12: 0}}

CodePudding user response：

Code

g = df.groupby(['Names1', 'Gene_name'])
g['Status'].agg(lambda x: '-'.join(x.astype('str').sort_values().unique())).unstack()

output

Gene_name   GENE1   GENE2   GENE3
Names1          
SP1         0-1-2   0-1     NaN
SP2         NaN     2       NaN
SP3         NaN     0       NaN
SP4         NaN     NaN     1-2
SP5         NaN     NaN     0

make desired output

(g['Status'].agg(lambda x: '-'.join(x.astype('str').sort_values().unique()))
 .unstack().rename_axis(index='Name', columns=''))

result:

        GENE1   GENE2   GENE3
Name            
SP1     0-1-2   0-1     NaN
SP2     NaN     2       NaN
SP3     NaN     0       NaN
SP4     NaN     NaN     1-2
SP5     NaN     NaN     0

CodePudding user response：

The above solution would be neater, but just wanted to put out an alternative solution to the same:

import numpy as np

names = df['Names1'].unique()  
genes = df['Gene_name'].unique() 
result_df = pd.DataFrame({'Names': names}) 

for gene in genes: 
    values = []
    for name in names: 
        result = '-'.join(map(str, count_df.loc[(count_df['Names1'] == name) & (count_df['Gene_name'] == gene), ['Status']]['Status'].to_numpy()))
        if result == '':
            values.append(np.nan) 
        else:
            values.append(result) 

    result_df[gene] = values 

result_df

Output

        GENE1   GENE2   GENE3
Names            
SP1     0-1-2   0-1     NaN
SP2     NaN     2       NaN
SP3     NaN     0       NaN
SP4     NaN     NaN     1-2
SP5     NaN     NaN     0

CodePudding user response：

with using pivot table the solutiont can looks like this:

df.pivot_table('Status','Names1','Gene_name',
               aggfunc=lambda x: '-'.join(x.astype(str).unique())).rename_axis(columns=None)
>>>
'''
           GENE1 GENE2 GENE3
Names1                      
SP1        0-1-2   0-1   NaN
SP2          NaN     2   NaN
SP3          NaN     0   NaN
SP4          NaN   NaN   1-2
SP5          NaN   NaN     0