Home > OS >  How to unstack a dataset to a certain dataframe?
How to unstack a dataset to a certain dataframe?

Time:11-18

I have a dataset like this

data = {'weight': ['NaN',2,3,4,'NaN',6,7,8,9,'NaN',11,12,13,14,15],
        'MI': ['NaN', 21, 19, 18, 'NaN',16,15,14,13,'NaN',11,10,9,8,7]}
df = pd.DataFrame(data, index= ['group1', "gene1", "gene2", 'gene3',
                      'group2', "gene1", 'gene21', 'gene4', 'gene7', 'group3',
                      'gene2', 'gene10', 'gene3', 'gene43', 'gene1'])

enter image description here

I need to stack it to gene by group dataframe with MI values. If there are no gene values for particular group the imputated value should be 0.1. 'weights' column should be removed. The final dataframe should look like this

enter image description here

CodePudding user response:

You can use:

m = df['weight'].ne('NaN')

(df[m]
   .set_index((~m).cumsum()[m], append=True)['MI']
   .unstack('weight', fill_value=0.1)
   .add_prefix('group')
 )

Variant with pivot:

m = df['weight'].ne('NaN')

(df.assign(col=(~m).cumsum())
   .loc[m]
   .pivot(columns='col', values='MI')
   .fillna(0.1)
   .add_prefix('group')
 )

Output:

weight group1 group2 group3
gene1      21     16      7
gene10    0.1    0.1     10
gene2      19    0.1     11
gene21    0.1     15    0.1
gene3      18    0.1      9
gene4     0.1     14    0.1
gene43    0.1    0.1      8
gene7     0.1     13    0.1

CodePudding user response:

Another option, still with pivot:

(df
.assign(
     grp = np.where(df.weight.eq('NaN'), df.index, np.nan), 
     group = lambda df: df.grp.ffill())
.loc[lambda df: df.weight.ne('NaN'), ['MI', 'group']]
.pivot(index=None, columns='group', values='MI')
.fillna(0.1)
)

group   group1  group2  group3
gene1     21.0    16.0     7.0
gene10     0.1     0.1    10.0
gene2     19.0     0.1    11.0
gene21     0.1    15.0     0.1
gene3     18.0     0.1     9.0
gene4      0.1    14.0     0.1
gene43     0.1     0.1     8.0
gene7      0.1    13.0     0.1

Another process, for the fun of it, and possibly an inkling into the complexities that pivot abstracts:

# get positions for group1,2,3:
positions = df.weight.eq('NaN').to_numpy().nonzero()[0]
groups = df.index[positions]

#get lengths of rows between group1,2,3:
positions = np.diff(np.append(positions, len(df))) - 1

# compute the cumulative positions and pass it to np.split:
arrays = np.split(df.MI.loc[df.MI.ne('NaN')], np.cumsum(positions))

#concatenate the arrays, get the final output:
pd.concat(arrays, axis = 1, keys = groups).fillna(0.1)

        group1  group2  group3
gene1     21.0    16.0     7.0
gene2     19.0     0.1    11.0
gene3     18.0     0.1     9.0
gene21     0.1    15.0     0.1
gene4      0.1    14.0     0.1
gene7      0.1    13.0     0.1
gene10     0.1     0.1    10.0
gene43     0.1     0.1     8.0
  • Related