Home > Software engineering >  Formating a data frame reducing column and increasing rows
Formating a data frame reducing column and increasing rows

Time:05-05

I have a pandas data frame like this

data = [['NGS_34',0.47,0.55,0.51,0.53,0.54,0.62], ['NGS_38',0.52,0.52,0.49,0.51,0.52,0.45]]
 
df = pd.DataFrame(data, columns = ['probes','01_PLAGL1', '02_PLAGL1','H19','H19', 'H19','GNAS_A/B'])
df = df.set_index('probes')
df

             01_PLAGL1  02_PLAGL1   H19   H19   H19  GNAS_A/B
probes                                                  
NGS_34       0.47       0.55  0.51  0.53  0.54      0.62
NGS_38       0.52       0.52  0.49  0.51  0.52      0.45

This is actually a minimal reproducible example. The real data frame is formed by many paired columns like the example 01_PLAGL1 02_PLAGL1, then 2 sets of three columns like the example H19 H19 H19 and 2 unique columns. With this explanation and the columns of my real dataset below, I think you will understand the input data of my problem.

data_no_control.columns.values

array(['PLAGL1', 'PLAGL1', 'GRB10', 'GRB10', 'MEST', 'MEST', 'H19', 'H19',
       'H19', 'KCNQ1OT1', 'KCNQ1OT1', 'MEG3', 'MEG3', 'MEG8', 'MEG8',
       'SNRPN', 'SNRPN', 'PEG3', 'PEG3', 'PEG3', 'NESP55', 'GNAS-AS1',
       'GNASXL', 'GNASXL', 'GNAS_A/B'], dtype=object)

The final output I would like to achieve should be like this

            01_PLAGL1     H19      GNAS A/B
probes                                                  
NGS_34       0.47         0.51      0.62
             0.55         0.53
                          0.54
(One empty row)
(Second empty row)
NGS_38       0.52         0.49      0.45
             0.52         0.51
                          0.52
(One empty row)
(Second empty row)
NGS_41 ...

I have tried this

df = data_no_control.reset_index(level=0)


empty_rows = 5
df.index = range(0, empty_rows*len(df), empty_rows)
new_df = df.reindex(index=range(empty_rows*len(df)))

new_df = new_df.set_index('index')

new_df

index        01_PLAGL1  02_PLAGL1   H19   H19   H19  GNAS_A/B
                                                  
NGS_34       0.47       0.55  0.51  0.53  0.54      0.62
NaN          NaN         NaN  NaN    NaN. NaN       NaN
NaN          NaN         NaN  NaN    NaN. NaN       NaN
NaN          NaN         NaN  NaN    NaN. NaN       NaN
NaN          NaN         NaN  NaN    NaN. NaN       NaN
NGS_38       0.52       0.52  0.49  0.51  0.52      0.45
NaN          NaN         NaN  NaN    NaN. NaN       NaN
NaN          NaN         NaN  NaN    NaN. NaN       NaN
NaN          NaN         NaN  NaN    NaN. NaN       NaN
NaN          NaN         NaN  NaN    NaN. NaN       NaN

CodePudding user response:

Use:

data = [['NGS_34',0.47,0.55,0.51,0.53,0.54,0.62], ['NGS_38',0.52,0.52,0.49,0.51,0.52,0.45]]

df = pd.DataFrame(data, columns = ['probes','01_PLAGL1', '02_PLAGL1','H19','H19', 'H19','GNAS_A/B'])
df = df.set_index('probes')

#No of new rows
new = 2
#reove values before _ for pairs columns names
s = df.columns.str.split('_').str[-1].to_series()
#create Multiindex by counter
df.columns = [s, s.groupby(s).cumcount()]
#reshape
df = df.stack()
#create MultiIndex for add new rows and original order in columns names
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  np.arange(df.index.levels[1].max()   new   1)])
df = df.reindex(index=mux, columns=s.unique())
print (df)
          PLAGL1   H19   A/B
probes                      
NGS_34 0    0.47  0.51  0.62
       1    0.55  0.53   NaN
       2     NaN  0.54   NaN
       3     NaN   NaN   NaN
       4     NaN   NaN   NaN
NGS_38 0    0.52  0.49  0.45
       1    0.52  0.51   NaN
       2     NaN  0.52   NaN
       3     NaN   NaN   NaN
       4     NaN   NaN   NaN

Last if need empty values instead misisng values and no counter level use:

df = df.droplevel(1).fillna('')
df.index = df.index.where(~df.index.duplicated(), '')
print (df)
       PLAGL1   H19   A/B
probes                   
NGS_34   0.47  0.51  0.62
         0.55  0.53      
               0.54      
                         
                         
NGS_38   0.52  0.49  0.45
         0.52  0.51      
               0.52      
                         

EDIT: In real data are not duplicates, so ouput is different:

d = {'PLAGL1': {'NGS_34': 0.55, 'NGS_38': 0.52}, 'GRB10': {'NGS_34': 0.48, 'NGS_38': 0.49}, 'MEST': {'NGS_34': 0.56, 'NGS_38': 0.5}, 'H19': {'NGS_34': 0.54, 'NGS_38': 0.52}, 'KCNQ1OT1': {'NGS_34': 0.41, 'NGS_38': 0.49}, 'MEG3': {'NGS_34': 0.5, 'NGS_38': 0.55}, 'MEG8': {'NGS_34': 0.46, 'NGS_38': 0.5}, 'SNRPN': {'NGS_34': 0.55, 'NGS_38': 0.46}, 'PEG3': {'NGS_34': 0.51, 'NGS_38': 0.51}, 'NESP55': {'NGS_34': 0.55, 'NGS_38': 0.53}, 'GNAS-AS1': {'NGS_34': 0.52, 'NGS_38': 0.48}, 'GNASXL': {'NGS_34': 0.49, 'NGS_38': 0.44}, 'GNAS A/B': {'NGS_34': 0.62, 'NGS_38': 0.45}}

df = pd.DataFrame(d)
print (df)

        PLAGL1  GRB10  MEST   H19  KCNQ1OT1  MEG3  MEG8  SNRPN  PEG3  NESP55  \
NGS_34    0.55   0.48  0.56  0.54      0.41  0.50  0.46   0.55  0.51    0.55   
NGS_38    0.52   0.49  0.50  0.52      0.49  0.55  0.50   0.46  0.51    0.53   

        GNAS-AS1  GNASXL  GNAS A/B  
NGS_34      0.52    0.49      0.62  
NGS_38      0.48    0.44      0.45  

#No of new rows
new = 2
#reove values before _ for pairs columns names
s = df.columns.str.split('_').str[-1].to_series()
#create Multiindex by counter
df.columns = [s, s.groupby(s).cumcount()]
#reshape
df = df.stack()
#create MultiIndex for add new rows and original order in columns names
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  np.arange(df.index.levels[1].max()   new   1)])
df = df.reindex(index=mux, columns=s.unique())

print (df)
          PLAGL1  GRB10  MEST   H19  KCNQ1OT1  MEG3  MEG8  SNRPN  PEG3  \
NGS_34 0    0.55   0.48  0.56  0.54      0.41  0.50  0.46   0.55  0.51   
       1     NaN    NaN   NaN   NaN       NaN   NaN   NaN    NaN   NaN   
       2     NaN    NaN   NaN   NaN       NaN   NaN   NaN    NaN   NaN   
NGS_38 0    0.52   0.49  0.50  0.52      0.49  0.55  0.50   0.46  0.51   
       1     NaN    NaN   NaN   NaN       NaN   NaN   NaN    NaN   NaN   
       2     NaN    NaN   NaN   NaN       NaN   NaN   NaN    NaN   NaN   

          NESP55  GNAS-AS1  GNASXL  GNAS A/B  
NGS_34 0    0.55      0.52    0.49      0.62  
       1     NaN       NaN     NaN       NaN  
       2     NaN       NaN     NaN       NaN  
NGS_38 0    0.53      0.48    0.44      0.45  
       1     NaN       NaN     NaN       NaN  
       2     NaN       NaN     NaN       NaN  
  • Related