Home > Back-end >  Separation of the dataframes by row values
Separation of the dataframes by row values

Time:07-09

I want to split my dataframe based on the first row to generate four separate dataframes (for subgroup analysis). I have a 172x106 Excel file, where the first row consists of either a 1, 2, 3, or 4. The other 171 lines are radiomic features, which I want to copy to the ''new'' dataset. The columns do not have headernames. My data looks like the following:

{0: [4.0, 0.65555056, 0.370511262, 16.5876203, 44.76954415, 48.0, 32.984845, 49.47726751, 49.47726751, 13133.33333, 29.34869973, 0.725907513, 3708.396349, 0.282365204, 13696.0, 2.122884402, 3.039611259, 1419.058749, 1.605529827, 0.488297449], 1: [2.0, 0.82581372, 0.33201741, 20.65753167, 62.21821817, 50.59644256, 62.60990337, 55.56977596, 77.35631842, 23890.66667, 51.38065822, 0.521666786, 7689.706847, 0.321870752, 25152.0, 1.022813615, 1.360453239, 548.2156387, 0.314035581, 0.181204079]}


I wanted to use groupby, but since the column headers have no name, it makes it hard to implement. This is my current code:

import numpy as np
df = pd.read_excel(r'H:\Documenten\MATLAB\sample_file.xlsx',header=None)
Class_1=df.groupby(df.T.loc[:,0])
df_new = Class_1.get_group("1")
print(df_new) 

The error I get is the following:

Traceback (most recent call last):
  File "H:/PycharmProjects/RadiomicsPipeline/spearman_subgroups.py", line 5, in <module>
    df_new = Class_1.get_group("1")
  File "C:\Users\cpullen\AppData\Roaming\Python\Python37\site-packages\pandas\core\groupby\groupby.py", line 754, in get_group
    raise KeyError(name)
KeyError: '1'

How do I implement the separation of the dataframes by row values?

CodePudding user response:

I am not sure that this is the result you want. If not, please clearly show the desired output.

You can simply achieve what you want by transposing your dataframe.

import numpy as np
import pandas as pd

np.random.seed(0)

n = 5
data = np.stack((np.random.choice([1, 2, 3, 4], n), np.random.rand(n), np.random.rand(n), np.random.rand(n)), axis=0)
df = pd.DataFrame(data)

df.head():

0   1   2   3   4
0   1.000000    4.000000    2.000000    1.000000    4.000000
1   0.857946    0.847252    0.623564    0.384382    0.297535
2   0.056713    0.272656    0.477665    0.812169    0.479977
3   0.392785    0.836079    0.337396    0.648172    0.368242
df_transpose = df.transpose()
df_transpose.columns = ['group']   list(df_transpose.columns.values[1:])

df_transpose.head():

group   1   2   3
0   1.0 0.857946    0.056713    0.392785
1   4.0 0.847252    0.272656    0.836079
2   2.0 0.623564    0.477665    0.337396
3   1.0 0.384382    0.812169    0.648172
4   4.0 0.297535    0.479977    0.368242
list_df = dict()

for group in set(df_transpose['group']):
    list_df[group] = df_transpose.loc[df_transpose['group'] == group]

list_df:

{
 1.0:
     group         1         2         3
     0    1.0  0.857946  0.056713  0.392785
     3    1.0  0.384382  0.812169  0.648172,
 2.0:
     group         1         2         3
     2    2.0  0.623564  0.477665  0.337396,
 4.0:
     group         1         2         3
     1    4.0  0.847252  0.272656  0.836079
     4    4.0  0.297535  0.479977  0.368242
}

Individual dataframes:

list_df[1.0], list_df[2.0], list_df[3.0] (which does not exist in this example), list_df[4.0]

(BTW. I defined a wrong variable name. The name should be dict_df, not list_df.)

CodePudding user response:

First of all, after importing the dataframe, sort the value of the first row in order

df
Out[26]: 
   0  1  2  3
0  0  1  0  1
1  1  2  5  8
2  2  3  6  9
3  3  4  7  0

df = df.sort_values(by = 0, axis = 1)

Out[30]:
   0  2  1  3
0  0  0  1  1
1  1  5  2  8
2  2  6  3  9
3  3  7  4  0

You should have a dataframe with ordered 1st row values. After that, you can use df.iloc to rename your column name and you will drop the first row.

df.columns = df.iloc[0]

0  0  0  1  1
0  0  0  1  1
1  1  5  2  8
2  2  6  3  9
3  3  7  4  0

df.drop(0, inplace=True)

df.drop(0)
Out[51]: 
0  0  0  1  1
1  1  5  2  8
2  2  6  3  9
3  3  7  4  0

Eventually, you can do slicing based on the column name.

df_1 = df[1]

df_1
Out[56]: 
0  1  1
1  2  8
2  3  9
3  4  0
  • Related