Home > OS >  Single row (from flatten DataFrame) into a DataFrame in Python
Single row (from flatten DataFrame) into a DataFrame in Python

Time:11-01

I have a .csv file containing one million of rows. Each row corresponds to two flatten DataFrames and the index correspond to a unique attribute.

For example in my .csv file my the first two rows are like :

S1_1d_A S1_2d_A S1_3d_A S1_1d_B S1_2d_B S1_3d_B S1_1d_C S1_2d_C S1_3d_C S2_1d_A S2_2d_A S2_3d_A S2_1d_B S2_2d_B S2_3d_B S2_1d_C S2_2d_C S2_3d_C
1657 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

The convention used for the name of the columns is the following: SX_R_C.

  • X the type of DataFrame for the corresponding index (each index has two DataFrame)
  • R : The corresponding row in the original DataFrame
  • C: The corresponding column in the original DataFrame

I would like to recreate the corresponding DataFrames for each rows.

So, for example, for the index 1657 I would like to obtain the two following DataFrames :

For S1 :

A B C
1d 1 4 7
2d 2 5 8
3d 3 6 9

For S2 :

A B C
1d 10 13 16
2d 11 14 17
3d 12 15 18

I could do it very easily with loops but the execution time would be too high with 1 million lines. Is there a way to do it easily without looping?

N.B.1: I think the method used to create my .csv file was similar to that one : Flatten DataFrame into a single row

N.B 2 : I have more than 3 rows and 3 columns in reality. It's around 10x10

Thank you for your help!

CodePudding user response:

You can rework the column index to MultiIndex, the rest is a simple reshaping (here using stack) and grouping to split the dataframe:

df.columns = pd.MultiIndex.from_arrays(zip(*df.columns.map(lambda x: x.split('_'))))
dfs = {k:g for k,g in df.stack(level=[0,1]).droplevel(0).groupby(level=0)}

output:

>>> dfs
{'S1':     A  B  C
 1d  1  4  7
 2d  2  5  8
 3d  3  6  9,
 'S2':      A   B   C
 1d  10  13  16
 2d  11  14  17
 3d  12  15  18}

>>> dfs['S1']
    A  B  C
1d  1  4  7
2d  2  5  8
3d  3  6  9

CodePudding user response:

First, convert the column headers into a MultiIndex with 3 levels by splitting the column labels.

>>> df.columns = df.columns.map(lambda col: tuple(col.split("_")))
>>> df

     S1                          S2                                
     1d 2d 3d 1d 2d 3d 1d 2d 3d  1d  2d  3d  1d  2d  3d  1d  2d  3d
      A  A  A  B  B  B  C  C  C   A   A   A   B   B   B   C   C   C
1657  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18

Then turn the first two column levels into index levels by using DataFrame.stack, and drop the old index label using DataFrame.droplevel

>>> df = df.stack(level=[0,1]).droplevel(0)
>>> df

        A   B   C
S1 1d   1   4   7
   2d   2   5   8
   3d   3   6   9
S2 1d  10  13  16
   2d  11  14  17
   3d  12  15  18

Finally, group the resulting DataFrame by the first index level (S1 and S2), and create a dictionary with the format {'S1': df_S1, 'S2': df_S2}.

>>> df_dict = {Si: group.droplevel(0) for Si, group in df.groupby(level=0)}
>>> df_dict['S1']

    A  B  C
1d  1  4  7
2d  2  5  8
3d  3  6  9

>>> df_dict['S2']

     A   B   C
1d  10  13  16
2d  11  14  17
3d  12  15  18
  • Related