I have a .csv file containing one million rows. Each row corresponds to two flattened DataFrames, and the index corresponds to a unique attribute.
For example, the first two rows of my .csv file look like this:
| | S1_1d_A | S1_2d_A | S1_3d_A | S1_1d_B | S1_2d_B | S1_3d_B | S1_1d_C | S1_2d_C | S1_3d_C | S2_1d_A | S2_2d_A | S2_3d_A | S2_1d_B | S2_2d_B | S2_3d_B | S2_1d_C | S2_2d_C | S2_3d_C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1657 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
The naming convention for the columns is SX_R_C, where:
- X: the type of DataFrame for the corresponding index (each index has two DataFrames)
- R: the corresponding row in the original DataFrame
- C: the corresponding column in the original DataFrame
I would like to recreate the corresponding DataFrames for each row.
So, for example, for the index 1657 I would like to obtain the two following DataFrames:
For S1:

| | A | B | C |
|---|---|---|---|
| 1d | 1 | 4 | 7 |
| 2d | 2 | 5 | 8 |
| 3d | 3 | 6 | 9 |
For S2:

| | A | B | C |
|---|---|---|---|
| 1d | 10 | 13 | 16 |
| 2d | 11 | 14 | 17 |
| 3d | 12 | 15 | 18 |
I could do this very easily with loops, but with one million rows the execution time would be too high. Is there a way to do it without looping?
N.B. 1: I think the method used to create my .csv file was similar to this one: Flatten DataFrame into a single row
N.B. 2: In reality I have more than 3 rows and 3 columns; it's around 10x10.
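Here is a minimal snippet that rebuilds the example row shown above (in reality the frame would come from the .csv; the file name below is just a placeholder):
import pandas as pd

# columns follow the SX_R_C convention described above
cols = [f"S{s}_{r}d_{c}" for s in (1, 2) for c in "ABC" for r in (1, 2, 3)]
df = pd.DataFrame([list(range(1, 19))], columns=cols, index=[1657])
# in reality: df = pd.read_csv("my_file.csv", index_col=0)   # placeholder file name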
Thank you for your help!
CodePudding user response:
You can rework the column index into a MultiIndex; the rest is a simple reshaping (here using stack) and grouping to split the dataframe:
# split each label on '_' and use the pieces as a 3-level MultiIndex (S, row, column)
df.columns = pd.MultiIndex.from_arrays(zip(*df.columns.map(lambda x: x.split('_'))))
# stack the S and row levels into the index, drop the csv index, then split by S1/S2
dfs = {k: g.droplevel(0) for k, g in df.stack(level=[0,1]).droplevel(0).groupby(level=0)}
output:
>>> dfs
{'S1': A B C
1d 1 4 7
2d 2 5 8
3d 3 6 9,
'S2': A B C
1d 10 13 16
2d 11 14 17
3d 12 15 18}
>>> dfs['S1']
A B C
1d 1 4 7
2d 2 5 8
3d 3 6 9
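Note that the droplevel(0) right after stack throws away the csv index (1657), so the dictionary above handles one row at a time. If you need the reconstructed frames for every row of the full file, here is a sketch of the same idea that keeps the csv index and groups on it together with the S level (the key layout is my own choice):
# keep the csv index: one small frame per (csv_index, 'S1'/'S2') pair
long = df.stack(level=[0, 1])            # index levels: (csv_index, S, row label)
dfs_all = {key: g.droplevel([0, 1]) for key, g in long.groupby(level=[0, 1])}
frame_1657_s1 = dfs_all[(1657, 'S1')]    # same 3x3 (or 10x10) frame as dfs['S1'] above
With one million rows that dictionary holds two million small frames, so you may prefer to fetch a single group on demand with long.groupby(level=[0, 1]).get_group((1657, 'S1')) instead of materialising them all.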
CodePudding user response:
First, convert the column headers into a MultiIndex with 3 levels by splitting the column labels.
>>> df.columns = df.columns.map(lambda col: tuple(col.split("_")))
>>> df
S1 S2
1d 2d 3d 1d 2d 3d 1d 2d 3d 1d 2d 3d 1d 2d 3d 1d 2d 3d
A A A B B B C C C A A A B B B C C C
1657 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
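As a side note, the same 3-level MultiIndex can also be obtained from the original flat labels with the string accessor, which is just a different spelling of the same split:
>>> df.columns = df.columns.str.split("_", expand=True)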
Then turn the first two column levels into index levels using DataFrame.stack, and drop the old index label using DataFrame.droplevel:
>>> df = df.stack(level=[0,1]).droplevel(0)
>>> df
A B C
S1 1d 1 4 7
2d 2 5 8
3d 3 6 9
S2 1d 10 13 16
2d 11 14 17
3d 12 15 18
Finally, group the resulting DataFrame by the first index level (S1 and S2), and create a dictionary of the form {'S1': df_S1, 'S2': df_S2}.
>>> df_dict = {Si: group.droplevel(0) for Si, group in df.groupby(level=0)}
>>> df_dict['S1']
A B C
1d 1 4 7
2d 2 5 8
3d 3 6 9
>>> df_dict['S2']
A B C
1d 10 13 16
2d 11 14 17
3d 12 15 18
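Putting the three steps together as a small helper (the function and file names below are made up; it assumes the csv is read with the unique attribute as the index):
import pandas as pd

def unflatten(row_df):
    # row_df: a one-row slice of the wide csv frame, e.g. wide_df.loc[[1657]]
    wide = row_df.copy()
    wide.columns = wide.columns.map(lambda col: tuple(col.split("_")))  # step 1
    long = wide.stack(level=[0, 1]).droplevel(0)                        # step 2
    return {Si: g.droplevel(0) for Si, g in long.groupby(level=0)}      # step 3

wide_df = pd.read_csv("my_file.csv", index_col=0)   # placeholder file name
df_dict = unflatten(wide_df.loc[[1657]])            # {'S1': ..., 'S2': ...}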