I have a .csv file containing one million rows. Each row corresponds to two flattened DataFrames, and the index corresponds to a unique attribute.
For example, the first two rows of my .csv file look like this:
| | S1_1d_A | S1_2d_A | S1_3d_A | S1_1d_B | S1_2d_B | S1_3d_B | S1_1d_C | S1_2d_C | S1_3d_C | S2_1d_A | S2_2d_A | S2_3d_A | S2_1d_B | S2_2d_B | S2_3d_B | S2_1d_C | S2_2d_C | S2_3d_C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1657 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
The naming convention for the columns is SX_R_C, where:
- X: the type of DataFrame for the corresponding index (each index has two DataFrames)
- R: the corresponding row in the original DataFrame
- C: the corresponding column in the original DataFrame
I would like to recreate the corresponding DataFrames for each row.
So, for example, for the index 1657 I would like to obtain the two following DataFrames:
For S1:

| | A | B | C |
|---|---|---|---|
| 1d | 1 | 4 | 7 |
| 2d | 2 | 5 | 8 |
| 3d | 3 | 6 | 9 |
For S2:

| | A | B | C |
|---|---|---|---|
| 1d | 10 | 13 | 16 |
| 2d | 11 | 14 | 17 |
| 3d | 12 | 15 | 18 |
I could do this very easily with loops, but with one million rows the execution time would be too high. Is there a way to do it without looping?
N.B. 1: I think the method used to create my .csv file was similar to this one: Flatten DataFrame into a single row
N.B. 2: In reality I have more than 3 rows and 3 columns; it's around 10x10.
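Here is a minimal snippet that rebuilds the example row shown above (in reality the frame would come from the .csv; the file name below is just a placeholder):
import pandas as pd

# columns follow the SX_R_C convention described above
cols = [f"S{s}_{r}d_{c}" for s in (1, 2) for c in "ABC" for r in (1, 2, 3)]
df = pd.DataFrame([list(range(1, 19))], columns=cols, index=[1657])
# in reality: df = pd.read_csv("my_file.csv", index_col=0)   # placeholder file name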
Thank you for your help!
CodePudding user response:
You can rework the column index into a MultiIndex; the rest is a simple reshaping (here using stack) and grouping to split the dataframe:
# split each label on '_' and use the pieces as a 3-level MultiIndex (S, row, column)
df.columns = pd.MultiIndex.from_arrays(zip(*df.columns.map(lambda x: x.split('_'))))
# stack the S and row levels into the index, drop the csv index, then split by S1/S2
dfs = {k: g.droplevel(0) for k, g in df.stack(level=[0,1]).droplevel(0).groupby(level=0)}
output:
>>> dfs
{'S1': A B C
1d 1 4 7
2d 2 5 8
3d 3 6 9,
'S2': A B C
1d 10 13 16
2d 11 14 17
3d 12 15 18}
>>> dfs['S1']
A B C
1d 1 4 7
2d 2 5 8
3d 3 6 9
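Note that the droplevel(0) right after stack throws away the csv index (1657), so the dictionary above handles one row at a time. If you need the reconstructed frames for every row of the full file, here is a sketch of the same idea that keeps the csv index and groups on it together with the S level (the key layout is my own choice):
# keep the csv index: one small frame per (csv_index, 'S1'/'S2') pair
long = df.stack(level=[0, 1])            # index levels: (csv_index, S, row label)
dfs_all = {key: g.droplevel([0, 1]) for key, g in long.groupby(level=[0, 1])}
frame_1657_s1 = dfs_all[(1657, 'S1')]    # same 3x3 (or 10x10) frame as dfs['S1'] above
With one million rows that dictionary holds two million small frames, so you may prefer to fetch a single group on demand with long.groupby(level=[0, 1]).get_group((1657, 'S1')) instead of materialising them all.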
CodePudding user response:
First, convert the column headers into a MultiIndex with 3 levels by splitting the column labels.
>>> df.columns = df.columns.map(lambda col: tuple(col.split("_")))
>>> df
S1 S2
1d 2d 3d 1d 2d 3d 1d 2d 3d 1d 2d 3d 1d 2d 3d 1d 2d 3d
A A A B B B C C C A A A B B B C C C
1657 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
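As a side note, the same 3-level MultiIndex can also be obtained from the original flat labels with the string accessor, which is just a different spelling of the same split:
>>> df.columns = df.columns.str.split("_", expand=True)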
Then turn the first two column levels into index levels using DataFrame.stack, and drop the old index label using DataFrame.droplevel:
>>> df = df.stack(level=[0,1]).droplevel(0)
>>> df
A B C
S1 1d 1 4 7
2d 2 5 8
3d 3 6 9
S2 1d 10 13 16
2d 11 14 17
3d 12 15 18
Finally, group the resulting DataFrame by the first index level (S1 and S2), and create a dictionary of the form {'S1': df_S1, 'S2': df_S2}.
>>> df_dict = {Si: group.droplevel(0) for Si, group in df.groupby(level=0)}
>>> df_dict['S1']
A B C
1d 1 4 7
2d 2 5 8
3d 3 6 9
>>> df_dict['S2']
A B C
1d 10 13 16
2d 11 14 17
3d 12 15 18
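Putting the three steps together as a small helper (the function and file names below are made up; it assumes the csv is read with the unique attribute as the index):
import pandas as pd

def unflatten(row_df):
    # row_df: a one-row slice of the wide csv frame, e.g. wide_df.loc[[1657]]
    wide = row_df.copy()
    wide.columns = wide.columns.map(lambda col: tuple(col.split("_")))  # step 1
    long = wide.stack(level=[0, 1]).droplevel(0)                        # step 2
    return {Si: g.droplevel(0) for Si, g in long.groupby(level=0)}      # step 3

wide_df = pd.read_csv("my_file.csv", index_col=0)   # placeholder file name
df_dict = unflatten(wide_df.loc[[1657]])            # {'S1': ..., 'S2': ...}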