Breaking the dataframe when particular string is found and creating multiple dataframes from the sam-CodePudding

the data which i have is in below format :

col_1         col_2                            col_3

NaN            NaN                              NaN
Date         21-04-2022                         NaN
Id            Name                            status
01            A11                              Pass
02            A22                              F_1
03            A33                              P_2
SUMMARY    'Total :$20  Approved $ 10'         NaN
NaN            NaN                             NaN
Date         22-04-2022                        NaN
Id            Name                           status
04            A12                              P_2
05            A23                              F_1
06            A34                              P_2
SUMMARY    'Total :$30  Approved $ 20'         NaN

Expected Output : df_1 -

Id            Name                            status
01            A11                              Pass
02            A22                              F_1
03            A33                              P_2
SUMMARY    'Total :$20  Approved $ 10'         NaN

df_2 -

Id            Name                           status
04            A12                              P_2
05            A23                              F_1
06            A34                              P_2
SUMMARY    'Total :$30  Approved $ 20'         NaN

Above is just the sample data. Actual Number of columns which i have is around 24K. thus many number of df's will be created how it can be approached..?

CodePudding user response：

You can use:

grp = df['col_1'].eq('Id').cumsum()  # create virtual groups
msk = ~df.isna().all(axis=1) & df['col_1'].ne('Date')  # keep wanted rows

# create a dict with subset dataframes
dfs = {f'df{name}': pd.DataFrame(temp.values[1:], columns=temp.iloc[0].tolist()) 
           for name, temp in df[msk].groupby(grp)}

Output:

>>> dfs['df1']
        Id                       Name status
0       01                        A11   Pass
1       02                        A22    F_1
2       03                        A33    P_2
3  SUMMARY  Total :$20  Approved $ 10    NaN

>>> dfs['df2']
        Id                       Name status
0       04                        A12    P_2
1       05                        A23    F_1
2       06                        A34    P_2
3  SUMMARY  Total :$30  Approved $ 20    NaN

Update: export to excel:

with pd.ExcelWriter('data.xlsx') as writer:
    for name, temp in dfs.items():
        temp.to_excel(writer, index=False, sheet_name=name)

CodePudding user response：

You could create an auxiliary column of booleans and use that to slice the dataframe into smaller parts:

import pandas as pd
df = pd.DataFrame({'col_1': [1,2,'Id',3,4,5,'SUMMARY',1,2,'Id',3,4,5,'SUMMARY']})

mask = df['col_1'].eq('Id') | df['col_1'].eq('SUMMARY').shift()
df['group_id'] = mask.cumsum()
dfs = list()
for group_id in df['group_id'].unique():
    if group_id % 2 != 0:
        dfs.append(df[df['group_id'].eq(group_id)])

print(dfs[0])
print(dfs[1])

CodePudding user response：

import numpy as np

df_1 = pd.DataFrame(data = np.array(df.iloc[3:7]),columns=np.array(df.iloc[2:3])[0])
df_2 = pd.DataFrame(data = np.array(df.iloc[11:15]),columns=np.array(df.iloc[10:11])[0])

CodePudding user response：

Highly inspired by piRSquared's answer here, you can approach your goal like this :

import pandas as pd
import numpy as np

df.columns = ["Id", "Name", "Status"]

# is the row a Margin ?
m = df["Id"].eq("SUMMARY")

l_df = list(filter(lambda d: not d.empty, np.split(df, np.flatnonzero(m)   1)))

_ = [exec(f"globals()['df_{idx}'] = df.reset_index(drop=True) \
                                      .loc[3:].reset_index(drop=True)")
     for idx, df in enumerate(l_df, start=1)]

NB : We used globals to create the variables/sub-dataframes dynamically.

# Output :

print(len(l_df), "DataFrames was created!")
2 DataFrames was created!

print(df_1, type(df_1)), print(df_2, type(df_2)))
    
        Id                         Name Status
0       04                          A12    P_2
1       05                          A23    F_1
2       06                          A34    P_2
3  SUMMARY   'Total :$30 Approved $ 20'    NaN <class 'pandas.core.frame.DataFrame'>

        Id                         Name Status
0       01                          A11   Pass
1       02                          A22    F_1
2       03                          A33    P_2
3  SUMMARY   'Total :$20 Approved $ 10'    NaN <class 'pandas.core.frame.DataFrame'>