The data I have is in the below format:
col_1 col_2 col_3
NaN NaN NaN
Date 21-04-2022 NaN
Id Name status
01 A11 Pass
02 A22 F_1
03 A33 P_2
SUMMARY 'Total :$20 Approved $ 10' NaN
NaN NaN NaN
Date 22-04-2022 NaN
Id Name status
04 A12 P_2
05 A23 F_1
06 A34 P_2
SUMMARY 'Total :$30 Approved $ 20' NaN
Expected Output : df_1 -
Id Name status
01 A11 Pass
02 A22 F_1
03 A33 P_2
SUMMARY 'Total :$20 Approved $ 10' NaN
df_2 -
Id Name status
04 A12 P_2
05 A23 F_1
06 A34 P_2
SUMMARY 'Total :$30 Approved $ 20' NaN
The above is just sample data. The actual number of columns I have is around 24K, so a large number of DataFrames will be created. How can this be approached?
CodePudding user response:
You can use:
grp = df['col_1'].eq('Id').cumsum()                    # create virtual groups
msk = ~df.isna().all(axis=1) & df['col_1'].ne('Date')  # keep wanted rows

# create a dict with one sub-dataframe per group
dfs = {f'df{name}': pd.DataFrame(temp.values[1:], columns=temp.iloc[0].tolist())
       for name, temp in df[msk].groupby(grp)}
Output:
>>> dfs['df1']
Id Name status
0 01 A11 Pass
1 02 A22 F_1
2 03 A33 P_2
3 SUMMARY Total :$20 Approved $ 10 NaN
>>> dfs['df2']
Id Name status
0 04 A12 P_2
1 05 A23 F_1
2 06 A34 P_2
3 SUMMARY Total :$30 Approved $ 20 NaN
Update: export to Excel:
with pd.ExcelWriter('data.xlsx') as writer:
    for name, temp in dfs.items():
        temp.to_excel(writer, index=False, sheet_name=name)
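For reference, here is this answer's approach as one runnable sketch; the frame below is reconstructed from the question's sample data (column names col_1/col_2/col_3 as shown there):

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample: blank row, Date row, 'Id' header row,
# data rows, then a SUMMARY row -- repeated once per block.
df = pd.DataFrame({
    'col_1': [np.nan, 'Date', 'Id', '01', '02', '03', 'SUMMARY',
              np.nan, 'Date', 'Id', '04', '05', '06', 'SUMMARY'],
    'col_2': [np.nan, '21-04-2022', 'Name', 'A11', 'A22', 'A33',
              "'Total :$20 Approved $ 10'",
              np.nan, '22-04-2022', 'Name', 'A12', 'A23', 'A34',
              "'Total :$30 Approved $ 20'"],
    'col_3': [np.nan, np.nan, 'status', 'Pass', 'F_1', 'P_2', np.nan,
              np.nan, np.nan, 'status', 'P_2', 'F_1', 'P_2', np.nan],
})

grp = df['col_1'].eq('Id').cumsum()                    # new group at each 'Id' header
msk = ~df.isna().all(axis=1) & df['col_1'].ne('Date')  # drop blank and Date rows

# first row of each group becomes the header, the rest becomes the data
dfs = {f'df{name}': pd.DataFrame(temp.values[1:], columns=temp.iloc[0].tolist())
       for name, temp in df[msk].groupby(grp)}

print(dfs['df1'])
print(dfs['df2'])
```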
CodePudding user response:
You could create an auxiliary column of booleans and use that to slice the dataframe into smaller parts:
import pandas as pd
df = pd.DataFrame({'col_1': [1,2,'Id',3,4,5,'SUMMARY',1,2,'Id',3,4,5,'SUMMARY']})
mask = df['col_1'].eq('Id') | df['col_1'].eq('SUMMARY').shift(fill_value=False)
df['group_id'] = mask.cumsum()
dfs = list()
for group_id in df['group_id'].unique():
    if group_id % 2 != 0:
        dfs.append(df[df['group_id'].eq(group_id)])
print(dfs[0])
print(dfs[1])
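The same idea can be wrapped up so the helper column is dropped from each slice; a sketch on the same toy frame (a block opens at 'Id' and the row after 'SUMMARY' opens the gap, so odd group ids mark the blocks to keep):

```python
import pandas as pd

# Toy frame: 'Id' starts a block, 'SUMMARY' ends it, other values are noise.
df = pd.DataFrame({'col_1': [1, 2, 'Id', 3, 4, 5, 'SUMMARY',
                             1, 2, 'Id', 3, 4, 5, 'SUMMARY']})

# fill_value=False keeps the first shifted entry boolean instead of NaN
mask = df['col_1'].eq('Id') | df['col_1'].eq('SUMMARY').shift(fill_value=False)
df['group_id'] = mask.cumsum()

# keep only the odd groups (the blocks) and drop the helper column
dfs = [g.drop(columns='group_id')
       for gid, g in df.groupby('group_id') if gid % 2 != 0]

print(len(dfs))
```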
CodePudding user response:
import pandas as pd
import numpy as np

# hard-coded slices: rows 3-6 belong to the first block, rows 11-14 to the second
df_1 = pd.DataFrame(data=np.array(df.iloc[3:7]), columns=np.array(df.iloc[2:3])[0])
df_2 = pd.DataFrame(data=np.array(df.iloc[11:15]), columns=np.array(df.iloc[10:11])[0])
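These two slices are hard-coded to this sample's row positions; one way to generalise them is to locate every 'Id'/'SUMMARY' pair and slice between them. A sketch under that assumption, on a frame rebuilt from the question's sample:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_1': [np.nan, 'Date', 'Id', '01', '02', '03', 'SUMMARY',
              np.nan, 'Date', 'Id', '04', '05', '06', 'SUMMARY'],
    'col_2': [np.nan, '21-04-2022', 'Name', 'A11', 'A22', 'A33', 'Total',
              np.nan, '22-04-2022', 'Name', 'A12', 'A23', 'A34', 'Total'],
    'col_3': [np.nan, np.nan, 'status', 'Pass', 'F_1', 'P_2', np.nan,
              np.nan, np.nan, 'status', 'P_2', 'F_1', 'P_2', np.nan],
})

starts = np.flatnonzero(df['col_1'].eq('Id'))       # header row of each block
ends = np.flatnonzero(df['col_1'].eq('SUMMARY'))    # last row of each block

# slice each (header, SUMMARY] pair; the header row supplies the columns
frames = [pd.DataFrame(df.iloc[s + 1:e + 1].to_numpy(),
                       columns=df.iloc[s].tolist())
          for s, e in zip(starts, ends)]
```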
CodePudding user response:
Highly inspired by piRSquared's answer here, you can approach your goal like this:
import pandas as pd
import numpy as np
df.columns = ["Id", "Name", "Status"]
# is the row a SUMMARY row?
m = df["Id"].eq("SUMMARY")
l_df = list(filter(lambda d: not d.empty, np.split(df, np.flatnonzero(m) + 1)))
_ = [exec(f"globals()['df_{idx}'] = df.reset_index(drop=True)"
          ".loc[3:].reset_index(drop=True)")
     for idx, df in enumerate(l_df, start=1)]
NB: We used globals to create the variables/sub-dataframes dynamically.
# Output :
print(len(l_df), "DataFrames were created!")
2 DataFrames were created!
print(df_1, type(df_1)), print(df_2, type(df_2))
Id Name Status
0 01 A11 Pass
1 02 A22 F_1
2 03 A33 P_2
3 SUMMARY 'Total :$20 Approved $ 10' NaN <class 'pandas.core.frame.DataFrame'>
Id Name Status
0 04 A12 P_2
1 05 A23 F_1
2 06 A34 P_2
3 SUMMARY 'Total :$30 Approved $ 20' NaN <class 'pandas.core.frame.DataFrame'>