I have some annoying csv files with multiple headers of different lengths that look something like this:
import pandas as pd

data = {
    'Line': ['0', '0', 'Line', '0', '0'],
    'Date': ['8/25/2021', '8/25/2021', 'Date', '8/25/2021', '8/25/2021'],
    'LibraryFile': ['PSI_210825_G2_ASD4_F.LIB', 'PSI_210825_G2_ASD4_F.LIB', 'LibraryFile',
                    'PSI_210825_G2_ASD3.LIB', 'PSI_210825_G2_ASD3.LIB'],
}
# Create DataFrame.
df = pd.DataFrame(data)
# Print the output.
print(df)
Line Date LibraryFile
0 0 8/25/2021 PSI_210825_G2_ASD4_F.LIB
1 0 8/25/2021 PSI_210825_G2_ASD4_F.LIB
2 Line Date LibraryFile
3 0 8/25/2021 PSI_210825_G2_ASD3.LIB
4 0 8/25/2021 PSI_210825_G2_ASD3.LIB
Each "header" Lines has different column names after the LibraryFile column so what I want to do is split up the files at each "Line" row and retain that row as the new header with the data underneath it. I've tried look at options that use split functions but with no luck. Currently I'm trying to use the LibraryFile column which is unique for each chunk of data. I've tried using the pandas groupby function
grouped = df.groupby(df['LibraryFile'])
path_to_directory = 'filepath'
for lib in df['LibraryFile'].unique():
    temporary_df = grouped.get_group(lib)
    temporary_df.to_csv(f'filepath/temp.csv')
This gets me a chunk of data, but I can't figure out how best to go from there to retain the "Line" row as the new header for each data chunk.
I also tried numpy:
dfs = np.split(df, np.flatnonzero(df[0] == 'Line'))
print(*dfs, sep='\n\n')
But this just throws an error. I'm unfortunately relearning Python after a long time of not using it, so I'm sure there's a solution I'm just not aware of. Thank you!
CodePudding user response:
Here's a solution that will split up the dataframe based on every occurrence of the column names in the rows:
# True where a cell equals its own column name, i.e. the repeated header rows
f = df.eq(df.columns)
# Drop the header rows and group the remaining rows by how many header rows precede them
groups = [g.reset_index(drop=True) for _, g in df[~f.iloc[:, 0]].groupby(f.cumsum()[~f.iloc[:, 0]].iloc[:, 0])]
Output:
>>> groups
[ Line Date LibraryFile
0 0 8/25/2021 PSI_210825_G2_ASD4_F.LIB
1 0 8/25/2021 PSI_210825_G2_ASD4_F.LIB,
Line Date LibraryFile
0 0 8/25/2021 PSI_210825_G2_ASD3.LIB
1 0 8/25/2021 PSI_210825_G2_ASD3.LIB]
>>> groups[0]
Line Date LibraryFile
0 0 8/25/2021 PSI_210825_G2_ASD4_F.LIB
1 0 8/25/2021 PSI_210825_G2_ASD4_F.LIB
>>> groups[1]
Line Date LibraryFile
0 0 8/25/2021 PSI_210825_G2_ASD3.LIB
1 0 8/25/2021 PSI_210825_G2_ASD3.LIB
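If the end goal is one file per chunk, you could then write each group out. A minimal sketch; the output directory and the LibraryFile-based file names are just placeholders:

from pathlib import Path

out_dir = Path('filepath')  # placeholder output directory
out_dir.mkdir(parents=True, exist_ok=True)

for i, g in enumerate(groups):
    # Name each file after its LibraryFile value so the chunks stay identifiable
    g.to_csv(out_dir / f"{g['LibraryFile'].iat[0]}_{i}.csv", index=False)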
CodePudding user response:
You could use to_numeric + isna + cumsum on the Line column and group by the result. The Line column should contain numeric values, but the "header" rows don't, so to_numeric with errors='coerce' turns them into NaN; isna().cumsum() then increments at every header row, which gives a key you can pass to groupby. Then, in a loop, modify each group to promote its first row to the column names, etc.:
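(For reference, the grouping key for the sample data comes out as follows, putting rows 0-1 in group 0 and rows 2-4 in group 1.)

>>> pd.to_numeric(df['Line'], errors='coerce').isna().cumsum().tolist()
[0, 0, 1, 1, 1]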
for k, d in df.groupby(pd.to_numeric(df['Line'], errors='coerce').isna().cumsum()):
    if d['Line'].iat[0] == 'Line':
        # Promote the repeated header row to the column names and drop it from the data
        d = pd.DataFrame(d.iloc[1:], columns=d.iloc[0]).reset_index(drop=True).rename_axis(columns=[None])
    # d.to_csv(...)
    print(d, end='\n\n')
Output:
Line Date LibraryFile
0 0 8/25/2021 PSI_210825_G2_ASD4_F.LIB
1 0 8/25/2021 PSI_210825_G2_ASD4_F.LIB
Line Date LibraryFile
0 0 8/25/2021 PSI_210825_G2_ASD3.LIB
1 0 8/25/2021 PSI_210825_G2_ASD3.LIB
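Side note: if you're reading the real CSV rather than building df by hand, it may help to read everything as strings so the repeated header rows compare cleanly, e.g.:

df = pd.read_csv('input.csv', dtype=str)  # 'input.csv' is a placeholder for your file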