Split a dataframe with multiple header rows into unique dataframes


I have some annoying csv files with multiple headers of different lengths that look something like this:

import pandas as pd

data = {'Line': ['0', '0', 'Line', '0', '0'], 'Date': ['8/25/2021', '8/25/2021', 'Date', '8/25/2021', '8/25/2021'],
        'LibraryFile':['PSI_210825_G2_ASD4_F.LIB','PSI_210825_G2_ASD4_F.LIB','LibraryFile','PSI_210825_G2_ASD3.LIB','PSI_210825_G2_ASD3.LIB']}
# Create DataFrame.
df = pd.DataFrame(data)
# Print the output.
print(df)

   Line       Date               LibraryFile
0     0  8/25/2021  PSI_210825_G2_ASD4_F.LIB
1     0  8/25/2021  PSI_210825_G2_ASD4_F.LIB
2  Line       Date               LibraryFile
3     0  8/25/2021    PSI_210825_G2_ASD3.LIB
4     0  8/25/2021    PSI_210825_G2_ASD3.LIB

Each "header" Lines has different column names after the LibraryFile column so what I want to do is split up the files at each "Line" row and retain that row as the new header with the data underneath it. I've tried look at options that use split functions but with no luck. Currently I'm trying to use the LibraryFile column which is unique for each chunk of data. I've tried using the pandas groupby function

grouped = df.groupby(df['LibraryFile'])
path_to_directory = 'filepath'
for lib in df['LibraryFile'].unique():
    temporary_df = grouped.get_group(lib)
    temporary_df.to_csv(f'{path_to_directory}/{lib}.csv')

This gets me a chunk of data but I can't figure out how best to go from here to retain the "Line" row as the new header for all data chunks.

I also tried numpy:

dfs = np.split(df, np.flatnonzero(df[0] == 'Line'))
print(*dfs, sep='\n\n')

But this just throws an error. I'm unfortunately relearning Python after a long time of not using it, so I'm sure there's a solution I'm just ignorant of. Thank you.

CodePudding user response:

Here's a solution that will split up the dataframe based on every occurrence of the column names in the rows:

# True where a cell repeats its own column name, i.e. the embedded header rows
f = df.eq(df.columns)
# drop the header rows and group the remaining rows by how many header rows precede them
groups = [g.reset_index(drop=True) for _, g in df[~f.iloc[:, 0]].groupby(f.cumsum()[~f.iloc[:, 0]].iloc[:, 0])]

Output:

>>> groups
[  Line       Date               LibraryFile
 0    0  8/25/2021  PSI_210825_G2_ASD4_F.LIB
 1    0  8/25/2021  PSI_210825_G2_ASD4_F.LIB,
   Line       Date             LibraryFile
 0    0  8/25/2021  PSI_210825_G2_ASD3.LIB
 1    0  8/25/2021  PSI_210825_G2_ASD3.LIB]
 
>>> groups[0]
  Line       Date               LibraryFile
0    0  8/25/2021  PSI_210825_G2_ASD4_F.LIB
1    0  8/25/2021  PSI_210825_G2_ASD4_F.LIB

>>> groups[1]
  Line       Date             LibraryFile
0    0  8/25/2021  PSI_210825_G2_ASD3.LIB
1    0  8/25/2021  PSI_210825_G2_ASD3.LIB
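
If the goal is to write each chunk to its own file, as in the original groupby attempt, a minimal sketch could be the following (the 'filepath' directory is only a placeholder):

for i, g in enumerate(groups):
    # one CSV per chunk, e.g. filepath/chunk_0.csv, filepath/chunk_1.csv, ...
    g.to_csv(f'filepath/chunk_{i}.csv', index=False)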

CodePudding user response:

You could use to_numeric + isna + cumsum on the Line column and group by the result. The Line column should hold numeric values, but the "header" rows don't, so to_numeric coerces them to NaN, which can be used to group the rows.
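
To see what that key looks like on its own, here is a quick check against the question's sample data:

key = pd.to_numeric(df['Line'], errors='coerce').isna().cumsum()
print(key.tolist())  # [0, 0, 1, 1, 1] -> the embedded header row starts group 1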

Then, in a loop, modify each group to promote its first row to the column names, etc.

# NaN from to_numeric marks the embedded header rows; cumsum numbers the chunks
for k, d in df.groupby(pd.to_numeric(df['Line'], errors='coerce').isna().cumsum()):
    if d['Line'].iat[0] == 'Line':
        # promote the embedded header row to column names and drop it from the data
        d = pd.DataFrame(d.iloc[1:], columns=d.iloc[0]).reset_index(drop=True).rename_axis(columns=[None])
    # d.to_csv(...)
    print(d, end='\n\n')

Output:

  Line       Date               LibraryFile
0    0  8/25/2021  PSI_210825_G2_ASD4_F.LIB
1    0  8/25/2021  PSI_210825_G2_ASD4_F.LIB

  Line       Date             LibraryFile
0    0  8/25/2021  PSI_210825_G2_ASD3.LIB
1    0  8/25/2021  PSI_210825_G2_ASD3.LIB
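
Since the LibraryFile value is unique within each chunk (as noted in the question), one way to fill in the commented-out to_csv step is to name each file after it; this is just a sketch, and 'filepath' is a placeholder directory:

for k, d in df.groupby(pd.to_numeric(df['Line'], errors='coerce').isna().cumsum()):
    if d['Line'].iat[0] == 'Line':
        d = pd.DataFrame(d.iloc[1:], columns=d.iloc[0]).reset_index(drop=True).rename_axis(columns=[None])
    # LibraryFile is unique per chunk, so it makes a reasonable filename
    d.to_csv(f"filepath/{d['LibraryFile'].iat[0]}.csv", index=False)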