Home > Back-end >  Extract multiple dataframes from single .txt file
Extract multiple dataframes from single .txt file

Time:08-10

I have a file that contains several dataframes (each dataframe has its own variables) I want to extract all dataframes from this file and save them in separate files Is there a way to do this with pandas?

df.txt

   col1,  col2,
0,     0,   NaN,
1,     1,   NaN,
2,     2,   2.0,
3,     3,   3.0,
   col3,  col4, col5,
4,     0,   1, 0,
5,     1,   NaN, 1,
   col6,  col7,  
6,     0,   NaN, 
7,     1,   3, 

Desired result :

df1.csv

   col1,  col2,
0,     0,   NaN,
1,     1,   NaN,
2,     2,   2.0,
3,     3,   3.0,

df2.csv

   col3,  col4, col5,
4,     0,   1, 0,
5,     1,   NaN, 1,

df3.csv

   col6,  col7,  
6,     0,   NaN, 
7,     1,   3, 

CodePudding user response:

There's no way to do this in pandas directly but you can easily do this if you know how to recognize the starting line of a new dataframe within the file. If the file really looks like you posted and a new dataframe starts with some spaces (could be a tab as well) then the following code will create new files for each of the dataframes within the file.

import re


def create_csv(index, lines):
    with open(f"dfs{index}.csv", "w") as new_file:
        new_file.writelines(lines)


if __name__ == "__main__":
    with open("df.txt", "r") as f:
        lines = f.readlines()
        df_start_indexes = []
        for i, l in enumerate(lines):
            if re.match(r"\s", l):
                df_start_indexes.append(i)

        print(df_start_indexes)
        for df_i, start_line in enumerate(df_start_indexes):
            end_line = df_start_indexes[df_i   1] if len(df_start_indexes) > df_i   1 else None
            create_csv(df_i, lines[start_line:end_line])

CodePudding user response:

df = pd.read_csv('df.txt', header=None)

csv_starts = df.loc[df[0].str.strip().str.startswith('col')].index.to_list()
for i,(a,b) in enumerate(zip(csv_starts, csv_starts[1:] [None])):
    df.iloc[a:b].to_csv(f'df{i}.csv', index=False, header=False)
  1. Read the whole file as csv, but there are no commas in there, so it will be a 1-column DataFrame:

    df = pd.read_csv('df.txt', header=None)

  2. Get the indices of new-file breaks - the rows with col-names in them:

    csv_starts = df.loc[df[0].str.strip().str.startswith('col')].index.to_list()

  3. Build index-ranges and save them to file:

    for i,(a,b) in enumerate(zip(csv_starts, csv_starts[1:] [None])): df.iloc[a:b].to_csv(f'df{i}.csv', index=False, header=False)

  • Related