I have a file that contains several dataframes (each dataframe has its own variables) I want to extract all dataframes from this file and save them in separate files Is there a way to do this with pandas?
df.txt
col1, col2,
0, 0, NaN,
1, 1, NaN,
2, 2, 2.0,
3, 3, 3.0,
col3, col4, col5,
4, 0, 1, 0,
5, 1, NaN, 1,
col6, col7,
6, 0, NaN,
7, 1, 3,
Desired result :
df1.csv
col1, col2,
0, 0, NaN,
1, 1, NaN,
2, 2, 2.0,
3, 3, 3.0,
df2.csv
col3, col4, col5,
4, 0, 1, 0,
5, 1, NaN, 1,
df3.csv
col6, col7,
6, 0, NaN,
7, 1, 3,
CodePudding user response:
There's no way to do this in pandas directly but you can easily do this if you know how to recognize the starting line of a new dataframe within the file. If the file really looks like you posted and a new dataframe starts with some spaces (could be a tab as well) then the following code will create new files for each of the dataframes within the file.
import re
def create_csv(index, lines):
with open(f"dfs{index}.csv", "w") as new_file:
new_file.writelines(lines)
if __name__ == "__main__":
with open("df.txt", "r") as f:
lines = f.readlines()
df_start_indexes = []
for i, l in enumerate(lines):
if re.match(r"\s", l):
df_start_indexes.append(i)
print(df_start_indexes)
for df_i, start_line in enumerate(df_start_indexes):
end_line = df_start_indexes[df_i 1] if len(df_start_indexes) > df_i 1 else None
create_csv(df_i, lines[start_line:end_line])
CodePudding user response:
df = pd.read_csv('df.txt', header=None)
csv_starts = df.loc[df[0].str.strip().str.startswith('col')].index.to_list()
for i,(a,b) in enumerate(zip(csv_starts, csv_starts[1:] [None])):
df.iloc[a:b].to_csv(f'df{i}.csv', index=False, header=False)
Read the whole file as
csv
, but there are no commas in there, so it will be a 1-column DataFrame:df = pd.read_csv('df.txt', header=None)
Get the indices of new-file breaks - the rows with
col
-names in them:csv_starts = df.loc[df[0].str.strip().str.startswith('col')].index.to_list()
Build index-ranges and save them to file:
for i,(a,b) in enumerate(zip(csv_starts, csv_starts[1:] [None])): df.iloc[a:b].to_csv(f'df{i}.csv', index=False, header=False)