I have thousands of CSV files named file_x_x.csv, where x is a number between 1 and 10000, all in the same folder. Each file contains a header and one row of data:
file_1_1.csv
Name Surname Age Address
Michael O'Donnel 22 George St.
file_2_2.csv
Name Surname Age Address
Mary Jane 34 Camden St.
and so on.
I am looking to create one single file containing all these rows:
final_file.csv
Name Surname Age Address
Michael O'Donnel 22 George St.
Mary Jane 34 Camden St.
...
My approach:
import pandas as pd
import glob

path = # add path
all_files = glob.glob(path + ".csv")  # look for all the csv files in that folder. Probably this is not the right code for looking at them

file_list = []
for filename in all_files:
    df = pd.read_csv(filename)
    file_list.append(df)
I do not know how to create one single file at the end. Can you have a look at the code above, tell me how to get the desired output, and point out anything I missed?
CodePudding user response:
You don't need to do anything complicated here. You know the header line, and you know that the final file should be everything except the repeated headers. Just open each file, skip the first line, and write the rest out. This is far more efficient than holding a pile of DataFrames in memory.
import glob

with open("final_file.csv", "w") as outfile:
    for count, filename in enumerate(glob.glob(path + "*.csv")):  # path as in your question
        with open(filename) as infile:
            header = next(infile)        # first line is the header
            if count == 0:
                outfile.write(header)    # keep the header from the first file only
            line = next(infile)          # the single data row
            if not line.endswith("\n"):  # ensure a trailing newline
                line = line + "\n"
            outfile.write(line)
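If some of your files ever contain more than one data row, the same idea extends naturally: loop over the remaining lines instead of reading exactly one. A minimal sketch, assuming the same path variable as above and that a sorted file order is acceptable (note sorted() is lexicographic, so file_10_10.csv comes before file_2_2.csv):
import glob
import os

with open("final_file.csv", "w") as outfile:
    # sorted() makes the row order deterministic across runs
    for count, filename in enumerate(sorted(glob.glob(os.path.join(path, "*.csv")))):
        with open(filename) as infile:
            header = next(infile)
            if count == 0:
                outfile.write(header)
            for line in infile:              # copy every data row, not just the first
                if not line.endswith("\n"):  # ensure a trailing newline
                    line = line + "\n"
                outfile.write(line)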
CodePudding user response:
I'd suggest using pd.concat to combine the DataFrames into one large DataFrame, which you can then save to a different file if you wish.
Before you concatenate the DataFrames, you may have to modify the call to pd.read_csv to ensure that the data is being processed correctly. If the example data in your question matches the contents of the CSV files verbatim, then the code snippet would look something like this:
import pandas as pd
import glob
path = "/my_path" # set this to the folder containing CSVs
names = glob.glob(path "*.csv") # get names of all CSV files under path
# If your CSV files use commas to split fields, then the sep
# argument can be ommitted or set to ","
file_list = pd.concat([pd.read_csv(filename, sep=" ") for filename in names])
#save the DataFrame to a file
file_list.to_csv("combined_data.csv")
Note that each row in the combined DataFrame will still carry the index it had in its source file, creating duplicate row indices. To change that, call pd.DataFrame.reset_index():
file_list = file_list.reset_index(drop=True)
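Alternatively, you can avoid the duplicate indices in one step. A minimal sketch reusing the names above: ignore_index in pd.concat and index=False in to_csv are both standard pandas options.
# ignore_index=True renumbers the rows while concatenating,
# and index=False keeps the index out of the output file entirely
file_list = pd.concat(
    [pd.read_csv(filename, sep=" ") for filename in names],
    ignore_index=True,
)
file_list.to_csv("combined_data.csv", index=False)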