I have a folder /data/csvs which contains ~7000 CSV files, each with ~600 lines. Each CSV has a name containing a timestamp that needs to be preserved, e.g. /data/csvs/261121.csv, /data/csvs/261122.csv (261121 being 26/11/21, today's date).
I need to:
- Load each CSV.
- Add a column in which the timestamp is saved so I know which file the data came from. The time increases by half a second each row, so this column also needs to show the hour/minute/second/microseconds.
- Combine the rows into one table which will span a month of data.
- Ideally I'd like the final product to be a DataFrame.
Currently this is what I'm doing:
import os
import pandas as pd

csv_folder_path = '/data/csvs'
# timestamp() and processed_path (the output feather file) are defined elsewhere

files = os.listdir(csv_folder_path)
csv_names = []
for file_name in files:
    if file_name[-4:] == '.csv':
        csv_names.append(file_name)

to_process = len(csv_names)
for i, csv_name in enumerate(csv_names):
    df = pd.read_csv(f'{csv_folder_path}/{csv_name}')  # csv_name, not file_name, is the current file
    df = timestamp(df, csv_name)  # adds the timestamp column derived from the file name
    to_process -= 1
    if i == 0:
        concat_df = df
        concat_df.to_feather(path=processed_path)
    else:
        concat_df = pd.concat([concat_df, df])
    if to_process % 100 == 0:
        # every 100 files, merge with what has already been written and re-save
        saved_df = pd.read_feather(path=processed_path)
        concat_df = pd.concat([saved_df, concat_df])
        concat_df.reset_index(drop=True, inplace=True)
        concat_df.to_feather(path=processed_path)
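For context, timestamp() boils down to something like this (a simplified sketch; the file stem is DDMMYY and each row is half a second after the previous one):

import numpy as np
import pandas as pd

def timestamp(df, csv_name):
    # The file stem (e.g. "261121") gives the date in DDMMYY form
    start = pd.to_datetime(csv_name[:-4], format='%d%m%y')
    # Rows are half a second apart, so offset each row from that date
    df['timestamp'] = start + pd.to_timedelta(np.arange(len(df)) * 0.5, unit='s')
    return df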
I'm loading each CSV as a DataFrame, adding the timestamp column, and concatenating the CSVs 100 at a time (because I thought this would reduce memory usage), then appending each batch of 100 to a large feather file. This is really slow and uses loads of memory.
What is a more efficient way of doing this?
CodePudding user response:
First, you could be more efficient loading your files using glob. This saves you iterating over all the files and checking whether each file's extension is ".csv":
import glob
import os

src = '/data/csvs'
# iglob yields matching paths lazily instead of building a full list
files = glob.iglob(os.path.join(src, "*.csv"))
Then, read each file into a DataFrame inside a generator expression, in the same step assigning the basename of the file (without the extension) to a column named timestamp:
df_from_each_file = (pd.read_csv(f).assign(timestamp=os.path.basename(f).split('.')[0]) for f in files)
And finally, concatenate the dfs into one:
csv_data = pd.concat(df_from_each_file, ignore_index=True)
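One thing to note: the question also needs the time within each file (rows are half a second apart), not just the date from the file name. Here is a minimal sketch of how you could build a full per-row datetime in the same step, assuming the file stem encodes DDMMYY (read_with_timestamps is just an illustrative name):

import glob
import os

import numpy as np
import pandas as pd

def read_with_timestamps(path):
    # Read one CSV and add a per-row datetime column: the DDMMYY file stem
    # gives the date, and each row advances by half a second.
    df = pd.read_csv(path)
    start = pd.to_datetime(os.path.basename(path).split('.')[0], format='%d%m%y')
    df['timestamp'] = start + pd.to_timedelta(np.arange(len(df)) * 0.5, unit='s')
    return df

csv_data = pd.concat(
    (read_with_timestamps(f) for f in glob.iglob(os.path.join(src, '*.csv'))),
    ignore_index=True,
)

You can then write the combined frame to feather once at the end (csv_data.to_feather(processed_path)) instead of re-reading and re-writing it inside the loop.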
Hope this helped! I have used a process like this for large amounts of data and found it efficient enough.