I have a Python script I am using to convert some very densely formatted CSV files into another format that I need. The CSV files are quite large (3GB), so I read them in chunks to avoid using all the RAM (I have 32GB of RAM on the machine I am using).
The odd thing is that the script processes the first file using only a few GB of memory (about 3GB based on what top says).
I finish that file and load the next file, again in chunks. Suddenly I am using 25GB, writing to swap, and the process is killed. I'm not sure what is changing between the first and second iteration. I have put in a time.sleep(60) to try to let the garbage collector catch up, but it still goes from ~10% memory to ~85% to a killed process.
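For what it's worth, sleeping by itself doesn't force a collection; a minimal sketch of what explicitly triggering the garbage collector and tracking memory per file might look like, using the standard-library gc and tracemalloc modules (with the per-file processing elided):

import gc
import tracemalloc

tracemalloc.start()

for file in files:
    # ... process one file, as in the script below ...
    gc.collect()  # explicitly collect reference cycles instead of sleeping
    current, peak = tracemalloc.get_traced_memory()
    print(f"{file}: current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
    tracemalloc.reset_peak()  # track the peak separately for each file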
Here's the main chunk of the script:
from time import sleep
import pandas as pd

# data_dict (per-sensor row buffers) and csv_dict (per-sensor output paths) are set up earlier in the script.
for file in files:
    sleep(60)
    print(file)
    read_names = True
    count = 0
    # Read each line into a single column 'all' and split on ';' manually.
    for df in pd.read_csv(file, encoding='unicode_escape', chunksize=1e4, names=['all']):
        start_index = 0
        count += 1
        if read_names:
            # First chunk only: pull the sensor names out of the header row.
            names = df.iloc[0, :].apply(lambda x: x.split(';')).values[0]
            names = names[1:]
            start_index = 2
            read_names = False
        for row in df.iloc[start_index:, :].iterrows():
            data = row[1]
            data_list = data['all'].split(';')
            date_time = data_list[0]
            values = data_list[1:]
            date, time = date_time.split(' ')
            dd, mm, yyyy = date.split('/')
            date = yyyy + '/' + mm + '/' + dd
            for name, value in zip(names, values):
                try:
                    data_dict[name].append([name, date, time, float(value)])
                except:
                    pass
        # Every 5 chunks: resample each sensor to 1-minute means and append to its output file.
        if count % 5 == 0:
            for name in names:
                start_date = data_dict[name][0][1]
                start_time = data_dict[name][0][2]
                end_date = data_dict[name][-1][1]
                end_time = data_dict[name][-1][2]
                start_dt = start_date + ' ' + start_time
                end_dt = end_date + ' ' + end_time
                dt_index = pd.date_range(start=start_dt, freq='1S', periods=len(data_dict[name]))
                df = pd.DataFrame(data_dict[name], index=dt_index)
                df = df[3].resample('1T').mean().round(10)
                with open(csv_dict[name], 'a') as ff:
                    for index, value in zip(df.index, df.values):
                        date, time = str(index).split(' ')
                        to_write = f"{name}, {date}, {time}, {value}\n"
                        ff.write(to_write)
Is there something I can do to manage this better? I need to loop over 50 large files for this task.
Data format: Input
time sensor1 sensor2 sensor3 sensor....
2022-07-01 00:00:00; 2.559;.234;0;0;0;.....
2022-07-01 00:00:01; 2.560;.331;0;0;0;.....
2022-07-01 00:00:02; 2.558;.258;0;0;0;.....
Output
sensor1, 2019-05-13, 05:58:00, 2.559
sensor1, 2019-05-13, 05:59:00, 2.560
sensor1, 2019-05-13, 06:00:00, 2.558
Edit: interesting finding - the files I am writing to are suddenly not being updated; they are several minutes behind where they should be if writing were happening as expected. The data at the tail of the file is not changing either when I check it. So I assume the data is building up in the dictionary and swamping the RAM, which makes sense. Now to understand why the writing isn't happening.
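A quick sketch of how that could be confirmed (using the data_dict and count variables from the script above) is to log how many rows the dictionary is holding at the end of each chunk:

# At the end of each chunk iteration:
held_rows = sum(len(rows) for rows in data_dict.values())
print(f"{file}: chunk {count}: {held_rows} rows held in data_dict")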
Edit 2: more interesting finds!! The script runs fine on the first CSV and through a big chunk of the second CSV before filling up the RAM and crashing. It seems the RAM problem starts with the second file, so I skipped processing that one, and magically I am now running longer than I have thus far without a memory issue. Perhaps there is corrupt data in that file that throws something off.
CodePudding user response:
Given file.csv that looks exactly like:
time sensor1 sensor2 sensor3 sensor4 sensor5
2022-07-01 00:00:00; 2.559;.234;0;0;0
2022-07-01 00:00:01; 2.560;.331;0;0;0
2022-07-01 00:00:02; 2.558;.258;0;0;0
You're doing a lot more than this, and not using proper pandas methods will kill you on time (iterrows is basically never the best option). Basically, if you're manually looping over a DataFrame, you're probably doing it wrong. But if you follow this pattern of using the chunked reader as a context manager, instead of treating it as a bare iterator, you won't have the memory issues.
import pandas as pd

files = ['file.csv']

for file in files:
    with open(file) as f:
        # Grab the columns:
        cols = f.readline().split()
        # Initialize the context-manager:
        # You'll want a larger chunksize; 1e5 should even work.
        with pd.read_csv(f, names=cols, sep=';', chunksize=1) as chunks:
            for df in chunks:
                df[['date', 'time']] = df.time.str.split(expand=True)
                df = df.melt(['date', 'time'], var_name='sensor')
                df = df[['sensor', 'date', 'time', 'value']]
                df.to_csv(f'new_{file}', mode='a', index=False, header=False)
Output of new_file.csv:
sensor1,2022-07-01,00:00:00,2.559
sensor2,2022-07-01,00:00:00,0.234
sensor3,2022-07-01,00:00:00,0.0
sensor4,2022-07-01,00:00:00,0.0
sensor5,2022-07-01,00:00:00,0.0
sensor1,2022-07-01,00:00:01,2.56
sensor2,2022-07-01,00:00:01,0.331
sensor3,2022-07-01,00:00:01,0.0
sensor4,2022-07-01,00:00:01,0.0
sensor5,2022-07-01,00:00:01,0.0
sensor1,2022-07-01,00:00:02,2.558
sensor2,2022-07-01,00:00:02,0.258
sensor3,2022-07-01,00:00:02,0.0
sensor4,2022-07-01,00:00:02,0.0
sensor5,2022-07-01,00:00:02,0.0
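Note that this output is the raw 1-second data reshaped to long format; your original script also resampled each sensor to 1-minute means. A rough sketch of how that step could be added per chunk (assuming the long-format df from the loop above; minutes that straddle a chunk boundary would be averaged separately per chunk, so boundary handling may be needed, and the output filename is just a placeholder):

import pandas as pd

# Assumes `df` is one long-format chunk from the loop above, with the
# columns ['sensor', 'date', 'time', 'value'].
df['timestamp'] = pd.to_datetime(df['date'] + ' ' + df['time'])

# 1-minute mean per sensor within this chunk.
resampled = (
    df.set_index('timestamp')
      .groupby('sensor')['value']
      .resample('1T')
      .mean()
      .round(10)
      .reset_index()
)

resampled['date'] = resampled['timestamp'].dt.strftime('%Y-%m-%d')
resampled['time'] = resampled['timestamp'].dt.strftime('%H:%M:%S')

# 'resampled_output.csv' is a placeholder name.
resampled[['sensor', 'date', 'time', 'value']].to_csv(
    'resampled_output.csv', mode='a', index=False, header=False
)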