I have a 13 GB CSV file and I need to read it and filter rows from it. I am using pandas and reading it in chunks, but it takes too long. Are there any other Python libraries that are faster than pandas, or would writing custom code in C be a better option?
I am using the following code:
import pandas as pd

input_df = pd.read_csv("input file", chunksize=60000)
frames = []
for i in input_df:
    # keep rows where any of the three columns contains the string
    filter_df = i[i["Column1"].str.contains("given string")
                  | i["column2"].str.contains("given string")
                  | i["column3"].str.contains("given string")]
    frame = pd.DataFrame(filter_df)
    frames.append(frame)
output_df = pd.concat(frames)
output_df.to_csv('output.csv', index=False)
I have 8 GB of RAM, so I have to read the data in chunks.
CodePudding user response:
Pandas and NumPy are built on C, so I don't see how you would gain much speed by writing pure C yourself; badly written C code might even make things worse.
Try focusing instead on improving your algorithm or the way you are currently reading the file.
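For instance, with the chunked pandas version you already have, you can append each filtered chunk straight to the output file instead of collecting every chunk in a list and concatenating at the end, so memory use stays roughly constant. A minimal sketch, assuming the same column names as your code and that "given string" is a plain substring (so regex matching can be turned off):

import pandas as pd

pattern = "given string"
first_chunk = True
# dtype=str reads every column as plain text, which skips numeric type inference
for chunk in pd.read_csv("input file", chunksize=60000, dtype=str):
    # combine the three substring checks into one boolean mask;
    # regex=False skips regex compilation for a plain substring
    mask = (chunk["Column1"].str.contains(pattern, regex=False, na=False)
            | chunk["column2"].str.contains(pattern, regex=False, na=False)
            | chunk["column3"].str.contains(pattern, regex=False, na=False))
    # append the filtered rows to the output as we go
    chunk[mask].to_csv("output.csv", mode="a", index=False, header=first_chunk)
    first_chunk = False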
If all you want to do is read a CSV file and filter the rows based on whether they contain a certain string, then I think reading it line by line would be a better approach:
import pandas as pd

# store matching rows here until the next flush to disk
result = {"col1": [], "col2": [], "col3": []}
to_check = "some string"
reset_after = 1000   # flush the buffer to the output file every 1000 lines
current_line = 0

with open('filename.csv', 'r') as fp:
    while (line := fp.readline()) != '':
        val1, val2, val3 = line.rstrip("\n").split(",")
        # keep the row if any of the three columns contains the string
        if (to_check in val1) or (to_check in val2) or (to_check in val3):
            result['col1'].append(val1)
            result['col2'].append(val2)
            result['col3'].append(val3)
        current_line += 1
        # after every reset_after lines, append the buffered rows to the
        # result file and reset the buffer so memory stays bounded
        if current_line >= reset_after:
            df = pd.DataFrame(result)
            df.to_csv("result_file.csv", mode='a', index=False, header=False)
            result = {"col1": [], "col2": [], "col3": []}
            current_line = 0

# write whatever is left in the buffer after the last full batch
pd.DataFrame(result).to_csv("result_file.csv", mode='a', index=False, header=False)
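One caveat with the plain line.split(","): it breaks if any field contains a comma inside quotes. If that can happen in your file, the standard-library csv module handles quoting for you, so the same loop can be written with csv.reader (a sketch under the same assumptions of exactly three columns and the placeholder file names above):

import csv
import pandas as pd

to_check = "some string"
result = {"col1": [], "col2": [], "col3": []}

with open('filename.csv', newline='') as fp:
    for row in csv.reader(fp):
        val1, val2, val3 = row   # still assumes exactly three columns per row
        if (to_check in val1) or (to_check in val2) or (to_check in val3):
            result['col1'].append(val1)
            result['col2'].append(val2)
            result['col3'].append(val3)

# the same batched flushing as above can be added here; for brevity this
# sketch writes all matches once at the end
pd.DataFrame(result).to_csv("result_file.csv", index=False, header=False)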