I have a 13 GB CSV file and I need to read it and filter rows from it. I am using pandas and reading it in chunks, but it takes too long. Are there any other Python libraries that are faster than pandas, or would writing custom code in C be a better option?
I am using the following code:
import pandas as pd

input_df = pd.read_csv("input file", chunksize=60000)
frames = []
for i in input_df:
    # keep rows where any of the three columns contains the string
    filter_df = i[i["Column1"].str.contains("given string")
                  | i["column2"].str.contains("given string")
                  | i["column3"].str.contains("given string")]
    frame = pd.DataFrame(filter_df)
    frames.append(frame)
output_df = pd.concat(frames)
output_df.to_csv('output.csv', index=False)
I have 8 GB of RAM, so I have to read the data in chunks.
CodePudding user response:
Pandas and NumPy are built on C, so I don't see how you would gain much speed by writing pure C yourself; badly written C code might even make things worse.
Try focusing instead on improving your algorithm or the way you are currently reading the file.
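For instance, with the chunked pandas version you already have, you can append each filtered chunk straight to the output file instead of collecting every chunk in a list and concatenating at the end, so memory use stays roughly constant. A minimal sketch, assuming the same column names as your code and that "given string" is a plain substring (so regex matching can be turned off):

import pandas as pd

pattern = "given string"
first_chunk = True
# dtype=str reads every column as plain text, which skips numeric type inference
for chunk in pd.read_csv("input file", chunksize=60000, dtype=str):
    # combine the three substring checks into one boolean mask;
    # regex=False skips regex compilation for a plain substring
    mask = (chunk["Column1"].str.contains(pattern, regex=False, na=False)
            | chunk["column2"].str.contains(pattern, regex=False, na=False)
            | chunk["column3"].str.contains(pattern, regex=False, na=False))
    # append the filtered rows to the output as we go
    chunk[mask].to_csv("output.csv", mode="a", index=False, header=first_chunk)
    first_chunk = False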
If all you want to do is read a CSV file and filter the rows based on whether they contain a certain string, then I think reading it line by line would be a better approach:
import pandas as pd

# store matching rows here until the next flush to disk
result = {"col1": [], "col2": [], "col3": []}
to_check = "some string"
reset_after = 1000   # flush the buffer to the output file every 1000 lines
current_line = 0

with open('filename.csv', 'r') as fp:
    while (line := fp.readline()) != '':
        val1, val2, val3 = line.rstrip("\n").split(",")
        # keep the row if any of the three columns contains the string
        if (to_check in val1) or (to_check in val2) or (to_check in val3):
            result['col1'].append(val1)
            result['col2'].append(val2)
            result['col3'].append(val3)
        current_line += 1
        # after every reset_after lines, append the buffered rows to the
        # result file and reset the buffer so memory stays bounded
        if current_line >= reset_after:
            df = pd.DataFrame(result)
            df.to_csv("result_file.csv", mode='a', index=False, header=False)
            result = {"col1": [], "col2": [], "col3": []}
            current_line = 0

# write whatever is left in the buffer after the last full batch
pd.DataFrame(result).to_csv("result_file.csv", mode='a', index=False, header=False)
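One caveat with the plain line.split(","): it breaks if any field contains a comma inside quotes. If that can happen in your file, the standard-library csv module handles quoting for you, so the same loop can be written with csv.reader (a sketch under the same assumptions of exactly three columns and the placeholder file names above):

import csv
import pandas as pd

to_check = "some string"
result = {"col1": [], "col2": [], "col3": []}

with open('filename.csv', newline='') as fp:
    for row in csv.reader(fp):
        val1, val2, val3 = row   # still assumes exactly three columns per row
        if (to_check in val1) or (to_check in val2) or (to_check in val3):
            result['col1'].append(val1)
            result['col2'].append(val2)
            result['col3'].append(val3)

# the same batched flushing as above can be added here; for brevity this
# sketch writes all matches once at the end
pd.DataFrame(result).to_csv("result_file.csv", index=False, header=False)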