Optimize processing of large CSV file Python

I have a CSV file of about 175 million lines (2.86 GB), composed of three columns: ID1, ID2, and val.

I need to get the value in column "val" for a given pair of "ID1" and "ID2". I query this dataframe constantly with varying combinations of ID1 and ID2, and each combination is unique in the whole file.

I have tried to use pandas as shown below, but the lookups take a lot of time.

def is_av(Qterm, Cterm, df):
    # df is indexed by (ID1, ID2); return the matching "val", or 0 if the pair is absent
    try:
        return df.loc[(Qterm, Cterm), 'val']
    except KeyError:
        return 0

Is there a faster way to access CSV values, knowing that each value is located in a single row of the whole file? If not, could you check this function and tell me what might be causing the slow processing?

for nc in L:  # ID1
    score = 0.0
    for ni in id_list:  # ID2
        e = is_av(ni, nc, df_g)
        InDegree = df1.loc[ni].values[0]
        SumInMap = df2.loc[nc].values[0]
        score = score + term_score(InDegree, SumInMap, e)  # compute a score
    key = pd_df3.loc[nc].values[0]
    tmt[key] = score

CodePudding user response:

import os
os.system('pip install dask')  # install dask if it is not already available
import dask.dataframe as dd

# Read the CSV lazily and filter for one (ID1, ID2) pair; nothing is loaded
# into memory until .compute() is called.
dd_data = dd.read_csv('sample.csv')
bool_filter_conditions = (dd_data['ID1'] == 'a') & (dd_data['ID2'] == 'b')
dd_result = dd_data[bool_filter_conditions][['val']]
dd_output = dd_result.compute()
dd_output
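
For repeated lookups with different (ID1, ID2) pairs, the same filter can be wrapped in a helper. This is only a sketch, assuming the sample.csv file and column names above; note that every call re-scans the CSV partitions, so it is still slow inside a tight loop:

import dask.dataframe as dd

dd_data = dd.read_csv('sample.csv')

def lookup_val(id1, id2, ddf=dd_data):
    # Boolean-filter the lazy dataframe, then compute only the matching rows.
    matches = ddf[(ddf['ID1'] == id1) & (ddf['ID2'] == id2)]['val'].compute()
    return matches.iloc[0] if len(matches) else 0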

CodePudding user response:

TL;DR: Use a DBMS (I suggest MySQL or PostgreSQL). Pandas is definitely not suited for this sort of work. Dask is better, but not as good as a traditional DBMS.


The absolute best way of doing this would be to use SQL; consider MySQL or PostgreSQL for starters (both free and very efficient for your use case). While Pandas is an incredibly strong library, indexing and fast point lookups are not something it excels at, since it either has to load the data into memory or stream over it with far less control than a DBMS has.

Consider your use case, where you want to skip most rows. Say you are looking for (ID1, ID2) with values of (3108, 4813). You want to skip every row that starts with anything other than 3, then anything other than 31, and so on, and then skip any row that does not start with 3108,4 (assuming your CSV delimiter is a comma), until you reach exactly the ID1 and ID2 you are looking for; this is reading the data at the character level.

Pandas does not allow you to do this (as far as I know; someone can correct this response if it does). The other answer uses Dask, a library designed to handle data much larger than RAM at scale, but it is not optimized for index management the way a DBMS is. Don't get me wrong, Dask is good, but not for your use case.

Another very basic alternative would be to index your data on ID1 and ID2, store it indexed, and look values up by reading the file directly, skipping lines that do not start with your ID1 and then skipping lines that do not match your ID2, as in the sketch below. The best practice, however, is to use a DBMS, since caching, read optimization, and many other serious advantages come with it, reducing the I/O read time from your disk.
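
A minimal sketch of that file-scan idea with the standard csv module (the path and the column order ID1, ID2, val are assumptions) could look like this:

import csv

def scan_csv_for_pair(path, id1, id2):
    # Stream the file once, skipping every row whose ID1/ID2 do not match.
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            if row[0] == id1 and row[1] == id2:
                return row[2]
    return 0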

You can get started with MySQL here: https://dev.mysql.com/doc/mysql-getting-started/en/

You can get started with PostgreSQL here: https://www.postgresqltutorial.com/postgresql-getting-started/
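
To see the idea in action without setting up a server, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the DBMS approach described above; the file name, column names, and column types are assumptions:

import csv
import sqlite3

conn = sqlite3.connect('lookup.db')
conn.execute('CREATE TABLE IF NOT EXISTS pairs (ID1 TEXT, ID2 TEXT, val REAL)')

# One-time load of the CSV into the table.
with open('sample.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    conn.executemany('INSERT INTO pairs VALUES (?, ?, ?)', reader)

# The composite index is what turns each lookup into a fast index seek.
conn.execute('CREATE INDEX IF NOT EXISTS idx_pairs ON pairs (ID1, ID2)')
conn.commit()

def is_av(Qterm, Cterm):
    row = conn.execute('SELECT val FROM pairs WHERE ID1 = ? AND ID2 = ?',
                       (Qterm, Cterm)).fetchone()
    return row[0] if row else 0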
