Home > Net >  Read specific ID's from a large file with Python
Read specific ID's from a large file with Python

Time:09-08

I am trying to read some specific ids from a large file 5GB. For Ex below is the sample file having columns ['ID','Fruit Name']

ID     Fruit Name       
'234'   Orange          
'789'   Grapes      
'567'   Fig         
'412'   Carrot      
'589'   Apple       
'609'   Pomegranate
'778'   Mango       
'889'   Date        
'999'   Banana      
'101'   Strawberry

Trying to read some specific IDs from above file by below code. It takes a first whole file into memory then perform extraction data of specific ID. Could this process be done in another way such that without loading whole file only specific ID data can be extracted.

df=pd.read_csv("fruit.csv")
df=df[df.isin(['101','567']).any(1).values]

Please suggest idea of doing above use case with pandas or context manager in python.

CodePudding user response:

You can treat the file as a text file and read the lines one by one without loading them to the memory. If any line match the condition append it to the dataframe

with open('fruit.csv', 'r', encoding='utf-8') as f:
    df = pd.DataFrame(columns=next(f).strip().split(','))
    for line in f:
        values = line.strip().split(',')
        if values[0] in ["'101'", "'567'"]:
            df.loc[len(df)] = values
    print(df)

Output

     ID  Fruit Name
0  '567'         Fig
1  '101'  Strawberry

CodePudding user response:

DataFrame and "query" function are for fast manipulation.

data convert to DataFrame and use "query":

df = pd.read_csv("fruit.csv")
df = pd.DataFrame(df)
df.query('ID >101 and ID < 567')
print(df)

CodePudding user response:

An approach similar to that of @Guy, but using the DataFrame constructor to avoid the repeated loc assignment:

with open('fruits.csv') as f:
    cols = next(f).strip().split(',')
    S = {'101', '567'}
    df = pd.DataFrame([l for x in f
                       if (l:=x.rstrip().split(','))[0] in S],
                      columns=cols
                     )
print(df)

output:

    ID  Fruit Name
0  567         Fig
1  101  Strawberry

used input csv:

ID,Fruit Name       
234,Orange          
789,Grapes      
567,Fig         
412,Carrot      
589,Apple       
609,Pomegranate
778,Mango       
889,Date        
999,Banana      
101,Strawberry
  • Related