I am trying to read rows with specific IDs from a large (5 GB) file.
For example, below is a sample file with the columns ['ID', 'Fruit Name']:
ID Fruit Name
'234' Orange
'789' Grapes
'567' Fig
'412' Carrot
'589' Apple
'609' Pomegranate
'778' Mango
'889' Date
'999' Banana
'101' Strawberry
I am reading the specific IDs from the file with the code below. It first loads the whole file into memory and then extracts the data for the given IDs. Could this be done another way, so that the data for the specific IDs is extracted without loading the whole file?
import pandas as pd

df = pd.read_csv("fruit.csv")
df = df[df.isin(['101', '567']).any(axis=1)]
Please suggest a way to handle the above use case with pandas or a context manager in Python.
CodePudding user response:
You can treat the file as a text file and read the lines one by one without loading them all into memory. If a line matches the condition, append it to the DataFrame:
import pandas as pd

with open('fruit.csv', 'r', encoding='utf-8') as f:
    # the first line holds the column names
    df = pd.DataFrame(columns=next(f).strip().split(','))
    for line in f:
        values = line.strip().split(',')
        if values[0] in ["'101'", "'567'"]:
            df.loc[len(df)] = values  # append the matching row

print(df)
Output:

      ID  Fruit Name
0  '567'         Fig
1  '101'  Strawberry
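As a side note, splitting on ',' breaks if a field itself contains a comma. The standard-library csv module handles quoting correctly, and collecting the matching rows in a list avoids the slow repeated df.loc assignment. A minimal sketch of the same line-by-line filter, assuming the quoted IDs from the question's sample file:

import csv
import pandas as pd

wanted = {"'101'", "'567'"}  # IDs exactly as they appear in the file
rows = []
with open('fruit.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    header = next(reader)        # first row holds the column names
    for values in reader:
        if values[0] in wanted:  # keep only the requested IDs
            rows.append(values)

df = pd.DataFrame(rows, columns=header)
print(df)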
CodePudding user response:
A DataFrame together with the query function gives fast filtering. Read the data into a DataFrame and use query:
import pandas as pd

df = pd.read_csv("fruit.csv")
# query returns a new DataFrame, so assign the result;
# 'in' assumes numeric IDs, as in the CSV at the end of the page
df = df.query('ID in [101, 567]')
print(df)
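Note that read_csv above still loads the whole file into memory. To keep memory bounded you can combine query with the chunksize parameter, which makes read_csv yield smaller DataFrames one at a time. A minimal sketch under the same numeric-ID assumption (the chunk size of 100000 rows is an arbitrary choice):

import pandas as pd

parts = []
# chunksize turns read_csv into an iterator over DataFrames of up to 100000 rows each
for chunk in pd.read_csv("fruit.csv", chunksize=100000):
    parts.append(chunk.query('ID in [101, 567]'))

df = pd.concat(parts, ignore_index=True)
print(df)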
CodePudding user response:
An approach similar to that of @Guy, but using the DataFrame constructor to avoid the repeated loc assignment:
import pandas as pd

S = {'101', '567'}  # the IDs to keep
with open('fruit.csv') as f:
    cols = next(f).strip().split(',')
    df = pd.DataFrame([l for x in f
                       if (l := x.rstrip().split(','))[0] in S],
                      columns=cols)
print(df)
Output:

    ID  Fruit Name
0  567         Fig
1  101  Strawberry
Used input CSV (fruit.csv):
ID,Fruit Name
234,Orange
789,Grapes
567,Fig
412,Carrot
589,Apple
609,Pomegranate
778,Mango
889,Date
999,Banana
101,Strawberry
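Note that the walrus operator (:=) in the list comprehension requires Python 3.8+. On older versions, the same split-each-line-once filter can be written with a generator expression; a sketch under the same assumptions:

import pandas as pd

S = {'101', '567'}
with open('fruit.csv') as f:
    cols = next(f).strip().split(',')
    # split each line once, then keep only rows whose ID is in S
    rows = (line.rstrip().split(',') for line in f)
    df = pd.DataFrame([r for r in rows if r[0] in S], columns=cols)
print(df)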