I'm trying to read a huge CSV file (almost 5GB) into a pandas dataframe. The CSV has only 3 columns, like this:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 STORE_ID 404944 non-null int64
1 SIZE 404944 non-null int64
2 DISTANCE 404944 non-null object
The problem is that the DISTANCE column should contain only int64 numbers, but it somehow contains some "null" values in the form of \N. These \N values are causing my code to fail. Unfortunately I have no control over how this CSV is built, so I have no way of correcting it beforehand.
This is a sample of the CSV:
STORE_ID,SIZE,DISTANCE
900072211,1,1000
900072212,1,1000
900072213,1,\N
900072220,5,4500
I need to have this DISTANCE column with only int64 values.
Since the CSV is huge, I first tried to read it using the following code, assigning dtypes at the start:
df = pd.read_csv("polygons.csv", dtype={"STORE_ID": int, "SIZE": int, "DISTANCE": int})
But with this I got this error:
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'
How would you go about efficiently reading this CSV into a dataframe? Is there a way to assign a dtype to the DISTANCE column while reading?
CodePudding user response:
Use the na_values parameter of pd.read_csv; it should solve your problem:
df = pd.read_csv(..., na_values=r'\N')
Output:
>>> df
STORE_ID SIZE DISTANCE
0 900072211 1 1000.0
1 900072212 1 1000.0
2 900072213 1 NaN
3 900072220 5 4500.0
>>> df.dtypes
STORE_ID int64
SIZE int64
DISTANCE float64
dtype: object
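Note that NaN forces the column to float64, while the question asks for int64. If pandas' nullable integer extension dtype is acceptable (capital-I "Int64", not "int64"), you can combine na_values with dtype so the non-null values stay integers. A minimal sketch, assuming the file is named polygons.csv as in the question:

import pandas as pd

# Treat \N as missing, but store DISTANCE as a nullable integer
# instead of letting NaN silently promote the column to float64.
df = pd.read_csv(
    "polygons.csv",
    na_values=r"\N",
    dtype={"STORE_ID": "int64", "SIZE": "int64", "DISTANCE": "Int64"},
)

Missing entries then show up as pd.NA and df["DISTANCE"].dtype reports Int64, so the remaining values are never converted to floats.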
Update
You can also use the converters parameter:
convert_N = lambda x: int(x) if x != r'\N' else 0
df = pd.read_csv(..., converters={'DISTANCE': convert_N})
Output:
>>> df
STORE_ID SIZE DISTANCE
0 900072211 1 1000
1 900072212 1 1000
2 900072213 1 0
3 900072220 5 4500
>>> df.dtypes
STORE_ID     int64
SIZE         int64
DISTANCE     int64
dtype: object
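Beware that mapping \N to 0 makes missing distances indistinguishable from genuine zeros; whether that matters depends on your downstream use. Also, since the file is almost 5GB, reading it in chunks can keep peak memory down. A sketch using read_csv's chunksize parameter (the one-million-row chunk size is an arbitrary choice, not from the question):

import pandas as pd

parts = []
for chunk in pd.read_csv(
    "polygons.csv",
    na_values=r"\N",
    dtype={"STORE_ID": "int64", "SIZE": "int64", "DISTANCE": "Int64"},
    chunksize=1_000_000,  # rows per chunk; tune to your memory budget
):
    # Filter or aggregate each chunk here before keeping it,
    # so the full 5GB never has to sit in memory at once.
    parts.append(chunk)

df = pd.concat(parts, ignore_index=True)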