Home > Enterprise >  Best way to read a huge CSV into a dataframe with a column with mixed value types
Best way to read a huge CSV into a dataframe with a column with mixed value types

Time:03-24

I'm trying to read a huge CSV file (almost 5GB) into a pandas dataframe. This CSV only has 3 columns like this:

 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   STORE_ID  404944 non-null  int64 
 1   SIZE      404944 non-null  int64 
 2   DISTANCE  404944 non-null  object

The problem is the column DISTANCE should only have int64 numbers, but somehow it contains some "null" values in the form of \\N. These \\N are causing my code to fail. Unfortunately I have no control over building this CSV, so I have no way of correcting it before hand.

This is a sample of the CSV:

STORE_ID,SIZE,DISTANCE
900072211,1,1000
900072212,1,1000
900072213,1,\\N
900072220,5,4500

I need to have this DISTANCE column with only int64 values.

Since the CSV is huge, I first tried to read it using the following code, assigning dtypes at the start:

df = pd.read_csv("polygons.csv", dtype={"STORE_ID": int, "SIZE": int, "DISTANCE": int})

But with this I got this error:

TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'

How would you go about efficiently ready this csv to a dataframe? Is there a way to assign a dtype to the DISTANCE column while reading?

CodePudding user response:

Use na_values as parameter of pd.read_csv, it should solve your problem:

df = pd.read_csv(..., na_values=r'\\N')

Output:

>>> df
    STORE_ID  SIZE  DISTANCE
0  900072211     1    1000.0
1  900072212     1    1000.0
2  900072213     1       NaN
3  900072220     5    4500.0

>>> df.dtypes
STORE_ID      int64
SIZE          int64
DISTANCE    float64
dtype: object

Update

You can also use converters:

convert_N = lambda x: int(x) if x != r'\\N' else 0
df = pd.read_csv(..., converters={'DISTANCE': convert_N})

Output:

>>> df
    STORE_ID  SIZE  DISTANCE
0  900072211     1      1000
1  900072212     1      1000
2  900072213     1         0
3  900072220     5      4500

>>> df.dtypes
x1    int64
x2    int64
x3    int64
dtype: object
  • Related