Pandas to_numeric(errors='coerce') does not convert invalid values to NaN


I have a dataset "app_metadata.csv" with three columns: item_id, category, and description. item_id is an integer, while category and description are strings. I loaded the dataset with the following code:

app_metadata_df = pd.read_csv(app_metadata_csv_path)

However, the dataset contains broken data; for example, there is a row in which item_id is text rather than a number. I want to remove the rows with an invalid item_id value and convert the item_id column to int. Here is what I have tried. First, I call pd.to_numeric with errors='coerce':

app_metadata_df.loc["item_id"] = pd.to_numeric(app_metadata_df["item_id"], errors='coerce')

Then I drop the NA values:

app_metadata_df.loc["item_id"] = app_metadata_df.loc["item_id"].dropna()

Finally, I call astype(int) to convert the column to int:

app_metadata_df.loc["item_id"] = app_metadata_df["item_id"].astype(int)

However, it throws the following error:

invalid literal for int() with base 10: 'So'

It looks like to_numeric does not convert some invalid values to NaN. Why is this happening, and how do I fix it?

CodePudding user response:

Try this:

app_metadata_df = pd.read_csv(app_metadata_csv_path)

app_metadata_df['item_id'] = pd.to_numeric(app_metadata_df["item_id"], errors='coerce')
app_metadata_df = app_metadata_df[app_metadata_df['item_id'].notna()].reset_index()

app_metadata_df["item_id"] = app_metadata_df["item_id"].astype(int)
>>> print(app_metadata_df)
       index   item_id            category  \
0          0  593676.0  HEALTH_AND_FITNESS   
1          1  601235.0                GAME   
2          2  860079.0       COMMUNICATION   
3          3   64855.0       VIDEO_PLAYERS   
4          4  597756.0             MEDICAL   
...      ...       ...                 ...   
98577  98594  683377.0               TOOLS   
98578  98595  862905.0             FINANCE   
98579  98596  165878.0     MUSIC_AND_AUDIO   
98580  98597  683417.0         PHOTOGRAPHY   
98581  98598  703224.0                GAME   

0      Abs Workout, designed by professional fitness ...  
1      The best building game on android is free to d...  
2      Tamil Actress Stickers app has 200   Tamil her...  
3      The simplest VLC Remote you'll ever find. Peri...  
4      This is the official mobile app of the Nationa...  
...                                                  ...  
98577  endoscope app for android an app to connect wi...  
98578  Acerca de esta app<br>La App OCA está pensada ...  
98579  This app provides free downloading of audio sh...  
98580  <b>Water Paint : Colour Effect</b><br><br>Want...  
98581  DIAMOND CRUSH with spectacular graphics and ex...  

[98582 rows x 4 columns]
>>> print(app_metadata_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98582 entries, 0 to 98581
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   index        98582 non-null  int64 
 1   item_id      98582 non-null  int32 
 2   category     98582 non-null  object
 3   description  98582 non-null  object
dtypes: int32(1), int64(1), object(2)
memory usage: 2.6+ MB
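
For reference, here is a minimal sketch of the same cleanup on a small in-memory sample, so it can be run without the real file. The sample rows and the helper name clean_item_ids are made up for illustration; only the column names and the broken value 'So' come from the question. It uses dropna(subset=...) and reset_index(drop=True), which avoids the extra index column that appears in the output above, and casts to "int64" explicitly so the result does not depend on the platform's default int size.

```python
import io

import pandas as pd

# Hypothetical sample standing in for app_metadata.csv. The row whose
# item_id is "So" mimics the broken record described in the question.
sample_csv = io.StringIO(
    "item_id,category,description\n"
    "593676,HEALTH_AND_FITNESS,Abs Workout\n"
    "So,GAME,broken row\n"
    "601235,GAME,Building game\n"
)


def clean_item_ids(df):
    """Return a copy of df without non-numeric item_id rows, item_id as int64."""
    out = df.copy()
    # Assign to the column (df["item_id"]), not to df.loc["item_id"].
    out["item_id"] = pd.to_numeric(out["item_id"], errors="coerce")
    # Drop rows where coercion produced NaN; drop=True discards the old index.
    out = out.dropna(subset=["item_id"]).reset_index(drop=True)
    out["item_id"] = out["item_id"].astype("int64")
    return out


app_metadata_df = clean_item_ids(pd.read_csv(sample_csv))
print(app_metadata_df.dtypes)
```

Note that astype("int64") truncates any fractional values that survive coercion, so this sketch assumes every valid item_id is a whole number.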

CodePudding user response:

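The problem is that app_metadata_df.loc["item_id"] indexes rows by label, not columns, so the assignment creates a new row labeled item_id and does not convert the existing values in the item_id column. The later astype(int) therefore still sees the original strings, such as 'So'. Assign to the column instead: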

app_metadata_df["item_id"] = pd.to_numeric(app_metadata_df["item_id"], errors='coerce')
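
To make the row-versus-column distinction concrete, here is a small sketch on a made-up three-row frame (the data is hypothetical, not taken from the real CSV). It checks where the label item_id actually lives and then performs the column assignment shown above.

```python
import pandas as pd

# Hypothetical miniature of the question's data.
df = pd.DataFrame({"item_id": ["1", "So", "3"], "category": ["A", "B", "C"]})

# .loc[label] with a single label looks in the row index, and the row
# labels here are 0, 1, 2. "item_id" is a column label only.
is_row_label = "item_id" in df.index
is_column_label = "item_id" in df.columns
print(is_row_label, is_column_label)

# Column assignment replaces the whole column, so the coerced values stick.
df["item_id"] = pd.to_numeric(df["item_id"], errors="coerce")
n_invalid = int(df["item_id"].isna().sum())
print(df["item_id"].dtype, n_invalid)
```

After this assignment the invalid entry is NaN, so dropping the NaN rows and calling astype(int) should work as intended.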
