I have a dataset "app_metadata.csv" with three columns: item_id, category, and description. item_id is an integer, and category and description are strings. I loaded the dataset with the following code:
app_metadata_df = pd.read_csv(app_metadata_csv_path)
However, the dataset contains broken data; for example, there is a row where item_id is text rather than a number. I want to remove the rows with an invalid item_id and convert the item_id column to int. Here is what I have tried. First, I call pd.to_numeric with errors='coerce':
app_metadata_df.loc["item_id"] = pd.to_numeric(app_metadata_df["item_id"], errors='coerce')
Then I drop the NA values:
app_metadata_df.loc["item_id"] = app_metadata_df.loc["item_id"].dropna()
Finally, I call astype(int) to convert the column to int:
app_metadata_df.loc["item_id"] = app_metadata_df["item_id"].astype(int)
However, it throws the following error:
invalid literal for int() with base 10: 'So'
It looks like to_numeric does not convert some invalid values to NaN. Why is this happening, and how do I fix it?
CodePudding user response:
Try this:
app_metadata_df = pd.read_csv(app_metadata_csv_path)
app_metadata_df['item_id'] = pd.to_numeric(app_metadata_df["item_id"], errors='coerce')
app_metadata_df = app_metadata_df[app_metadata_df['item_id'].notna()].reset_index()
app_metadata_df["item_id"] = app_metadata_df["item_id"].astype(int)
>>> print(app_metadata_df)
index item_id category \
0 0 593676.0 HEALTH_AND_FITNESS
1 1 601235.0 GAME
2 2 860079.0 COMMUNICATION
3 3 64855.0 VIDEO_PLAYERS
4 4 597756.0 MEDICAL
... ... ... ...
98577 98594 683377.0 TOOLS
98578 98595 862905.0 FINANCE
98579 98596 165878.0 MUSIC_AND_AUDIO
98580 98597 683417.0 PHOTOGRAPHY
98581 98598 703224.0 GAME
description
0 Abs Workout, designed by professional fitness ...
1 The best building game on android is free to d...
2 Tamil Actress Stickers app has 200 Tamil her...
3 The simplest VLC Remote you'll ever find. Peri...
4 This is the official mobile app of the Nationa...
... ...
98577 endoscope app for android an app to connect wi...
98578 Acerca de esta app<br>La App OCA está pensada ...
98579 This app provides free downloading of audio sh...
98580 <b>Water Paint : Colour Effect</b><br><br>Want...
98581 DIAMOND CRUSH with spectacular graphics and ex...
[98582 rows x 4 columns]
>>> print(app_metadata_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98582 entries, 0 to 98581
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 98582 non-null int64
1 item_id 98582 non-null int32
2 category 98582 non-null object
3 description 98582 non-null object
dtypes: int32(1), int64(1), object(2)
memory usage: 2.6 MB
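A side note on the extra "index" column in the output above: it comes from calling reset_index() without drop=True, which keeps the old row labels as a regular column. The following is only a minimal sketch on a made-up three-row frame (the values are hypothetical, not taken from app_metadata.csv) that shows the difference:

```python
import pandas as pd

# Hypothetical toy data: the middle row stands in for a coerced (NaN) item_id
df = pd.DataFrame({"item_id": [1.0, None, 3.0]})

# Keep only the rows with a valid item_id
kept = df[df["item_id"].notna()]

# Default reset_index() moves the old row labels into a new "index" column
with_index_col = kept.reset_index()

# drop=True discards the old labels instead of keeping them as a column
without_index_col = kept.reset_index(drop=True)

print(list(with_index_col.columns))
print(list(without_index_col.columns))
```

If the old row positions are not needed, drop=True keeps the frame at its original three columns.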
CodePudding user response:
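The problem is that you use app_metadata_df.loc["item_id"], which indexes by row label. The assignment therefore creates a new index entry item_id instead of modifying the column, so the item_id column still holds the original strings when astype(int) runs. Assign to the column directly: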
app_metadata_df["item_id"] = pd.to_numeric(app_metadata_df["item_id"], errors='coerce')
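The dropna step from the question has a second pitfall: assigning the result of dropna() back to the column realigns it on the index, so the missing rows come back as NaN. Dropping the rows from the whole DataFrame avoids that. Here is a minimal sketch of the full cleanup, assuming a small in-memory frame as a stand-in for app_metadata.csv (the rows and the name raw are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for app_metadata.csv with one broken item_id ("So")
raw = pd.DataFrame({
    "item_id": ["593676", "So", "601235"],
    "category": ["HEALTH_AND_FITNESS", "GAME", "GAME"],
    "description": ["first app", "broken row", "third app"],
})

# Pitfall: assigning dropna() back to the column realigns on the index,
# so the dropped position is filled with NaN again
reassigned = raw.copy()
reassigned["item_id"] = pd.to_numeric(reassigned["item_id"], errors="coerce")
reassigned["item_id"] = reassigned["item_id"].dropna()

# Fix: coerce the column, drop the invalid rows from the frame, then cast
app_metadata_df = raw.copy()
app_metadata_df["item_id"] = pd.to_numeric(app_metadata_df["item_id"], errors="coerce")
app_metadata_df = app_metadata_df.dropna(subset=["item_id"]).reset_index(drop=True)
app_metadata_df["item_id"] = app_metadata_df["item_id"].astype(int)

print(app_metadata_df.dtypes)
```

The exact integer width (int32 or int64) depends on the platform, as the info() output in the other answer suggests.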