I have the following code to load data:
import pandas as pd
data = pd.read_csv("Salary-Data.csv")
data["Income"] = data["Income"].str.strip()
#data["Income"] = data["Income"].apply(pd.to_numeric, errors='coerce')
#data["Income"] = data["Income"].astype(int)
data
This produces the following error:
~/miniconda3/envs/scientific-base/lib/python3.8/site-packages/pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()
ValueError: invalid literal for int() with base 10: '16\xa0638'
The first value in the Income column is 16 638 (with a space).
If I comment out the erroring line and inspect the dataframe, the values in the Income column still contain spaces.
What is going on? How can I make this column into one of valid integers or floats?
CodePudding user response:
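Change strip to replace. strip only removes leading and trailing whitespace, while the separator here sits in the middle of the value and is a non-breaking space (\xa0, as the error message shows), not an ordinary space:
data["Income"] = data["Income"].str.replace('\xa0', '', regex=False)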
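A quick way to see why strip has no effect, using the value from the error message in plain Python. This is only a sketch; the variable names are illustrative and not part of the original code:

```python
# The value from the error message: digits separated by a non-breaking space (U+00A0).
raw = "16\xa0638"

# strip() only trims the ends of the string, so the separator in the middle survives.
stripped = raw.strip()

# Replacing an ordinary space changes nothing, because U+00A0 is a different character.
plain_space_removed = raw.replace(" ", "")

# Replacing the non-breaking space itself leaves only digits, which int() accepts.
cleaned = raw.replace("\xa0", "")
value = int(cleaned)

print(repr(stripped), repr(plain_space_removed), value)
```

The same reasoning applies to the pandas .str.strip and .str.replace methods, which apply these string operations to each element of the column.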
CodePudding user response:
Here is another way to do it: remove all non-digit characters and assign the result back to the column.
data["Income"] = data["Income"].replace(r'\D','',regex=True)
To keep the decimal separator as part of the number:
data["Income"] = data["Income"].replace(r'[^0-9,\.]','',regex=True)
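After cleaning, the column still holds strings, so it needs an explicit conversion. Below is a minimal sketch of one way to do both steps. It assumes a small made-up DataFrame in place of Salary-Data.csv (only the first value comes from the question) and assumes that unparseable entries should become missing values rather than raise an error:

```python
import pandas as pd

# Hypothetical stand-in for Salary-Data.csv; only the first value is from the question.
data = pd.DataFrame({"Income": ["16\xa0638", " 20\xa0000 ", "n/a"]})

# Remove every whitespace character. For str patterns, Python's \s also matches
# the non-breaking space U+00A0.
cleaned = data["Income"].str.replace(r"\s+", "", regex=True)

# Convert to numbers; entries that cannot be parsed become NaN instead of raising.
data["Income"] = pd.to_numeric(cleaned, errors="coerce")

# The nullable integer dtype stores whole numbers while keeping missing values.
data["Income"] = data["Income"].astype("Int64")

print(data)
```

If the column can contain fractional values, stopping after pd.to_numeric and keeping the float dtype is probably the simpler choice.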