I have a big dataset, and I cannot convert a column's dtype from object to int because of the error "invalid literal for int() with base 10:". After some research, I found that this happens because some values in the column are non-numeric strings.
How can I find those strings and replace them with numeric values?
CodePudding user response:
You might be looking for .str.isnumeric(), which lets you split the column into the strings that are numbers and the ones that are not, and act on each group independently. You will still need to decide what should happen to the non-numeric values:
- converted (maybe they're money and you want to strip the €, or dates in a format other than a UNIX epoch, or any number of other possibilities)
- dropped (just throw them away)
- something else
>>> import pandas as pd
>>> df = pd.DataFrame({"a": ["1", "2", "x"]})
>>> df
   a
0  1
1  2
2  x
>>> df[df["a"].str.isnumeric()]
   a
0  1
1  2
>>> df[~df["a"].str.isnumeric()]
   a
2  x
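One caveat worth checking before relying on this filter: .str.isnumeric() tests every character, so strings with a sign or a decimal point are also reported as non-numeric. A minimal sketch, using made-up sample values that are not from the question:

```python
import pandas as pd

# Made-up sample data: "x" is junk, but "-3" and "4.5" are valid numbers.
df = pd.DataFrame({"a": ["1", "2", "x", "-3", "4.5"]})

# isnumeric() is True only when every character is a numeric character,
# so the "-" in "-3" and the "." in "4.5" make those strings fail as well.
is_num = df["a"].str.isnumeric()
offenders = df.loc[~is_num, "a"].tolist()
print(offenders)
```

If the column can hold negative or decimal values, this filter will flag them alongside the real junk, so look at the flagged values before dropping or replacing anything.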
CodePudding user response:
Assuming 'col' is the column name.
Just force the conversion to numeric; values that cannot be parsed become NaN:
df['col_num'] = pd.to_numeric(df['col'], errors='coerce')
If needed, you can check which original values produced NaNs using:
df.loc[df['col'].notna() & df['col_num'].isna(), 'col']
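Putting those two steps together, one possible end-to-end flow looks like the sketch below. The sample data, the replacement value 0, and the 'col_int' column name are assumptions for illustration; pick whatever replacement makes sense for your data.

```python
import pandas as pd

# Made-up sample data: one junk string ("x") and one genuinely missing value.
df = pd.DataFrame({"col": ["1", "2", "x", None, "4"]})

# Unparseable strings become NaN instead of raising an error.
df["col_num"] = pd.to_numeric(df["col"], errors="coerce")

# Rows that had a value originally but failed to parse.
bad = df.loc[df["col"].notna() & df["col_num"].isna(), "col"]
bad_values = bad.unique().tolist()
print(bad_values)

# Assumption: replace the unparseable values with 0.
df.loc[bad.index, "col_num"] = 0

# The nullable Int64 dtype keeps the originally missing row as <NA>,
# whereas a plain astype(int) would fail on the remaining NaN.
df["col_int"] = df["col_num"].astype("Int64")
print(df)
```

Inspecting bad_values first is the useful part: it shows exactly which strings are blocking the conversion before you decide how to replace them.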
CodePudding user response:
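"Base 10" in the message only means that int() is parsing the string as a decimal number; it does not mean the value is a float. That said, int() does reject strings that contain a decimal point, such as "3.0". If that is what your column contains, in Python you would do
int(float(____))
Since you used int(), I'm guessing you need an integer value. Note that int() truncates toward zero, so int(float("3.7")) gives 3.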
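A small sketch of that idea in plain Python. The helper name to_int is hypothetical and not part of any library:

```python
def to_int(text):
    """Hypothetical helper: parse strings like '3' or '3.0' into an int."""
    try:
        # Try int() first so large integers are not rounded by float.
        return int(text)
    except ValueError:
        # Fall back for decimal strings; int() truncates toward zero.
        return int(float(text))


print(to_int("42"), to_int("3.0"), to_int("3.7"), to_int("-3.7"))
```

A string that is not a number at all, such as "x", still raises ValueError from float(), so this only helps when the offending values are decimals.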