I have one numeric feature in a data frame but in excel some of the values contain quotes which need to be removed.
Below table is what my data appears to be in Excel file now I want to remove quotes from last 3 rows using python.
Col1 | Col2 |
---|---|
123 | A |
456 | B |
789 | C |
"123" | D |
"456" | E |
"789" | F |
I have used following code in Python:
df["Col1"] = df['Col1'].replace('"', ' ').astype(int)
But above code gives me error message: invalid literal for int() with base 10: '"123"'.
I have also tried strip() function but still it is not working.
If I do not convert the data type and use below code
df["Col1"] = df['Col1'].replace('"', ' ')
Then the code is getting executed without any error however while saving the file into CSV it is still showing quotes.
CodePudding user response:
One way is to use converter function while reading Excel file. Something along those lines (assuming that data provided is in Excel file in columns 'A' and 'B'):
import pandas as pd
def conversion(value):
if type(value) == int:
return value
else:
return value.strip('"')
df = pd.read_excel('remove_quotes_excel.xlsx', header=None,
converters={0: conversion})
# df
0 1
0 123 A
1 456 B
2 789 C
3 123 D
4 456 E
5 789 F
Both columns are object type, but now (if needed) it is straightforward to convert to int:
df[0] = df[0].astype(int)
CodePudding user response:
You can do it by using this code. (Regex is if you get a warning)
df.Col1.replace('\"', '', regex = True, inplace = True)
CodePudding user response:
First convert the Col1 into series
df_Series = df['Col1']
Apply replace function on series
df_Series = df_Series.replace('"','').astype(int)
then append the Series into df data frame.
CodePudding user response:
Convert the values in the 'Col1' column of the dataframe 'df' to numeric values. If any of the values in the column cannot be converted to a numeric value, it will be set to NaN.
#df['Col1'] = df['Col1'].astype(int)
df['Col1'] = pd.to_numeric(df['Col1'], errors='coerce')