Home > Software design >  Replacing all instances of standalone "." in a pandas dataframe
Replacing all instances of standalone "." in a pandas dataframe

Time:09-27

Beginner question incoming.

I have a dataframe derived from an excel file with a column that I will call "input". In this column are floats (e.g. 7.4, 8.1, 2.2,...). However, there are also some wrong values such as strings (which are easy to filter out) and, what I find difficult, single instances of "." or "..". I would like to clean the column to generate only numeric float values.

I have used this approach for other columns, but cannot do so here because if I get rid of the "." instances, my floats will be messed up:

for col in [col for col in new_df.columns if col.startswith("input")]:
    new_df[col] = new_df[col].str.replace(r',| |\-|\^|\ |#|j|0|.', '', regex=True)
    new_df[col] = pd.to_numeric(new_df[col], errors='raise')

I have also tried the following, but it then replaces every value in the column with None:

for index, row in new_df.iterrows():
    col_input = row['input']
    if re.match(r'^-?\d (?:.\d )$', str(col_input)) is None:
        new_df["input"] = None

How do I get rid of the dots?

Thanks!

CodePudding user response:

You can simply use pandas.to_numeric and pass errors='coerce' without the loop :

from io import StringIO
import pandas as pd

s = """input
7.4
8.1
2.2
foo
foo.bar
baz/foo"""

df = pd.read_csv(StringIO(s))

df['input'] = pd.to_numeric(df['input'], errors='coerce')

# Outputs :

print(df)

   input
0    7.4
1    8.1
2    2.2
3    NaN
4    NaN
5    NaN

df.dropna(inplace=True)

print(df)

   input
0  7.4
1  8.1
2  2.2

If you need to clean up multiple mixed columns, use :

cols = ['input', ...] # put here the name of the columns concerned
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df.dropna(subset=cols, inplace=True)
  • Related