How to remove all strings from a given DataFrame column (Machine Learning preprocessing)?

Time:05-02

I need to preprocess a column for machine learning in Python. The column should contain only 1s and 0s (the desired output), but it also contains some strings that need to be removed, e.g. ['PX7', 'D1', etc.].

I thought about using df.replace to replace the strings with np.nan and then using df.dropna() to remove those rows. What is the standard way of doing this, given that it is probably a very common preprocessing task?
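For reference, the replace-then-dropna idea described above might look like the sketch below. The column name 'col' and the sample values are made up for illustration, and the approach assumes every unwanted string is known in advance, which may not hold for real data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 1s and 0s mixed with stray strings.
df = pd.DataFrame({"col": [1, 0, "PX7", 1, "D1", 0]})

# Replace the known unwanted strings with NaN, then drop those rows.
cleaned = df.replace(["PX7", "D1"], np.nan).dropna(subset=["col"])

# Restore an integer dtype once only 0/1 values remain.
cleaned["col"] = cleaned["col"].astype(int)
```

This only works when the list of strings to remove is complete; the answers below avoid having to enumerate them.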

Thank you

CodePudding user response:

Use:

df[df['col'].str.isdigit().fillna(True)]

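Since the original screenshots are not available, here is a small sketch of how that filter behaves. The column name 'col' and the sample values are assumptions. On a mixed column, .str.isdigit() returns NaN for non-string entries (the integers), so fillna(True) keeps them, while strings such as 'PX7' evaluate to False and are dropped.

```python
import pandas as pd

# Hypothetical sample data: integers mixed with stray strings.
df = pd.DataFrame({"col": [1, 0, "PX7", 1, "D1", 0]})

# NaN (non-string entries) -> True, non-digit strings -> False.
# astype(bool) makes the mask an explicit boolean Series.
mask = df["col"].str.isdigit().fillna(True).astype(bool)

filtered = df[mask]
```

Note that a string made only of digits, such as '1', would pass this filter and stay in the column as a string, so a final conversion to a numeric dtype may still be needed.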

CodePudding user response:

You can use:

df2 = df.where(df.isin([0,1]))

Or, convert to numeric to keep all numbers:

df2 = df.apply(pd.to_numeric, errors='coerce')

Then you can call dropna() to drop the rows that became NaN (if needed).
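Both variants can be sketched end to end as follows. The DataFrame contents and the column name 'col' are made up for illustration; the pandas calls are the ones named in this answer.

```python
import pandas as pd

# Hypothetical sample data: integers mixed with stray strings.
df = pd.DataFrame({"col": [1, 0, "PX7", 1, "D1", 0]})

# Variant 1: keep only cells equal to 0 or 1; everything else becomes NaN.
only_binary = df.where(df.isin([0, 1])).dropna()
only_binary["col"] = only_binary["col"].astype(int)

# Variant 2: coerce every cell to a number; non-numeric cells become NaN.
numeric = df.apply(pd.to_numeric, errors="coerce").dropna()
numeric["col"] = numeric["col"].astype(int)
```

The first variant is stricter because it also discards numbers other than 0 and 1, while the second keeps any value that parses as a number.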
