How to drop columns from a pandas DataFrame that have elements containing a string?

Time:06-08

This is not about dropping columns whose name contains a string.

I have a DataFrame with 1600 columns. Several hundred are garbage. Most of the garbage columns contain a phrase such as invalid value encountered in double_scalars (XYZ), where 'XYZ' is a placeholder for the column name.

I would like to delete all columns that contain the string 'invalid' in any of their elements.

Purging columns that contain strings in general would work too. What I want is to clean up the data so I can fit a machine learning model to it, so removing any/all columns that are not boolean or real-valued would work.

This must be a duplicate question, but I can only find answers to how to remove a column with a specific column name.

CodePudding user response:

Use apply to build a mask that checks whether each column contains 'invalid', then pass the inverted mask as the column indexer (the second position) of .loc:

df = df.loc[:, ~df.apply(lambda col: col.astype(str).str.contains('invalid')).any()]
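To see what the one-liner above does step by step, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration, and regex=False is an added assumption so that the search string is treated as a literal rather than a regular expression.

```python
import pandas as pd

# Hypothetical toy data; names and values are invented for illustration.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [0.5, "invalid value encountered in double_scalars (b)", 1.5],
    "c": [True, False, True],
})

# Boolean DataFrame: True where a cell, viewed as a string, contains "invalid".
hits = df.apply(lambda col: col.astype(str).str.contains("invalid", regex=False))

# Collapse to one boolean per column: True if any cell in that column matched.
bad_columns = hits.any()

# Keep only the columns with no matches.
cleaned = df.loc[:, ~bad_columns]

print(list(cleaned.columns))
```

Converting every column with astype(str) keeps the check uniform across numeric, boolean, and object columns, at the cost of a full string conversion of the frame.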

CodePudding user response:

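You can use df.select_dtypes(include=[float, bool]) to keep only the float and boolean columns, or df.select_dtypes(exclude=['object']) to drop the object-dtype columns (which is where columns holding strings end up).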
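As a rough illustration of the dtype-based approach, the sketch below uses a made-up DataFrame in which one column mixes floats with an error string and therefore has object dtype. The column names are hypothetical. The "number" selector is an addition not mentioned in the answer, shown because include=[float, bool] also drops integer columns.

```python
import pandas as pd

# Hypothetical toy data; names and values are invented for illustration.
df = pd.DataFrame({
    "x": [0.1, 0.2, 0.3],          # float64
    "flag": [True, False, True],   # bool
    "count": [1, 2, 3],            # int64
    "junk": [0.5, "invalid value encountered in double_scalars (junk)", 1.5],  # object
})

# Keep only float and boolean columns (this drops the integer column too).
float_bool_only = df.select_dtypes(include=[float, bool])

# Keep every numeric dtype plus booleans (assumed variant, not from the answer).
numeric_and_bool = df.select_dtypes(include=["number", "bool"])

print(list(float_bool_only.columns))
print(list(numeric_and_bool.columns))
```

This only works when the garbage strings have forced the whole column to a non-numeric dtype. It does not inspect individual cell values the way the mask-based approach does.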

Link to docs https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html
