Home > Mobile >  filter rows failing dtype conversion
filter rows failing dtype conversion

Time:10-06

I have a pandas dataframe with lot of columns. The dtype of all columns is object because some columns have strings in value. Is there a way to filter out rows into a different dataframe where value in any column is a string and then convert the cleaned dataframe to integer dtype.

I figured out the second part but not able to achieve the first part - filtering out rows if value contains string character like 'a', 'b' etc. for eg. if df is:

df = pd.DataFrame({
    'col1':[1,2,'a',0,3],
    'col2':[1,2,3,4,5],
    'col3':[1,2,3,'45a5',4]
    })

This should become 2 dataframes

df = pd.DataFrame({
    'col1':[1,2,3],
    'col2':[1,2,5],
    'col3':[1,2,4]
    })

dfError = pd.DataFrame({
    'col1':['a',0],
    'col2':[3,4],
    'col3':[3,'45a5']
    })

CodePudding user response:

Don't know if there's a performant way to check this. But a dirty way (might be slow) could be:

str_cond = df.applymap(lambda x: isinstance(x, str)).any(1)

df[~str_cond]
  col1  col2 col3
0    1     1    1
1    2     2    2
4    3     5    4

df[str_cond]
  col1  col2  col3
2    a     3     3
3    0     4  45a5

CodePudding user response:

I believe this to be an efficient way to do it.

import pandas as pd

df = pd.DataFrame({ # main dataframe
    'col1':[1,2,'a',0,3],
    'col2':[1,2,3,4,5],
    'col3':[1,2,3,'45a5',4]
    }) 

mask = df.apply(pd.to_numeric, errors='coerce').isna() # checks if couldn't be numeric
mask = mask.any(1) # check rows that couldn't be numeric

df1 = df[~mask] # could be numeric
df2 = df[mask]  # couldn't be numeric

Breaking it down:

df.apply(pd.to_numeric) # converts the dataframe into numeric, but this would give us an error for the string elements (like 'a')

df.apply(pd.to_numeric, errors='coerce') # 'coerce' sets any non-valid element to NaN (converts the string elements to NaN).

>>>
   col1  col2  col3
0   1.0     1   1.0
1   2.0     2   2.0
2   NaN     3   3.0
3   0.0     4   NaN
4   3.0     5   4.0

mask.isna() # Detect missing values.
>>>
    col1   col2   col3
1  False  False  False
2   True  False  False
3  False  False   True
4  False  False  False

mask.any(1) # Returns whether any element is True along the rows
>>>
0    False
1    False
2     True
3     True
4    False
  • Related