I have a pandas DataFrame with a lot of columns. The dtype of every column is object because some columns contain string values. Is there a way to filter out rows where the value in any column is a string into a separate DataFrame, and then convert the cleaned DataFrame to integer dtype?
I figured out the second part but am not able to achieve the first part - filtering out rows whose values contain string characters like 'a', 'b', etc. For example, if df is:
df = pd.DataFrame({
'col1':[1,2,'a',0,3],
'col2':[1,2,3,4,5],
'col3':[1,2,3,'45a5',4]
})
This should be split into two DataFrames:
df = pd.DataFrame({
'col1':[1,2,3],
'col2':[1,2,5],
'col3':[1,2,4]
})
dfError = pd.DataFrame({
'col1':['a',0],
'col2':[3,4],
'col3':[3,'45a5']
})
CodePudding user response:
I don't know if there's a performant way to check this, but a quick-and-dirty way (which might be slow) could be:
str_cond = df.applymap(lambda x: isinstance(x, str)).any(axis=1)  # True for rows containing any string value
df[~str_cond]
col1 col2 col3
0 1 1 1
1 2 2 2
4 3 5 4
df[str_cond]
col1 col2 col3
2 a 3 3
3 0 4 45a5
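Put together as a minimal self-contained sketch (the astype(int) step at the end is one possible way to handle the "convert the cleaned dataframe to integer dtype" part of the question; adjust to your data as needed):
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 'a', 0, 3],
    'col2': [1, 2, 3, 4, 5],
    'col3': [1, 2, 3, '45a5', 4]
})

# True for any row that contains at least one string value
str_cond = df.applymap(lambda x: isinstance(x, str)).any(axis=1)

df_clean = df[~str_cond].astype(int)  # rows with only numeric values, cast to int
df_error = df[str_cond]               # rows containing strings, kept as object dtype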
CodePudding user response:
I believe this to be an efficient way to do it.
import pandas as pd
df = pd.DataFrame({ # main dataframe
'col1':[1,2,'a',0,3],
'col2':[1,2,3,4,5],
'col3':[1,2,3,'45a5',4]
})
mask = df.apply(pd.to_numeric, errors='coerce').isna() # True where a value couldn't be converted to a number
mask = mask.any(axis=1) # True for rows with any non-numeric value
df1 = df[~mask] # could be numeric
df2 = df[mask] # couldn't be numeric
Breaking it down:
df.apply(pd.to_numeric) # converts the dataframe into numeric, but this would give us an error for the string elements (like 'a')
df.apply(pd.to_numeric, errors='coerce') # 'coerce' sets any non-valid element to NaN (converts the string elements to NaN).
>>>
col1 col2 col3
0 1.0 1 1.0
1 2.0 2 2.0
2 NaN 3 3.0
3 0.0 4 NaN
4 3.0 5 4.0
df.apply(pd.to_numeric, errors='coerce').isna() # Detect missing values (the coerced NaNs)
>>>
col1 col2 col3
0 False False False
1 False False False
2 True False False
3 False False True
4 False False False
mask.any(axis=1) # Returns whether any element is True along the rows
>>>
0 False
1 False
2 True
3 True
4 False
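To finish the "convert the cleaned dataframe to integer dtype" part of the question, one way (a sketch, assuming all remaining values are whole numbers) is to cast the clean subset after splitting:
df1 = df[~mask].astype(int)  # cleaned rows, cast from object to integer dtype
df2 = df[mask]               # rows that contained non-numeric values, left as object

print(df1.dtypes)  # each column should now show an integer dtype (e.g. int64 on most platforms)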