Converting dtypes in a messy pandas DataFrame?

I have a big DataFrame and want to convert its columns to the appropriate dtypes. The problem is that several numeric columns contain strings. I know about convert_dtypes and to_numeric. The problem with the former is that it doesn't infer a column as int/float as soon as there are strings in it. to_numeric, on the other hand, has errors="coerce", which turns all invalid values into NaN. The problem with to_numeric is that several columns really are string columns, so I can't just run it on all columns.

So I am looking for a function that converts a column to a numeric dtype if a certain percentage of its values are numeric. It would be great if the threshold could be set.

As mentioned, the dataset is large, so I would prefer a solution that handles all the columns automatically.

CodePudding user response：

Use a custom function with DataFrame.apply: it converts each column to numeric with errors='coerce' and returns the numeric column if the share of valid numbers exceeds the threshold, otherwise the original column:

print (df)
   a  b  c  d  e
0  1  5  4  3  8
1  7  8  9  f  9
2  c  c  g  g  4
3  4  t  r  e  4

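import pandas as pd

def f(x, thresh):
    # Values that cannot be parsed as numbers become NaN
    y = pd.to_numeric(x, errors='coerce')
    # Keep the numeric column only if the share of parsed values exceeds thresh
    return y if y.notna().mean() > thresh else x

thresh = 0.7
df1 = df.apply(f, args=(thresh,))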
print (df1)
     a  b  c  d  e
0  1.0  5  4  3  8
1  7.0  8  9  f  9
2  NaN  c  g  g  4
3  4.0  t  r  e  4

print (df1.dtypes)
a    float64
b     object
c     object
d     object
e      int64
dtype: object
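
For anyone who wants to reproduce the output above, here is a minimal self-contained sketch. It assumes the sample values are stored as strings in object columns, as they typically are after a messy import; the DataFrame construction is my assumption, and only the function f comes from the answer.

```python
import pandas as pd

# Sample data from the answer, stored as strings in object columns (assumption).
df = pd.DataFrame(
    {
        "a": ["1", "7", "c", "4"],
        "b": ["5", "8", "c", "t"],
        "c": ["4", "9", "g", "r"],
        "d": ["3", "f", "g", "e"],
        "e": ["8", "9", "4", "4"],
    },
    dtype=object,
)

def f(x, thresh):
    # Values that cannot be parsed as numbers become NaN.
    y = pd.to_numeric(x, errors="coerce")
    # Keep the numeric column only if the share of parsed values exceeds thresh.
    return y if y.notna().mean() > thresh else x

df1 = df.apply(f, args=(0.7,))
print(df1.dtypes)
```

Column a is 75% numeric, so it is converted and its one invalid value becomes NaN. Columns b, c and d fall below the 0.7 threshold and are returned unchanged.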

Modified solution for data that already contains missing values, which are counted as valid so they do not lower the numeric share:

print (df)
   a  b    c  d  e
0  1  5    4  3  8
1  7  8  NaN  f  9
2  c  c  NaN  g  4
3  4  t    r  e  4

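def f(x, thresh):
    # Values that cannot be parsed as numbers become NaN
    y = pd.to_numeric(x, errors='coerce')
    # Count parsed numbers and originally missing values as valid
    return y if (y.notna() | x.isna()).mean() > thresh else x

thresh = 0.7
df1 = df.apply(f, args=(thresh,))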
print (df1)
     a  b    c  d  e
0  1.0  5  4.0  3  8
1  7.0  8  NaN  f  9
2  NaN  c  NaN  g  4
3  4.0  t  NaN  e  4

print (df1.dtypes)
a    float64
b     object
c    float64
d     object
e      int64
dtype: object
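
To see why the extra x.isna() term matters, the following sketch computes the per-column share both ways on two of the sample columns. The helper numeric_share and its ignore_missing flag are hypothetical names introduced only for this illustration. The data is again assumed to be strings in object columns.

```python
import numpy as np
import pandas as pd

# Two columns from the second example, with pre-existing missing values in c.
df = pd.DataFrame(
    {
        "a": ["1", "7", "c", "4"],
        "c": ["4", np.nan, np.nan, "r"],
    },
    dtype=object,
)

def numeric_share(x, ignore_missing):
    # Hypothetical helper: fraction of values that count as valid numbers.
    y = pd.to_numeric(x, errors="coerce")
    ok = y.notna()
    if ignore_missing:
        # Treat values that were already missing as valid.
        ok = ok | x.isna()
    return ok.mean()

# Extra keyword arguments to DataFrame.apply are passed on to the function.
strict = df.apply(numeric_share, ignore_missing=False)
lenient = df.apply(numeric_share, ignore_missing=True)
print(strict)
print(lenient)
```

With the strict count, column c scores 0.25 and would stay as object under a 0.7 threshold. With missing values counted as valid it scores 0.75 and is converted. Printing these shares for the real data may also help when picking a threshold.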