Suppose I have a data frame with three columns with dtypes object, int, and float:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4
})
I need to replace the np.nan with None, which is an object (since None becomes NULL when imported into PostgreSQL).
df.replace({np.nan: None}, inplace=True)
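Checking the dtypes right after this replace shows the side effect on 'col1':

print(df.dtypes)
# col1    object  <- was float64 before the replace
# col2     int64
# col3    object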
I think (correct me if I'm wrong) None cannot be used in any NumPy/Pandas array except for arrays with dtype object. And so 'col1' above becomes an object column after replace. Now, if I wanted to subset only the string columns (which in this case should only be 'col3'), I can no longer use df.select_dtypes(include=object), which returns all object dtype columns, including 'col1'. I've been working around this with this hacky solution:
# Select only object columns, which includes 'col1'
(df.select_dtypes(include=object)
   # Hack: after this, 'col1' becomes float again since None becomes np.nan
   .apply(lambda col: col.apply(lambda val: val))
   # Now select only the object columns
   .select_dtypes(include=object))
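The middle step works because the elementwise apply forces pandas to re-infer each column's dtype, so the None values in 'col1' turn back into np.nan (a quick check):

obj_cols = df.select_dtypes(include=object)
reinferred = obj_cols.apply(lambda col: col.apply(lambda val: val))
print(reinferred.dtypes)
# col1    float64  <- None became np.nan again during re-inference
# col3     object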
I'm wondering if there are idiomatic (or less hacky) ways to accomplish this. The use case really arose since I need to get the string columns from a data frame where there are numeric (float or int) columns with missing values represented by None rather than np.nan.
CodePudding user response:
Based on your sample df, you can do something like this:
After replacing np.nan with None, col1 becomes an object column:
In [1413]: df.dtypes
Out[1413]:
col1 object
col2 int64
col3 object
dtype: object
To pick the columns that contain only strings, you can use pd.to_numeric with errors='coerce' and check whether the coerced values are all NaN using isna:
In [1416]: cols = df.select_dtypes('object').columns.tolist()
In [1422]: cols
Out[1422]: ['col1', 'col3']
In [1424]: for i in cols:
...: if pd.to_numeric(df[i], errors='coerce').isna().all():
...: print(f'{i}: String col')
...: else:
...: print(f'{i}: number col')
...:
col1: number col
col3: String col
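If you want the string columns as a subset rather than just printed labels, the same check condenses into a selection (a small sketch of the same idea):

string_cols = [c for c in cols if pd.to_numeric(df[c], errors='coerce').isna().all()]
df[string_cols]  # keeps only 'col3'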
CodePudding user response:
Reverse your 2 operations:

- Extract object columns and process them.
- Convert NaN to None before export to pgsql.
>>> df.dtypes
col1 float64
col2 int64
col3 object
dtype: object
# Step 1: process string columns
>>> df.update(df.select_dtypes('object').agg(lambda x: x.str.upper()))
# Step 2: replace NaN with None
>>> df.replace({np.nan: None}, inplace=True)
>>> df
col1 col2 col3
0 1.0 3 THIS IS A TEXT COLUMN
1 2.0 4 THIS IS A TEXT COLUMN
2 None 5 THIS IS A TEXT COLUMN
3 5.0 4 THIS IS A TEXT COLUMN
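After step 2, col1 is back to object dtype (the None values force it, as in your question), but it no longer matters: the string processing in step 1 ran while col1 was still cleanly float64:

>>> df.dtypes
col1    object
col2     int64
col3    object
dtype: object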