How to select only string (non-numeric) columns when there are mixed type columns?


Suppose I have a data frame with three columns with dtypes float, int, and object:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4
})

I need to replace the np.nan with None, which is an object (since None becomes NULL when imported into PostgreSQL).

df.replace({np.nan: None}, inplace=True)

I think (correct me if I'm wrong) None cannot be stored in a NumPy/pandas array except one with dtype object, so 'col1' above becomes an object column after the replace. Now, if I want to subset only the string columns (which in this case should be just 'col3'), I can no longer use df.select_dtypes(include=object), because it returns all object-dtype columns, including 'col1'. I've been working around this with this hacky solution:

# Select only object columns, which includes 'col1'
(df.select_dtypes(include=object)
   # Hack: after this, 'col1' becomes float again since None becomes np.nan
   .apply(lambda col: col.apply(lambda val: val))
   # Now select only the object columns
   .select_dtypes(include=object))

I'm wondering if there are idiomatic (or less hacky) ways to accomplish this. The use case arose because I need to get the string columns from a data frame whose numeric (float or int) columns have missing values represented by None rather than np.nan.

CodePudding user response:

Based on your sample df, you can do something like this:

After replacing np.nan with None, col1 becomes an object column:

In [1413]: df.dtypes
Out[1413]: 
col1    object
col2     int64
col3    object
dtype: object

To pick the columns that contain only strings, you can use pd.to_numeric with errors='coerce' and check whether the resulting column is all NaN using isna:

In [1416]: cols = df.select_dtypes('object').columns.tolist()
In [1422]: cols
Out[1422]: ['col1', 'col3']

In [1424]: for i in cols:
      ...:     if pd.to_numeric(df[i], errors='coerce').isna().all():
      ...:         print(f'{i}: String col')
      ...:     else:
      ...:         print(f'{i}: number col')
      ...: 
col1: number col
col3: String col
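
If you need the actual subset rather than the printout, the same check can be wrapped in a list comprehension (a minimal sketch; str_cols is just an illustrative name):

# keep only the object columns whose values do not parse as numbers
str_cols = [c for c in df.select_dtypes('object').columns
            if pd.to_numeric(df[c], errors='coerce').isna().all()]

df[str_cols]  # only 'col3' for this sample df

Note that a column of strings that all happen to parse as numbers (e.g. '123') would be classified as numeric by this check.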

CodePudding user response:

Reverse your two operations:

  1. Extract the object columns and process them.
  2. Convert NaN to None just before exporting to PostgreSQL.

>>> df.dtypes
col1    float64
col2      int64
col3     object
dtype: object

# Step 1: process string columns
>>> df.update(df.select_dtypes('object').agg(lambda x: x.str.upper()))

# Step 2: replace nan by None
>>> df.replace({np.nan: None}, inplace=True)

>>> df
   col1  col2                   col3
0   1.0     3  THIS IS A TEXT COLUMN
1   2.0     4  THIS IS A TEXT COLUMN
2  None     5  THIS IS A TEXT COLUMN
3   5.0     4  THIS IS A TEXT COLUMN
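
So if the goal is only to subset the string columns, take that subset while the numeric columns still have their float/int dtypes, and leave the NaN-to-None replacement as the last step before writing to PostgreSQL (a minimal sketch; str_df is just an illustrative name):

# select_dtypes returns a copy, so str_df contains only 'col3' here
str_df = df.select_dtypes(include='object')

# convert missing values to None only for the export step
df = df.replace({np.nan: None})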