Suppose I have a data frame with three columns with dtypes object, int, and float:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4
})
I need to replace the np.nan with None, which is an object (since None becomes NULL when imported into PostgreSQL).
df.replace({np.nan: None}, inplace=True)
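Checking the dtypes right after this replace shows the side effect on 'col1':

print(df.dtypes)
# col1    object  <- was float64 before the replace
# col2     int64
# col3    object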
I think (correct me if I'm wrong) None cannot be used in any NumPy/Pandas array except for arrays with dtype object. And so 'col1' above becomes an object column after replace. Now, if I wanted to subset only the string columns (which in this case should only be 'col3'), I can no longer use df.select_dtypes(include=object), which returns all object dtype columns, including 'col1'. I've been working around this with this hacky solution:
# Select only object columns, which includes 'col1'
(df.select_dtypes(include=object)
   # Hack: after this, 'col1' becomes float again since None becomes np.nan
   .apply(lambda col: col.apply(lambda val: val))
   # Now select only the object columns
   .select_dtypes(include=object))
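The middle step works because the elementwise apply forces pandas to re-infer each column's dtype, so the None values in 'col1' turn back into np.nan (a quick check):

obj_cols = df.select_dtypes(include=object)
reinferred = obj_cols.apply(lambda col: col.apply(lambda val: val))
print(reinferred.dtypes)
# col1    float64  <- None became np.nan again during re-inference
# col3     object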
I'm wondering if there are idiomatic (or less hacky) ways to accomplish this. The use case really arose since I need to get the string columns from a data frame where there are numeric (float or int) columns with missing values represented by None rather than np.nan.
CodePudding user response:
Based on your sample df, you can do something like this:
After replacing np.nan with None, col1 becomes an object column:
In [1413]: df.dtypes
Out[1413]:
col1 object
col2 int64
col3 object
dtype: object
To pick the columns that contain only strings, you can use pd.to_numeric with errors='coerce' and check whether the coerced values are all NaN using isna:
In [1416]: cols = df.select_dtypes('object').columns.tolist()
In [1422]: cols
Out[1422]: ['col1', 'col3']
In [1424]: for i in cols:
...: if pd.to_numeric(df[i], errors='coerce').isna().all():
...: print(f'{i}: String col')
...: else:
...: print(f'{i}: number col')
...:
col1: number col
col3: String col
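If you want the string columns as a subset rather than just printed labels, the same check condenses into a selection (a small sketch of the same idea):

string_cols = [c for c in cols if pd.to_numeric(df[c], errors='coerce').isna().all()]
df[string_cols]  # keeps only 'col3'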
CodePudding user response:
Reverse your 2 operations:

- Extract object columns and process them.
- Convert NaN to None before export to pgsql.
>>> df.dtypes
col1 float64
col2 int64
col3 object
dtype: object
# Step 1: process string columns
>>> df.update(df.select_dtypes('object').agg(lambda x: x.str.upper()))
# Step 2: replace NaN with None
>>> df.replace({np.nan: None}, inplace=True)
>>> df
col1 col2 col3
0 1.0 3 THIS IS A TEXT COLUMN
1 2.0 4 THIS IS A TEXT COLUMN
2 None 5 THIS IS A TEXT COLUMN
3 5.0 4 THIS IS A TEXT COLUMN
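After step 2, col1 is back to object dtype (the None values force it, as in your question), but it no longer matters: the string processing in step 1 ran while col1 was still cleanly float64:

>>> df.dtypes
col1    object
col2     int64
col3    object
dtype: object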