dtypes showing all objects, filter on str

Time:03-12

I have a df that has str, int and timestamp columns. However, when I ask for dtypes, everything other than the timestamp columns comes back as object, like below:

ABC                         datetime64[ns]
BCD                         datetime64[ns]
CDE                                 object
DEF                                 object
EFG                                 object
FGH                                 object

I read somewhere that this happens when a string column has missing values: pandas stores the missing entries as float NaN, so the column falls back to object dtype. I am able to see the underlying types when I do the below:

t = {col: set(d_session.loc[~pd.isnull(d_session[col]), col].apply(type).drop_duplicates())
     for col in d_session}
t

I get the result as below:

{'ABC': {pandas._libs.tslibs.timestamps.Timestamp},
 'BCD': {pandas._libs.tslibs.timestamps.Timestamp},
 'CDE': {str},
 'DEF': {str},
 'EFG': {str},
 'FGH': {str}}
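(For context, the behaviour described above can be reproduced on a toy frame; the column names here are made up for illustration:)

```python
import pandas as pd
import numpy as np

# A string column with a missing value and a timestamp column.
d = pd.DataFrame({
    "name": ["a", np.nan, "c"],  # NaN is a float, so the column is dtype object
    "when": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
})
print(d.dtypes)  # name -> object, when -> datetime64[ns]

# Same per-column type inspection as above, ignoring nulls
t = {col: set(d.loc[~pd.isnull(d[col]), col].apply(type)) for col in d}
print(t)  # {'name': {str}, 'when': {Timestamp}}
```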

My task now is to find a way to filter out all columns other than str and create a new df. To be clear, I want only the str columns in this new df. But I am not able to use the above knowledge for filtering. I tried the below:

for i in t.keys():
    print(t[i] == "{<class 'str'>}")

I get:

False
False
False
False
False
False

Could someone help me with how I can do this? I am not picky about the approach, so it doesn't have to be a modification of the above. Thanks a lot for your help.

CodePudding user response:

You can use select_dtypes to keep only the object (string) dtype columns:

df.select_dtypes(include=object)
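A quick sketch on toy data (names are made up); since the goal is a df with only the str columns, `include=object` selects them:

```python
import pandas as pd
import numpy as np

# One datetime column and two object (string) columns.
df = pd.DataFrame({
    "when": pd.to_datetime(["2021-03-01", "2021-03-02"]),
    "city": ["Oslo", np.nan],  # NaN still leaves the column as object dtype
    "name": ["a", "b"],
})
str_df = df.select_dtypes(include=object)
print(list(str_df.columns))  # ['city', 'name']
```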

If you have some columns that contain data like list or dict, you can run convert_dtypes first, as suggested by @mozway, so that text columns get the nullable string dtype:

df.convert_dtypes().select_dtypes(include='string')
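A small sketch of that two-step approach (toy data, made-up names); note that selecting the str columns means including the inferred 'string' dtype:

```python
import pandas as pd

df = pd.DataFrame({
    "when": pd.to_datetime(["2021-03-01"]),
    "txt": ["hello"],
    "num": [1],
})
# convert_dtypes() infers the nullable 'string' dtype for text columns,
# after which select_dtypes can target 'string' directly.
str_df = df.convert_dtypes().select_dtypes(include="string")
print(list(str_df.columns))  # ['txt']
```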

From your idea, you can use:

d_session.loc[:, d_session.apply(lambda x: isinstance(x.dropna().iloc[0], str))]
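A runnable sketch of this column mask on toy data (names made up), plus a note on why the comparison in the question always printed False:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "when": pd.to_datetime(["2021-03-01", "2021-03-02"]),
    "txt": ["a", np.nan],
})
# apply over columns: keep a column if its first non-null value is a str.
# (Note: an all-NaN column would raise IndexError on .iloc[0].)
mask = df.apply(lambda x: isinstance(x.dropna().iloc[0], str))
str_df = df.loc[:, mask]
print(list(str_df.columns))  # ['txt']

# The loop in the question compared a set against the string
# "{<class 'str'>}"; comparing against the set {str} works:
t = {col: set(df[col].dropna().apply(type)) for col in df}
print([col for col in t if t[col] == {str}])  # ['txt']
```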

CodePudding user response:

To get the list of columns with string values, you can use the following:

str_cols = [x for x in df.columns if type(df[x].loc[df[x].first_valid_index()])==str]

This code will ignore the missing values and check the data type. You can just change the condition to get the columns having values other than str.
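For reference, a self-contained sketch of this approach on toy data (column names are made up):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "when": pd.to_datetime(["2021-03-01", "2021-03-02"]),
    "txt": [np.nan, "b"],  # leading NaN is skipped by first_valid_index
})
# first_valid_index() returns the index label of the first non-null entry,
# so the type check never sees a missing value.
str_cols = [x for x in df.columns
            if type(df[x].loc[df[x].first_valid_index()]) == str]
print(str_cols)  # ['txt']
new_df = df[str_cols]
```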
