I have a df that has str, int and timestamp columns. However, when I ask for dtypes, everything other than the timestamp columns comes back as object, like below:
ABC datetime64[ns]
BCD datetime64[ns]
CDE object
DEF object
EFG object
FGH object
I read somewhere that this happens when there are missing values in a string column: pandas stores the missing values as float NaN, so the whole column ends up with object dtype. I am able to see the underlying types when I do the below:
t = {col: set(d_session.loc[~pd.isnull(d_session[col]), col]
              .apply(type).drop_duplicates())
     for col in d_session}
t
I get the result as below:
{'ABC': {pandas._libs.tslibs.timestamps.Timestamp},
'BCD': {pandas._libs.tslibs.timestamps.Timestamp},
'CDE': {str},
'DEF': {str},
'EFG': {str},
'FGH': {str}}
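For context, a minimal example with made-up data reproduces this behaviour:
import numpy as np
import pandas as pd
s = pd.Series(['a', 'b', np.nan])
print(s.dtype)     # object -- the column falls back to object dtype
print(type(s[2]))  # <class 'float'> -- the NaN itself is a float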
My task now is to find a way to filter out all columns other than str and create a new df. To be clear, I want only the str columns in this new df. But I am not able to use the above knowledge for filtering. I tried the below:
for i in t.keys():
    print(t[i] == "{<class 'str'>}")
I get:
False
False
False
False
False
False
Could someone help me with how I can do this? I am not picky about the approach, so it doesn't have to be a modification of the above. Thanks a lot for your help.
CodePudding user response:
You can use select_dtypes
to keep only the object (string) dtype columns:
df.select_dtypes(include=object)
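For example, a quick sketch reusing the column names from your question on made-up data:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ABC': pd.to_datetime(['2021-01-01', '2021-01-02']),
                   'CDE': ['x', np.nan],
                   'DEF': ['y', 'z']})
print(df.select_dtypes(include=object).columns.tolist())  # ['CDE', 'DEF']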
If some of your object columns contain data like list or dict (which also show up as object), you can use convert_dtypes
first, as suggested by @mozway, so the real string columns get the dedicated 'string' dtype:
df.convert_dtypes().select_dtypes(include='string')
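As a sketch of why this helps, assume a hypothetical extra column LST holding lists; it has object dtype too, but does not become 'string' after conversion:
import pandas as pd
df = pd.DataFrame({'DEF': ['y', 'z'],
                   'LST': [[1, 2], [3]]})  # hypothetical list column
print(df.select_dtypes(include=object).columns.tolist())  # ['DEF', 'LST']
print(df.convert_dtypes().select_dtypes(include='string').columns.tolist())  # ['DEF']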
From your idea, you can also select the columns whose first non-null value is a str (this assumes every column has at least one non-null value, otherwise iloc[0] raises IndexError):
d_session.loc[:, d_session.apply(lambda x: isinstance(x.dropna().iloc[0], str))]
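Incidentally, your loop prints False because it compares a set of types to a string; comparing against the set {str} works with your t dict:
str_cols = [col for col, types in t.items() if types == {str}]
new_df = d_session[str_cols]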
CodePudding user response:
To get the list of columns with string values, you can use the following:
str_cols = [x for x in df.columns if isinstance(df[x].loc[df[x].first_valid_index()], str)]
first_valid_index skips the missing values, so only the type of an actual value is checked. You can invert the condition to get the columns holding values other than str.
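For example, to build the new df from that list, or to invert the condition for the non-str columns:
new_df = df[str_cols]
non_str_cols = [x for x in df.columns
                if not isinstance(df[x].loc[df[x].first_valid_index()], str)]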