Let's say I have a boolean column stored as a category in a pandas.DataFrame. But there's a twist - the underlying values are str, not bool. I.e., the values are "True"/"False", not True/False.
How do I:
- change the dtype of the underlying category values (e.g. from "True" to True), and
- continue storing the field as a category?
Having the boolean values as strings is an issue with DataFrame.query, for example. I have to specify DataFrame.query("field == 'True'"), which is pretty horrendous lol.
FYI - I don't want to do DataFrame.astype(dict(field=bool)), because then I lose the memory efficiency from category. I want to keep the category dtype.
CodePudding user response:
Maybe you can try:
df['field'] = df['field'].replace({'True': True, 'False': False})
print(df['field'])
# Output
0 False
1 True
2 True
3 False
Name: field, dtype: category
Categories (2, object): [False, True] # <- bool
With query:
>>> df.query('field == True')
field
1 True
2 True
Setup:
df = pd.DataFrame({'field': ['False', 'True', 'True', 'False']}, dtype='category')
print(df['field'])
# Output
0 False
1 True
2 True
3 False
Name: field, dtype: category
Categories (2, object): ['False', 'True'] # <- str
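An alternative worth trying, since newer pandas releases (2.2, as far as I know) warn about calling replace on a categorical column: relabel the categories directly with Series.cat.rename_categories. This is a minimal sketch using the same setup as above; the name mask is only there for illustration.

```python
import pandas as pd

# Same setup as in the answer above: string labels stored as a category
df = pd.DataFrame({"field": ["False", "True", "True", "False"]}, dtype="category")

# Relabel the categories themselves; the per-row integer codes are untouched
df["field"] = df["field"].cat.rename_categories({"True": True, "False": False})

print(df["field"].dtype)                    # still a category dtype
print(df["field"].cat.categories.tolist())  # the labels are now real booleans

# Comparisons against a real bool now work without quoting
mask = df["field"] == True
print(df[mask])
```

Because only the two category labels are rewritten, this should stay cheap on large frames, and the column can then be filtered with df.query('field == True') as shown in the answer.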
CodePudding user response:
You could try this (the values can be used as bools, but the dtype is still reported as category):
import pandas as pd
# before
data = ['True', 'False', 'True']
df = pd.DataFrame({'data': data}).astype("category")
print('[BEFORE] \n data type = {0} \n values : {1}'.format(df['data'].dtypes, df.values))
# after
# bool('False') is True (any non-empty string is truthy), so compare against 'True' instead
df['data'] = [value == 'True' for value in df['data']]
df = df.astype("category")
print('[AFTER] \n data type = {0} \n values : {1}'.format(df['data'].dtypes, df.values))
output:
[BEFORE]
data type = category
values : [['True']
['False']
['True']]
[AFTER]
data type = category
values : [[True]
[False]
[True]]
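On the memory point from the question, it may be worth measuring before ruling out a plain bool column. The rough sketch below compares the three representations with Series.memory_usage; the row count n and the variable names are made up for illustration, and exact byte counts will vary by pandas version.

```python
import pandas as pd

n = 100_000  # arbitrary size, chosen only for illustration
raw = ["True", "False"] * (n // 2)

as_str = pd.Series(raw, dtype=object)   # plain Python strings
as_cat = as_str.astype("category").cat.rename_categories(
    {"True": True, "False": False}      # category with boolean labels
)
as_bool = as_str == "True"              # plain bool column

mem = {
    "str": as_str.memory_usage(index=False, deep=True),
    "category": as_cat.memory_usage(index=False, deep=True),
    "bool": as_bool.memory_usage(index=False, deep=True),
}
for name, size in mem.items():
    print(f"{name:>8}: {size} bytes")
```

A two-value category stores one small integer code per row, and a bool column stores one byte per row, so the two should come out close. The big saving is relative to the string column.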