I'd like to modify the col1 of the following dataframe df:
col1 col2
0 Black 7
1 Death 2
2 Hardcore 6
3 Grindcore 1
4 Deathcore 4
...
I want to use a dict named cat_dic={'Black':'B', 'Death':'D', 'Hardcore':'H'}
to get the following dataframe:
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
...
I know I can use df.map or df.replace, for example like this:
df.replace({"col1":cat_dic})
but I want keys that are missing from the dictionary (where a plain lookup would raise a KeyError) to return None. With the previous line, I got this result instead:
col1 col2
0 B 7
1 D 2
2 H 6
3 Grindcore 1
4 Deathcore 4
...
Grindcore and Deathcore are not the only two values in col1 that I want set to None. Do you have any idea how to do this?
CodePudding user response:
Use dict.get:
df['col1'] = df['col1'].map(lambda x: cat_dic.get(x, None))
# Or, equivalently (use one of the two lines, not both), since dict.get already defaults to None:
df['col1'] = df['col1'].map(cat_dic.get)
print(df)
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
Performance comparison on 50k rows:
df = pd.concat([df] * 10000, ignore_index=True)
cat_dic={'Black':'B', 'Death':'D', 'Hardcore':'H'}
In [93]: %timeit df['col1'].map(cat_dic.get)
3.22 ms ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [94]: %timeit df.col1.apply(lambda x: None if x not in cat_dic.keys() else cat_dic[x])
15 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [95]: %timeit df['col1'].replace(dict(dict.fromkeys(df['col1'].unique(), None), **cat_dic))
12.3 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [96]: %timeit df.col1.apply(lambda x: None if x not in cat_dic.keys() else x)
13.8 ms ± 837 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [97]: %timeit df['col1'].map(cat_dic).replace(dict({np.nan: None}))
9.97 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
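The timings above come from IPython's %timeit. Outside IPython, a comparable measurement can be sketched with the standard timeit module. This is only a rough sketch: the sample frame is rebuilt from the question, the labels in candidates are hypothetical names chosen for this example, only two of the variants are timed, and the absolute numbers will differ from machine to machine.

```python
import timeit

import pandas as pd

# Rebuild the question's sample frame, then repeat it to 50k rows.
df = pd.DataFrame({
    "col1": ["Black", "Death", "Hardcore", "Grindcore", "Deathcore"],
    "col2": [7, 2, 6, 1, 4],
})
df = pd.concat([df] * 10000, ignore_index=True)
cat_dic = {"Black": "B", "Death": "D", "Hardcore": "H"}

# Hypothetical labels for this sketch, mapped to the callables being timed.
candidates = {
    "map(dict.get)": lambda: df["col1"].map(cat_dic.get),
    "apply(lambda)": lambda: df["col1"].apply(
        lambda x: cat_dic[x] if x in cat_dic else None
    ),
}

# Average seconds per call over a few runs; numbers depend on the machine.
runs = 5
timings = {
    label: timeit.timeit(func, number=runs) / runs
    for label, func in candidates.items()
}
for label, seconds in timings.items():
    print(f"{label}: {seconds * 1e3:.2f} ms per call")
```

Neither callable assigns back to df, so every run sees the same unmodified input.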
CodePudding user response:
You may use Series.apply first:
df.col1 = df.col1.apply(lambda x: None if x not in cat_dic.keys() else x)
Then you can safely use DataFrame.replace (it returns a new dataframe, so assign the result):
df = df.replace({"col1":cat_dic})
CodePudding user response:
This can be done in one line:
df['col1'] = df.col1.apply(lambda x: None if x not in cat_dic.keys() else cat_dic[x])
Output:
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
CodePudding user response:
Here is an easy one-liner that gives the expected output:
df['col1'] = df['col1'].map(cat_dic).replace(dict({np.nan: None}))
Output:
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
CodePudding user response:
Series.map already maps mismatched keys to NaN:
$ print(df['col1'].map(cat_dic))
0 B
1 D
2 H
3 NaN
4 NaN
Name: col1, dtype: object
Alternatively, you can update your cat_dic with the keys that are missing from the col1 column, mapping each of them to None:
cat_dic = dict(dict.fromkeys(df['col1'].unique(), None), **cat_dic)
df['col1'] = df['col1'].replace(cat_dic)
print(cat_dic)
{'Black': 'B', 'Death': 'D', 'Hardcore': 'H', 'Grindcore': None, 'Deathcore': None}
print(df)
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
CodePudding user response:
In [6]: df.col1.map(cat_dic.get)
Out[6]:
0 B
1 D
2 H
3 None
4 None
dtype: object
You could also use apply; both work. When working on a Series, map is faster, I think.
Explanation:
You can get a default value for missing keys by using dict.get instead of the [..] operator. Unless you pass a second argument, this default value is None. So simply passing the dict.get method to apply/map just works.
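The difference between the two lookups can be seen in plain Python, without pandas. The following is a minimal sketch using the cat_dic from the question; the variable names (raised, known, missing, fallback, codes) are made up for this illustration.

```python
cat_dic = {"Black": "B", "Death": "D", "Hardcore": "H"}

# Subscripting raises KeyError for a key that is not in the dict.
try:
    cat_dic["Grindcore"]
    raised = False
except KeyError:
    raised = True

# dict.get returns None for a missing key unless a second argument is given.
known = cat_dic.get("Black")
missing = cat_dic.get("Grindcore")
fallback = cat_dic.get("Grindcore", "?")

# Calling the bound method once per element, which is what map/apply do with it.
codes = [cat_dic.get(name) for name in ["Black", "Death", "Grindcore"]]

print(raised, known, missing, fallback, codes)
```

Because cat_dic.get is a bound method that takes the key as its first argument, it can be passed directly wherever a one-argument function is expected.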