Remap values in pandas column with a dict, None if KeyError

I'd like to modify the col1 of the following dataframe df:

        col1        col2
0       Black       7
1       Death       2
2       Hardcore    6
3       Grindcore   1
4       Deathcore   4
...

I want to use a dict named cat_dic={'Black':'B', 'Death':'D', 'Hardcore':'H'} to get the following dataframe:

        col1        col2
0       B           7
1       D           2
2       H           6
3       None        1
4       None        4
...

I know I can use df.map or df.replace, for example like this:

df.replace({"col1":cat_dic})

but I want values that are missing from the dictionary (where a plain lookup would raise a KeyError) to become None. With the line above, I get this result instead:

        col1        col2
0       B           7
1       D           2
2       H           6
3       Grindcore   1
4       Deathcore   4
...

Given that Grindcore and Deathcore are not the only two values in col1 that I want set to None, do you have any idea how to do this?

CodePudding user response：

Use dict.get:

df['col1'] = df['col1'].map(lambda x: cat_dic.get(x, None))
# or shorter, because the default of dict.get is already None
# (run only one of these two lines, not both in sequence)
df['col1'] = df['col1'].map(cat_dic.get)

print (df)
   col1  col2
0     B     7
1     D     2
2     H     6
3  None     1
4  None     4
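
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The data is copied from the question. Building the column with dtype=object is my own assumption, made so that None is stored as-is instead of being converted to another missing-value marker by a dedicated string dtype.

```python
import pandas as pd

# Sample data from the question; dtype=object is an assumption made here
# so that None values are kept unchanged in the column.
df = pd.DataFrame({
    "col1": pd.Series(
        ["Black", "Death", "Hardcore", "Grindcore", "Deathcore"], dtype=object
    ),
    "col2": [7, 2, 6, 1, 4],
})
cat_dic = {"Black": "B", "Death": "D", "Hardcore": "H"}

# dict.get returns None for any value that is not a key of cat_dic
df["col1"] = df["col1"].map(cat_dic.get)

print(df)
```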

Performance comparison on 50k rows:

df = pd.concat([df] * 10000, ignore_index=True)
cat_dic={'Black':'B', 'Death':'D', 'Hardcore':'H'}

In [93]: %timeit df['col1'].map(cat_dic.get)
3.22 ms ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [94]: %timeit df.col1.apply(lambda x: None if x not in cat_dic.keys() else cat_dic[x])
15 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [95]: %timeit df['col1'].replace(dict(dict.fromkeys(df['col1'].unique(), None), **cat_dic))
12.3 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [96]: %timeit df.col1.apply(lambda x: None if x not in cat_dic.keys() else x)
13.8 ms ± 837 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [97]: %timeit df['col1'].map(cat_dic).replace(dict({np.nan: None}))
9.97 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

CodePudding user response：

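You can use Series.apply first, to set every value that is not a key of cat_dic to None:

df.col1 = df.col1.apply(lambda x: None if x not in cat_dic.keys() else x)

Then you can safely use DataFrame.replace. It returns a new dataframe, so assign the result:

df = df.replace({"col1":cat_dic})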

CodePudding user response：

This can be done in one line:

df1['col1'] = df1.col1.apply(lambda x: None if x not in cat_dic.keys() else cat_dic[x])

Output is:

   col1  col2
0     B     7
1     D     2
2     H     6
3  None     1
4  None     4

CodePudding user response：

Here is a simple one-liner that gives the expected output (np is numpy):

df['col1'] = df['col1'].map(cat_dic).replace({np.nan: None})

Output :

   col1  col2
0     B     7
1     D     2
2     H     6
3  None     1
4  None     4

CodePudding user response：

Series.map already maps values that have no matching key to NaN:

print(df['col1'].map(cat_dic))

0      B
1      D
2      H
3    NaN
4    NaN
Name: col1, dtype: object
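
To make the NaN versus None difference concrete, here is a small sketch that compares mapping with the dict itself and mapping with its get method. The variable names by_dict and by_get are mine, and the object dtype is again an assumption so that None survives in the result.

```python
import pandas as pd

col1 = pd.Series(
    ["Black", "Death", "Hardcore", "Grindcore", "Deathcore"], dtype=object
)
cat_dic = {"Black": "B", "Death": "D", "Hardcore": "H"}

by_dict = col1.map(cat_dic)      # unmatched values become NaN
by_get = col1.map(cat_dic.get)   # unmatched values come from dict.get, i.e. None

# Both NaN and None count as missing, so null checks agree on the two results
print(by_dict.isna().tolist())   # [False, False, False, True, True]
print(by_get.isna().tolist())    # [False, False, False, True, True]
```

If the only requirement is "missing", either form should do. The distinction matters mainly when downstream code checks `is None` explicitly.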

Alternatively, you can extend cat_dic with the values of col1 that it does not contain yet, each mapped to None:

cat_dic = dict(dict.fromkeys(df['col1'].unique(), None), **cat_dic)
df['col1'] = df['col1'].replace(cat_dic)

print(cat_dic)

{'Black': 'B', 'Death': 'D', 'Hardcore': 'H', 'Grindcore': None, 'Deathcore': None}

print(df)

   col1  col2
0     B     7
1     D     2
2     H     6
3  None     1
4  None     4
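
The dictionary construction above packs two steps into one expression. Here is a plain-Python sketch of what happens, with a hard-coded list (col1_values, a name I made up) standing in for df['col1'].unique():

```python
cat_dic = {"Black": "B", "Death": "D", "Hardcore": "H"}

# Stand-in for df['col1'].unique(); hypothetical name used for illustration
col1_values = ["Black", "Death", "Hardcore", "Grindcore", "Deathcore"]

# Step 1: every value seen in the column gets None as its default
defaults = dict.fromkeys(col1_values, None)

# Step 2: the entries of cat_dic override the None defaults
full_dic = dict(defaults, **cat_dic)

print(full_dic)
# {'Black': 'B', 'Death': 'D', 'Hardcore': 'H', 'Grindcore': None, 'Deathcore': None}
```

Note that the `**cat_dic` form only works when all keys are strings. For other key types, `{**defaults, **cat_dic}` does the same merge.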

CodePudding user response：

In [6]: df.col1.map(cat_dic.get)
Out[6]: 
0       B
1       D
2       H
3    None
4    None
dtype: object

You could also use apply; both work. When working on a Series, I think map is faster.

Explanation:

You can get a default value for missing keys by using dict.get instead of the [] operator. Unless you pass a second argument, that default is None. So simply passing the bound method cat_dic.get to apply/map just works.
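
A quick plain-Python illustration of that difference, with no pandas involved. The name lookup is just an illustrative variable.

```python
cat_dic = {"Black": "B", "Death": "D", "Hardcore": "H"}

print(cat_dic.get("Black"))           # B
print(cat_dic.get("Grindcore"))       # None (the default when no second argument is given)
print(cat_dic.get("Grindcore", "?"))  # ? (an explicit default)

# The [] operator raises instead of returning a default
try:
    cat_dic["Grindcore"]
except KeyError as exc:
    print("KeyError:", exc)           # KeyError: 'Grindcore'

# cat_dic.get is a bound method, so it can be passed around like any function
lookup = cat_dic.get
print([lookup(v) for v in ["Death", "Deathcore"]])  # ['D', None]
```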