A faster way to replace multiple string values in a column in Pandas


I am trying to replace multiple string values in a column, and I understand that I can use replace() to do it one value at a time. Given that I need to replace more than 10 string values, I am wondering whether there is a faster way to replace a number of different string values with the same value.

import pandas as pd

df = pd.DataFrame({'a': ["US", "Japan", "UK", "China", "Peru", "Germany"]})
df.replace({'a' : { 'Japan' : 'Germany', 'UK' : 'Germany', 'China' : 'Germany' }})

Expected output:

         a
0       US
1  Germany
2  Germany
3  Germany
4     Peru
5  Germany
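
For a longer list of values, the mapping itself does not have to be written out pair by pair; dict.fromkeys can build it from a plain list (a minimal sketch using the frame above):

# Build {old_value: 'Germany'} for every key in the list instead of
# typing each pair by hand.
to_replace = dict.fromkeys(['Japan', 'UK', 'China'], 'Germany')
df.replace({'a': to_replace})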

CodePudding user response:

Use numpy.where with Series.isin:

import numpy as np
import pandas as pd

# 60k rows
df = pd.DataFrame({'a': ["US", "Japan", "UK", "China", "Peru", "Germany"] * 10000})

In [161]: %timeit df['a'] = df.a.map({ 'Japan' : 'Germany', 'UK' : 'Germany', 'China' : 'Germany' }).fillna(df.a)
12.4 ms ± 501 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [162]: %timeit df['a'] = np.where(df.a.isin(['Japan','UK','China']), 'Germany', df.a)
4.27 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   

# assigning the result back raised an error in this test, so only the expression is timed
In [1632]: %timeit df.replace({'a' : { 'Japan' : 'Germany', 'UK' : 'Germany', 'China' : 'Germany' }})
7.85 ms ± 462 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Slower solution:

In [157]: %timeit df.replace('Japan|UK|China', 'Germany', regex=True)
218 ms ± 842 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
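
For reference, a small self-contained sketch of the np.where + Series.isin approach on the sample frame from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ["US", "Japan", "UK", "China", "Peru", "Germany"]})

# Where the value is one of the listed countries, write 'Germany';
# otherwise keep the original value.
df['a'] = np.where(df['a'].isin(['Japan', 'UK', 'China']), 'Germany', df['a'])

print(df)
#          a
# 0       US
# 1  Germany
# 2  Germany
# 3  Germany
# 4     Peru
# 5  Germany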

CodePudding user response:

Use:

df = df.replace('Japan|UK|China', 'Germany', regex=True)
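
Note that with regex=True the pattern matches anywhere inside each string, so a value such as 'North China' would become 'North Germany'. If only exact cell values should be replaced, the pattern can be anchored, for example:

# Anchored pattern: only cells equal to Japan, UK or China are replaced;
# substrings such as 'North China' are left untouched.
df = df.replace('^(Japan|UK|China)$', 'Germany', regex=True)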