I have the following Pandas DataFrame, with city and arr columns:
city arr final_target
paris 11 paris_11
paris 12 paris_12
dallas 22 dallas
miami 15 miami
paris 16 paris_16
My goal is to fill the final_target column with the city name joined to the arr number (for example paris_11) when the city is paris, and with just the city name otherwise.
What is the most pythonic way to do this?
CodePudding user response:
What is the most pythonic way to do this?
That depends on the definition. If "most pythonic" means the most common and fastest way, then the np.where solution is the best fit here.
Use numpy.where, or one of the pure pandas alternatives if you prefer. All of these solutions are vectorized, so they should be preferable to apply (which loops under the hood):
df['final_target'] = np.where(df['city'].eq('paris'),
                              df['city'] + '_' + df['arr'].astype(str),
                              df['city'])
Pandas alternatives:
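# mask: replace the values where the condition is True
df['final_target'] = df['city'].mask(df['city'].eq('paris'),
                                     df['city'] + '_' + df['arr'].astype(str))
# where: keep the values where the condition is True, replace the rest
df['final_target'] = df['city'].where(df['city'].ne('paris'),
                                      df['city'] + '_' + df['arr'].astype(str))
print(df)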
city arr final_target
0 paris 11 paris_11
1 paris 12 paris_12
2 dallas 22 dallas
3 miami 15 miami
4 paris 16 paris_16
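For anyone who wants to run the vectorized variants end to end, here is a minimal self-contained sketch. It assumes the five-row frame from the question; the names is_paris and joined are hypothetical helpers introduced only for readability. It does nothing more than check that the three spellings agree.

```python
import numpy as np
import pandas as pd

# Sample frame from the question (final_target is the column to compute).
df = pd.DataFrame({
    "city": ["paris", "paris", "dallas", "miami", "paris"],
    "arr": [11, 12, 22, 15, 16],
})

is_paris = df["city"].eq("paris")                  # boolean mask (hypothetical name)
joined = df["city"] + "_" + df["arr"].astype(str)  # city_arr for every row

# Three equivalent vectorized spellings of "joined where paris, else city".
via_np_where = np.where(is_paris, joined, df["city"])
via_mask = df["city"].mask(is_paris, joined)
via_where = df["city"].where(~is_paris, joined)

df["final_target"] = via_np_where
print(df)
```

Building the mask and the joined column once keeps each variant to a single readable line.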
Performance:
#50k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [157]: %%timeit
...: df['final_target'] = np.where(df['city'].eq('paris'),
...:                               df['city'] + '_' + df['arr'].astype(str),
...:                               df['city'])
...:
48.6 ms ± 444 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [158]: %%timeit
...: df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
...:
...:
49.2 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [159]: %%timeit
...: df['final_target'] = df['city']
...: df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)
...:
63.8 ms ± 764 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [160]: %%timeit
...: df['final_target'] = df.apply(lambda x: x.city + '_' + str(x.arr) if x.city == 'paris' else x.city, axis=1)
...:
...:
1.33 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
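The %%timeit cells above only run inside IPython. As a rough sketch for a plain Python script, the standard timeit module can compare the same two approaches. The helper names with_np_where and with_apply are hypothetical, the frame is deliberately smaller (5k rows) so the sketch finishes quickly, and the absolute numbers will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    "city": ["paris", "paris", "dallas", "miami", "paris"],
    "arr": [11, 12, 22, 15, 16],
})
# 5k rows here (smaller than the 50k above) so the sketch runs fast.
df = pd.concat([base] * 1000, ignore_index=True)


def with_np_where(frame):
    """Vectorized variant: one pass over whole columns."""
    return np.where(frame["city"].eq("paris"),
                    frame["city"] + "_" + frame["arr"].astype(str),
                    frame["city"])


def with_apply(frame):
    """Row-wise variant: calls the lambda once per row."""
    return frame.apply(
        lambda x: x.city + "_" + str(x.arr) if x.city == "paris" else x.city,
        axis=1,
    )


# Average seconds per call over a few repetitions.
runs = 5
t_where = timeit.timeit(lambda: with_np_where(df), number=runs) / runs
t_apply = timeit.timeit(lambda: with_apply(df), number=runs) / runs
print(f"np.where: {t_where * 1e3:.2f} ms per call")
print(f"apply:    {t_apply * 1e3:.2f} ms per call")
```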
CodePudding user response:
A one-liner does the trick:
df['final_target'] = df.apply(lambda x: x.city + '_' + str(x.arr) if x.city == 'paris' else x.city, axis=1)
CodePudding user response:
Try these two neat, short lines with loc:
df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)
This solution first assigns df['city'] to the final_target column, then appends the arr value, separated by an underscore, on the rows where city is paris.
IMO this is probably the most Pythonic and neat way here.
print(df)
city arr final_target
0 paris 11 paris_11
1 paris 12 paris_12
2 dallas 22 dallas
3 miami 15 miami
4 paris 16 paris_16
CodePudding user response:
Pretty self-explanatory, one line, looks pythonic:
df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
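It may help to spell out why the multiplication works. In plain Python a bool is an int, so True * s gives s back and False * s gives the empty string. The sketch below shows only that plain-Python rule. That pandas applies it element by element to these columns is an assumption about how it handles string columns, not something stated in this answer.

```python
suffix = "_11"

print(repr(True * suffix))   # '_11'
print(repr(False * suffix))  # ''

# Row by row, the one-liner therefore amounts to:
rows = [("paris", 11), ("dallas", 22)]
result = [city + (city == "paris") * ("_" + str(arr)) for city, arr in rows]
print(result)  # ['paris_11', 'dallas']
```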
import io
import numpy as np
import pandas as pd

s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""
# 1,000,000 rows drawn with replacement from the five sample rows
df = pd.read_csv(io.StringIO(s)).sample(1000000, replace=True)
df
Speeds
%%timeit
df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
# 877 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['final_target'] = np.where(df['city'].eq('paris'),
                              df['city'] + '_' + df['arr'].astype(str),
                              df['city'])
# 874 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I am not sure why this example fails when the others do not, given the same input:
%%timeit
df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] += '_' + df['arr'].astype(str)
MemoryError: Unable to allocate 892. GiB for an array with shape (119671145392,) and data type int64
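A likely cause, offered as a hedged guess rather than a confirmed diagnosis: sample(..., replace=True) keeps the original row labels, so the index holds only the labels 0 to 4, each repeated roughly 200,000 times. The loc update adds two Series whose indexes differ, and aligning on a duplicated label pairs every left row with every right row that carries the same label. Three paris labels times roughly 200,000 squared is about 1.2e11 elements, which is in line with the shape in the error. The toy example below reproduces the blow-up on a small scale and then shows that a fresh unique index avoids it; the frame and variable names are made up for illustration.

```python
import pandas as pd

left = pd.Series([1, 2], index=[0, 0])            # label 0 appears twice
right = pd.Series([10, 20, 30], index=[0, 0, 0])  # label 0 appears three times

# Alignment on a duplicated label yields every left/right pairing.
total = left + right
print(len(total))  # 6 = 2 * 3

base = pd.DataFrame({"city": ["paris", "dallas"], "arr": [11, 22]})
# Stand-in for sample(6, replace=True): repeated labels 0 and 1.
sampled = base.loc[[0, 1, 0, 0, 1, 0]]

# A fresh unique index keeps the loc update strictly row by row.
df = sampled.reset_index(drop=True)
df["final_target"] = df["city"]
df.loc[df["city"] == "paris", "final_target"] += "_" + df["arr"].astype(str)
print(df)
```

If that guess is right, adding .reset_index(drop=True) after the sample call in the setup should let the loc version run on the 1M-row frame as well.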