I have a dataframe df
as shown below. I want replace all the temp_id
column values which are having _
(underscore with another value which combination of numerical part of the temp_id city country column values.
df
temp_id city country
12225IND DELHI IND
14445UX_TY AUSTIN US
56784SIN BEDOK SIN
72312SD_IT_UZ NEW YORK US
47853DUB DUBAI UAE
80976UT_IS_SZ SYDENY AUS
89012TY_JP_IS TOKOYO JPN
51309HJ_IS_IS
42087IND MUMBAI IND
Expected Output
temp_id city country
12225IND DELHI IND
14445AUSTINUS AUSTIN US
56784SIN BEDOK SIN
72312NEWYORKUS NEW YORK US
47853DUB DUBAI UAE
80976SYDENYAUS SYDENY AUS
89012TOKOYOJPN TOKOYO JPN
51309HJ_IS_IS
42087IND MUMBAI IND
How can this be done in pandas python
CodePudding user response:
You can use the str.replace()
method on the temp_id
column and use regular expressions to match the pattern of the values you want to replace. Here is an example:
import re
df['temp_id'] = df['temp_id'].apply(lambda x: re.sub(r'^(\d )_.*', r'\1' df['city'] df['country'], x))
This uses a regular expression to match the pattern of the temp_id
values that you want to replace (in this case, any value that starts with one or more digits followed by an underscore), and replaces them with the matched digits concatenated with the values of the corresponding city and country columns. The result will be the temp_id column with the desired format.
CodePudding user response:
Use boolean indexing:
# find rows with value in country and city
m1 = df[['city', 'country']].notna().all(axis=1)
# find rows with a "_"
m2 = df['temp_id'].str.contains('_')
# both conditions above
m = m1&m2
# replace matching rows by number city country
df.loc[m, 'temp_id'] = (df.loc[m, 'temp_id'].str.extract('^(\d )', expand=False)
df.loc[m, 'city'].str.replace(' ', '') df.loc[m, 'country']
)
Output:
temp_id city country
0 12225IND DELHI IND
1 14445AUSTINUS AUSTIN US
2 56784SIN BEDOK SIN
3 72312NEWYORKUS NEW YORK US
4 47853DUB DUBAI UAE
5 80976SYDENYAUS SYDENY AUS
6 89012TOKOYOJPN TOKOYO JPN
7 51309HJ_IS_IS None None
8 42087IND MUMBAI IND