How to replace a part of column value with values from another two columns based on a condition in p-CodePudding

I have a dataframe df as shown below. I want replace all the temp_idcolumn values which are having _(underscore with another value which combination of numerical part of the temp_id city country column values.

    temp_id          city    country
    12225IND         DELHI    IND
    14445UX_TY       AUSTIN   US
    56784SIN         BEDOK    SIN
    72312SD_IT_UZ    NEW YORK US
    47853DUB         DUBAI    UAE
    80976UT_IS_SZ    SYDENY   AUS
    89012TY_JP_IS    TOKOYO   JPN
    51309HJ_IS_IS
    42087IND          MUMBAI  IND

Expected Output

temp_id          city    country
12225IND         DELHI    IND
14445AUSTINUS    AUSTIN   US
56784SIN         BEDOK    SIN
72312NEWYORKUS   NEW YORK US
47853DUB         DUBAI    UAE
80976SYDENYAUS   SYDENY   AUS
89012TOKOYOJPN   TOKOYO   JPN
51309HJ_IS_IS
42087IND          MUMBAI  IND

How can this be done in pandas python

CodePudding user response：

You can use the str.replace() method on the temp_id column and use regular expressions to match the pattern of the values you want to replace. Here is an example:

import re

df['temp_id'] = df['temp_id'].apply(lambda x: re.sub(r'^(\d )_.*', r'\1' df['city'] df['country'], x))

This uses a regular expression to match the pattern of the temp_id values that you want to replace (in this case, any value that starts with one or more digits followed by an underscore), and replaces them with the matched digits concatenated with the values of the corresponding city and country columns. The result will be the temp_id column with the desired format.

CodePudding user response：

Use boolean indexing:

# find rows with value in country and city
m1 = df[['city', 'country']].notna().all(axis=1)
# find rows with a "_"
m2 = df['temp_id'].str.contains('_')

# both conditions above
m = m1&m2

# replace matching rows by number   city   country
df.loc[m, 'temp_id'] = (df.loc[m, 'temp_id'].str.extract('^(\d )', expand=False)
                         df.loc[m, 'city'].str.replace(' ', '') df.loc[m, 'country']
                       )

Output:

          temp_id      city country
0        12225IND     DELHI     IND
1   14445AUSTINUS    AUSTIN      US
2        56784SIN     BEDOK     SIN
3  72312NEWYORKUS  NEW YORK      US
4        47853DUB     DUBAI     UAE
5  80976SYDENYAUS    SYDENY     AUS
6  89012TOKOYOJPN    TOKOYO     JPN
7   51309HJ_IS_IS      None    None
8        42087IND    MUMBAI     IND