regexp_replace / not contains certain strings-CodePudding

I am trying to make a new column by replacing certain strings.

df = df.withColumn('new_key_location', F.regexp_replace('key_location'), ^'Seoul|Jeju|Busan', 'Others')

I want to change other location names in the key_location column not containing 'Seoul|Jeju|Busan' strings to 'Others'.

What can I do?

CodePudding user response：

You could just use .isin or .like with a when().otherwise() instead of regexp_replace.

reqd_cities = ['Seoul', 'Jeju', 'Busan']

data_sdf. \
    withColumn('new_key_location', 
               func.when(func.col('key_location').isin(reqd_cities), func.col('key_location').
                         otherwise(func.lit('Others'))
                         )
               )

data_sdf. \
    withColumn('new_key_location', 
               func.when(func.col('key_location').like('%Seoul%') | 
                         func.col('key_location').like('%Jeju%') |
                         func.col('key_location').like('%Busan%'), func.col('key_location')).
               otherwise(func.lit('Others'))
               )

CodePudding user response：

It might be more straightforward to use df.apply:

important_cities = ['Seoul', 'Jeju', 'Busan']
df["new_key_location"] = df["key_location"].apply(lambda x: x if x in important_cities else "Others")

but not matching regex should be

'^(?!(Seoul|Jeju|Busan))'