I am trying to make a new column by replacing certain strings.
df = df.withColumn('new_key_location', F.regexp_replace('key_location'), ^'Seoul|Jeju|Busan', 'Others')
I want to change other location names in the key_location
column not containing 'Seoul|Jeju|Busan' strings to 'Others'.
What can I do?
CodePudding user response:
You could just use .isin
or .like
with a when().otherwise()
instead of regexp_replace
.
reqd_cities = ['Seoul', 'Jeju', 'Busan']
data_sdf. \
withColumn('new_key_location',
func.when(func.col('key_location').isin(reqd_cities), func.col('key_location').
otherwise(func.lit('Others'))
)
)
OR
data_sdf. \
withColumn('new_key_location',
func.when(func.col('key_location').like('%Seoul%') |
func.col('key_location').like('%Jeju%') |
func.col('key_location').like('%Busan%'), func.col('key_location')).
otherwise(func.lit('Others'))
)
CodePudding user response:
It might be more straightforward to use df.apply
:
important_cities = ['Seoul', 'Jeju', 'Busan']
df["new_key_location"] = df["key_location"].apply(lambda x: x if x in important_cities else "Others")
but not matching regex should be
'^(?!(Seoul|Jeju|Busan))'