I have a dataframe with one column like this:
Locations |
---|
Germany:city_Berlin |
France:town_Montpellier |
Italy:village_Amalfi |
I would like to get rid of the substrings: 'city_', 'town_', 'village_', etc.
So the output should be:
Locations |
---|
Germany:Berlin |
France:Montpellier |
Italy:Amalfi |
I can get rid of one of them this way:
F.regexp_replace('Locations', 'city_', '')
Is there a similar way to pass several substrings to remove from the original column?
Ideally I'm looking for a one-line solution, without having to create separate functions or anything convoluted.
CodePudding user response:
I wouldn't map. It looks to me like you want to remove the substring immediately to the right of : when it ends with _. If so, use a regex. Code below:
df.withColumn('new_Locations', regexp_replace('Locations', r'(?<=:)[a-z_]+', '')).show(truncate=False)
+---+-----------------------+------------------+
|id |Locations              |new_Locations     |
+---+-----------------------+------------------+
|1  |Germany:city_Berlin    |Germany:Berlin    |
|2  |France:town_Montpellier|France:Montpellier|
|4  |Italy:village_Amalfi   |Italy:Amalfi      |
+---+-----------------------+------------------+
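To sanity-check this pattern without a Spark session, here is a minimal sketch using Python's built-in re module as a stand-in for Spark's Java regex engine. It assumes both engines treat this particular pattern the same way, and the helper name strip_prefix is hypothetical, not part of any library.

```python
import re

# Same pattern as in the answer above: one or more lowercase letters or
# underscores, but only when they start right after a ':'.
PATTERN = re.compile(r'(?<=:)[a-z_]+')


def strip_prefix(value: str) -> str:
    """Hypothetical helper: drop the lowercase/underscore run that follows ':'."""
    return PATTERN.sub('', value)


samples = [
    'Germany:city_Berlin',
    'France:town_Montpellier',
    'Italy:village_Amalfi',
]

cleaned = [strip_prefix(s) for s in samples]
print(cleaned)
```

Note that the character class stops at the first character that is not a lowercase letter or underscore, so it relies on the place name starting with an uppercase letter. A value such as 'Spain:city_madrid' would lose the whole name and come out as 'Spain:'.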
CodePudding user response:
F.regexp_replace('Locations', r'(?<=:).*_', '')
.* matches any sequence of characters. Here it sits between (?<=:) and _.
_ is the literal character that must follow whatever .* matches.
(?<=:) is the syntax for a "positive lookbehind". It is not part of the match, but it ensures that a : comes right before whatever .*_ matches.
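As a rough comparison, the sketch below again uses Python's re module in place of Spark (an assumption: the patterns shown behave the same in Spark's Java regex engine). It contrasts the lookbehind pattern with an explicit alternation of the substrings from the question. The value 'USA:city_New_York' is made up for illustration and does not come from the original data.

```python
import re

samples = [
    'Germany:city_Berlin',
    'France:town_Montpellier',
    'Italy:village_Amalfi',
]

# Explicit alternation: remove only the listed substrings.
explicit = [re.sub(r'city_|town_|village_', '', s) for s in samples]

# Lookbehind pattern from the answer: remove everything between ':'
# and the LAST underscore, because .* is greedy.
greedy = [re.sub(r'(?<=:).*_', '', s) for s in samples]

# Made-up value with an underscore inside the place name.
tricky = 'USA:city_New_York'
greedy_tricky = re.sub(r'(?<=:).*_', '', tricky)    # greedy: eats 'city_New_'
lazy_tricky = re.sub(r'(?<=:).*?_', '', tricky)     # lazy: stops at first '_'
explicit_tricky = re.sub(r'city_|town_|village_', '', tricky)

print(explicit)
print(greedy)
print(greedy_tricky, lazy_tricky, explicit_tricky)
```

If the set of substrings is known in advance, the alternation can be passed to the same one-liner from the question, for example F.regexp_replace('Locations', 'city_|town_|village_', ''). When names may themselves contain underscores, the lazy form .*? or the explicit alternation avoids removing part of the name.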