Map list of multiple substrings in PySpark


I have a dataframe with one column like this:

Locations
Germany:city_Berlin
France:town_Montpellier
Italy:village_Amalfi

I would like to get rid of the substrings: 'city_', 'town_', 'village_', etc.

So the output should be:

Locations
Germany:Berlin
France:Montpellier
Italy:Amalfi

I can get rid of one of them this way: F.regexp_replace('Locations', 'city_', '')

Is there a similar way to pass several substrings to remove from the original column?

Ideally I'm looking for a one line solution, without having to create separate functions or convoluted things.
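One-line approach (an assumption, not from the original answers below): since `regexp_replace` takes a regular expression, the substrings can be joined into a single alternation pattern, e.g. `F.regexp_replace('Locations', 'city_|town_|village_', '')`. The pattern's behavior can be checked with Python's `re` module, which uses compatible syntax here, so the snippet runs without a Spark session:

```python
import re

# Alternation pattern listing every substring to strip; extend as needed.
pattern = 'city_|town_|village_'

rows = ['Germany:city_Berlin', 'France:town_Montpellier', 'Italy:village_Amalfi']
cleaned = [re.sub(pattern, '', row) for row in rows]
print(cleaned)  # ['Germany:Berlin', 'France:Montpellier', 'Italy:Amalfi']
```

The same pattern string can then be passed unchanged to `F.regexp_replace`.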

CodePudding user response:

I wouldn't map. It looks to me like you want to remove the strings immediately to the right of the colon that end with an underscore. If so, use a regex. Code below:

df.withColumn('new_Locations', F.regexp_replace('Locations', '(?<=:)[a-z_]+', '')).show(truncate=False)


+---+-----------------------+------------------+
|id |Locations              |new_Locations     |
+---+-----------------------+------------------+
|1  |Germany:city_Berlin    |Germany:Berlin    |
|2  |France:town_Montpellier|France:Montpellier|
|4  |Italy:village_Amalfi   |Italy:Amalfi      |
+---+-----------------------+------------------+
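The lookbehind pattern above can be verified with plain Python `re` (the regex semantics are the same, and no Spark session is needed):

```python
import re

# (?<=:)  positive lookbehind: the match must start right after a colon
# [a-z_]+ one or more lowercase letters or underscores, so matching stops
#         at the uppercase first letter of the city name
pattern = r'(?<=:)[a-z_]+'

for s in ['Germany:city_Berlin', 'France:town_Montpellier', 'Italy:village_Amalfi']:
    print(re.sub(pattern, '', s))
# Germany:Berlin
# France:Montpellier
# Italy:Amalfi
```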

CodePudding user response:

F.regexp_replace('Locations', r'(?<=:).*_', '')

.* matches any sequence of characters, but here it is constrained to sit between (?<=:) and _.

_ is the literal character that must follow whatever .* matches.

(?<=:) is the syntax for a "positive lookbehind". It is not part of the match itself, but it requires that a : appear immediately before the text matched by .*_.
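One caveat worth noting (my observation, not from the answer): .* is greedy, so it consumes everything from the colon up to the last underscore. A quick check with Python's `re`:

```python
import re

# Greedy .* between the lookbehind and the literal underscore.
pattern = r'(?<=:).*_'

print(re.sub(pattern, '', 'France:town_Montpellier'))   # France:Montpellier
# With two underscores, everything up to the LAST one is removed:
print(re.sub(pattern, '', 'Italy:old_village_Amalfi'))  # Italy:Amalfi
```

For the sample data this greediness is exactly what you want; if only the first prefix should go, .*? (non-greedy) would be the variant to reach for.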
