Home > Software design >  Replace array values based on dictionary
Replace array values based on dictionary

Time:10-19

I want to create a column matching the values in an array column with a dict.

My dict:

{
  'Paris': 'France',
  'Amsterdam': 'Netherlands',
  'Lisboa': 'Portugal',
  'London': 'United Kingdom',
  'Madrid': 'Spain',
  'Berlin': 'Germany',
  'Vienna': 'Austria'
}

My df:

First_col     Cities         
First_value   ['Paris' , 'Lisboa', 'Barcelona']        
Second_value  ['Amsterdam', 'Madrid']    
Thrid_value   ['Lisboa', 'London', 'Amsterdam', 'Milan', 'Prague']
Fourth_value  ['Rome', 'Lyon', 'Vienna']

Expected df:

First_col     Cities                                               Country
First_value   ['Paris' , 'Lisboa', 'Barcelona']                    ['France', 'Portugal', None]
Second_value  ['Amsterdam', 'Madrid']                              ['Netherlands', 'Spain']
Thrid_value   ['Lisboa', 'London', 'Amsterdam', 'Milan', 'Prague'] ['Portugal', 'United Kingdom', 'Netherlands', None, 'Vienna']
Fourth_value  ['Rome', 'Lyon', 'Oslo']                             ['Italy', None, None]

CodePudding user response:

You could create a map column from dictionary using create_map and then use transform to change every existent array's value with the new one:

Inputs:

from pyspark.sql import functions as F
d = {'Paris': 'France', 'Amsterdam': 'Netherlands', 'Lisboa': 'Portugal', 'London': 'United Kingdom', 'Madrid': 'Spain', 'Berlin': 'Germany', 'Vienna': 'Austria'}
df = spark.createDataFrame(
    [('First_value', ['Paris' , 'Lisboa', 'Barcelona']),
     ('Second_value', ['Amsterdam', 'Madrid']),
     ('Thrid_value', ['Lisboa', 'London', 'Amsterdam', 'Milan', 'Prague']),
     ('Fourth_value', ['Rome', 'Lyon', 'Vienna'])],
    ['First_col', 'Cities'])

Script:

map_col = F.create_map([F.lit(x) for i in d.items() for x in i])
df = df.withColumn("Country", F.transform('Cities', lambda x: map_col[x]))

df.show(truncate=0)
#  ------------ ------------------------------------------ --------------------------------------------------- 
# |First_col   |Cities                                    |Country                                            |
#  ------------ ------------------------------------------ --------------------------------------------------- 
# |First_value |[Paris, Lisboa, Barcelona]                |[France, Portugal, null]                           |
# |Second_value|[Amsterdam, Madrid]                       |[Netherlands, Spain]                               |
# |Thrid_value |[Lisboa, London, Amsterdam, Milan, Prague]|[Portugal, United Kingdom, Netherlands, null, null]|
# |Fourth_value|[Rome, Lyon, Vienna]                      |[null, null, Austria]                              |
#  ------------ ------------------------------------------ --------------------------------------------------- 
  • Related