I want to create a column matching the values in an array column with a dict.
My dict:
{
'Paris': 'France',
'Amsterdam': 'Netherlands',
'Lisboa': 'Portugal',
'London': 'United Kingdom',
'Madrid': 'Spain',
'Berlin': 'Germany',
'Vienna': 'Austria'
}
My df:
First_col Cities
First_value ['Paris' , 'Lisboa', 'Barcelona']
Second_value ['Amsterdam', 'Madrid']
Thrid_value ['Lisboa', 'London', 'Amsterdam', 'Milan', 'Prague']
Fourth_value ['Rome', 'Lyon', 'Vienna']
Expected df:
First_col Cities Country
First_value ['Paris' , 'Lisboa', 'Barcelona'] ['France', 'Portugal', None]
Second_value ['Amsterdam', 'Madrid'] ['Netherlands', 'Spain']
Thrid_value ['Lisboa', 'London', 'Amsterdam', 'Milan', 'Prague'] ['Portugal', 'United Kingdom', 'Netherlands', None, 'Vienna']
Fourth_value ['Rome', 'Lyon', 'Oslo'] ['Italy', None, None]
CodePudding user response:
You could create a map column from dictionary using create_map
and then use transform
to change every existent array's value with the new one:
Inputs:
from pyspark.sql import functions as F
d = {'Paris': 'France', 'Amsterdam': 'Netherlands', 'Lisboa': 'Portugal', 'London': 'United Kingdom', 'Madrid': 'Spain', 'Berlin': 'Germany', 'Vienna': 'Austria'}
df = spark.createDataFrame(
[('First_value', ['Paris' , 'Lisboa', 'Barcelona']),
('Second_value', ['Amsterdam', 'Madrid']),
('Thrid_value', ['Lisboa', 'London', 'Amsterdam', 'Milan', 'Prague']),
('Fourth_value', ['Rome', 'Lyon', 'Vienna'])],
['First_col', 'Cities'])
Script:
map_col = F.create_map([F.lit(x) for i in d.items() for x in i])
df = df.withColumn("Country", F.transform('Cities', lambda x: map_col[x]))
df.show(truncate=0)
# ------------ ------------------------------------------ ---------------------------------------------------
# |First_col |Cities |Country |
# ------------ ------------------------------------------ ---------------------------------------------------
# |First_value |[Paris, Lisboa, Barcelona] |[France, Portugal, null] |
# |Second_value|[Amsterdam, Madrid] |[Netherlands, Spain] |
# |Thrid_value |[Lisboa, London, Amsterdam, Milan, Prague]|[Portugal, United Kingdom, Netherlands, null, null]|
# |Fourth_value|[Rome, Lyon, Vienna] |[null, null, Austria] |
# ------------ ------------------------------------------ ---------------------------------------------------