Well, I have a dataframe with the following structure:
col1 - col2
a italy
b brazil
b wales
c china
d eua
d japan
and I need to transform it into something like this:
col1 - col2
a italy
b brazil, wales
c china
d eua, japan
What's the best way to do this? I'm on Databricks using Spark (PySpark). Thanks!
CodePudding user response:
This is most easily done by creating a dictionary that maps each value in col1 to the list of corresponding values in col2:
col1 = ["a", "b", "b", "c", "d", "d"]
col2 = ["italy", "brazil", "wales", "china", "eua", "japan"]

# Group the values in col2 by their key in col1
data = {}
for key, val in zip(col1, col2):
    if key not in data:
        data[key] = [val]
    else:
        data[key].append(val)

print(data)
This yields: {'a': ['italy'], 'b': ['brazil', 'wales'], 'c': ['china'], 'd': ['eua', 'japan']}
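As an aside, the same grouping can be written a bit more compactly with collections.defaultdict from the standard library:

from collections import defaultdict

data = defaultdict(list)
for key, val in zip(col1, col2):
    data[key].append(val)  # a missing key is created with an empty list automatically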
If you then want your output in the original form (as two separate columns), you can convert the dictionary back into two lists like this:
col1, col2 = list(data.keys()), list(data.values())
print(col1)
print(col2)
Which would yield:
['a', 'b', 'c', 'd']
[['italy'], ['brazil', 'wales'], ['china'], ['eua', 'japan']]
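Note that the desired output has the grouped values as comma-separated strings rather than lists; if you want that exact shape, a minimal sketch that joins each list:

col2_joined = [", ".join(vals) for vals in data.values()]
print(col2_joined)

Which would yield:

['italy', 'brazil, wales', 'china', 'eua, japan']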
CodePudding user response:
df = spark.createDataFrame(
    [
        ('a', 'italy'),
        ('b', 'brazil'),
        ('b', 'wales')
    ],
    schema=['col1', 'col2']
)
df.show(3, False)
+----+------+
|col1|col2  |
+----+------+
|a   |italy |
|b   |brazil|
|b   |wales |
+----+------+
You can use groupby, concat_ws and collect_list:
import pyspark.sql.functions as func

df\
    .groupby('col1')\
    .agg(func.concat_ws(', ', func.collect_list('col2')).alias('col2'))\
    .show(10, False)
+----+-------------+
|col1|col2         |
+----+-------------+
|a   |italy        |
|b   |brazil, wales|
+----+-------------+
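If you prefer SQL, the same aggregation can be expressed in Spark SQL after registering the DataFrame as a temporary view (the view name df here is just illustrative):

df.createOrReplaceTempView('df')
spark.sql("""
    SELECT col1, concat_ws(', ', collect_list(col2)) AS col2
    FROM df
    GROUP BY col1
""").show(10, False)

This produces the same grouped result as the DataFrame API version above.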