Well, I have a dataframe with the following structure:
col1 - col2
a italy
b brazil
b wales
c china
d eua
d japan
and I need to transform it into something like this:
col1 - col2
a italy
b brazil, wales
c china
d eua, japan
What's the best way to do this? I'm on Databricks using Spark (PySpark). Thanks!
CodePudding user response:
This is most easily done by creating a dictionary that maps each value in col1 to the list of corresponding values in col2:
col1 = ["a", "b", "b", "c", "d", "d"]
col2 = ["italy", "brazil", "wales", "china", "eua", "japan"]

# Group the values in col2 by their key in col1
data = {}
for key, val in zip(col1, col2):
    if key not in data:
        data[key] = [val]
    else:
        data[key].append(val)

print(data)
This yields: {'a': ['italy'], 'b': ['brazil', 'wales'], 'c': ['china'], 'd': ['eua', 'japan']}
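As an aside, the same grouping can be written a bit more compactly with collections.defaultdict from the standard library:

from collections import defaultdict

data = defaultdict(list)
for key, val in zip(col1, col2):
    data[key].append(val)  # a missing key is created with an empty list automatically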
If you then want your output in the original form (as two separate columns), you can convert the dictionary back into two lists like this:
col1, col2 = list(data.keys()), list(data.values())
print(col1)
print(col2)
Which would yield:
['a', 'b', 'c', 'd']
[['italy'], ['brazil', 'wales'], ['china'], ['eua', 'japan']]
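Note that the desired output has the grouped values as comma-separated strings rather than lists; if you want that exact shape, a minimal sketch that joins each list:

col2_joined = [", ".join(vals) for vals in data.values()]
print(col2_joined)

Which would yield:

['italy', 'brazil, wales', 'china', 'eua, japan']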
CodePudding user response:
df = spark.createDataFrame(
    [
        ('a', 'italy'),
        ('b', 'brazil'),
        ('b', 'wales')
    ],
    schema=['col1', 'col2']
)
df.show(3, False)
+----+------+
|col1|col2  |
+----+------+
|a   |italy |
|b   |brazil|
|b   |wales |
+----+------+
You can use groupby, concat_ws and collect_list:
import pyspark.sql.functions as func

df\
    .groupby('col1')\
    .agg(func.concat_ws(', ', func.collect_list('col2')).alias('col2'))\
    .show(10, False)
+----+-------------+
|col1|col2         |
+----+-------------+
|a   |italy        |
|b   |brazil, wales|
+----+-------------+
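If you prefer SQL, the same aggregation can be expressed in Spark SQL after registering the DataFrame as a temporary view (the view name df here is just illustrative):

df.createOrReplaceTempView('df')
spark.sql("""
    SELECT col1, concat_ws(', ', collect_list(col2)) AS col2
    FROM df
    GROUP BY col1
""").show(10, False)

This produces the same grouped result as the DataFrame API version above.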