Let's say I have a DataFrame like the following:

| id | col 1 | col 2 |
|----|-------|-------|
| 1  | "A,B" | "A,D" |
| 2  | "A,B" | "X,Y" |
| 3  | "B,D" | "B,D" |

I need the output to be:

| id | merged (col 1 and col 2) |
|----|--------------------------|
| 1  | "A,B,D"                  |
| 2  | "A,B,X,Y"                |
| 3  | "B,D"                    |
CodePudding user response:
Split the columns `col1` and `col2` to get arrays, union them with `array_union`, and finally use `array_join` to get a string back:
```python
import pyspark.sql.functions as F

df = spark.createDataFrame([
    (1, "A,B", "A,D"),
    (2, "A,B", "X,Y"),
    (3, "B,D", "B,D")
], ["id", "col1", "col2"])

df1 = df.select(
    "id",
    F.array_join(
        F.array_union(F.split("col1", ","), F.split("col2", ",")),
        ","
    ).alias("merged_col1_col2")
)

df1.show()
# +---+----------------+
# | id|merged_col1_col2|
# +---+----------------+
# |  1|           A,B,D|
# |  2|         A,B,X,Y|
# |  3|             B,D|
# +---+----------------+
```
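If you want to sanity-check the logic without a Spark session, the same split/union/join pipeline can be sketched in plain Python. Note that `array_union` deduplicates while keeping elements in first-appearance order rather than sorting them (`merge_csv` below is an illustrative helper, not part of PySpark):

```python
def merge_csv(col1: str, col2: str) -> str:
    # Split both comma-separated strings, then union with
    # first-appearance order preserved (mirrors array_union's dedup).
    items = col1.split(",") + col2.split(",")
    seen = []
    for x in items:
        if x not in seen:
            seen.append(x)
    return ",".join(seen)

print(merge_csv("A,B", "A,D"))  # A,B,D
print(merge_csv("A,B", "X,Y"))  # A,B,X,Y
print(merge_csv("B,D", "B,D"))  # B,D
```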
CodePudding user response:
You can define a helper function to handle the merging, then use `apply` to run it over the two columns row by row, like this:
```python
import pandas as pd

def h(a, b):
    a = a.split(",")
    b = b.split(",")
    u = a + b
    return ",".join(sorted(set(u)))

df = pd.DataFrame({
    "col 1": ["A,B", "A,B", "B,D"],
    "col 2": ["A,D", "X,Y", "B,D"]
})
df["merged"] = df.apply(lambda row: h(row["col 1"], row["col 2"]), axis=1)
df.head()
```
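As an alternative to a row-wise `apply`, the same merge can be written as a list comprehension over the zipped columns, which is often faster on large frames (a sketch, assuming the same comma-separated string format as above):

```python
import pandas as pd

df = pd.DataFrame({
    "col 1": ["A,B", "A,B", "B,D"],
    "col 2": ["A,D", "X,Y", "B,D"]
})

# Zip the two columns and merge each pair of comma-separated strings;
# sorted(set(...)) deduplicates, matching the apply-based answer above.
df["merged"] = [
    ",".join(sorted(set(a.split(",") + b.split(","))))
    for a, b in zip(df["col 1"], df["col 2"])
]
print(df["merged"].tolist())  # ['A,B,D', 'A,B,X,Y', 'B,D']
```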