Let's say I have a DataFrame like the following:

| id | col 1 | col 2 |
|----|-------|-------|
| 1  | "A,B" | "A,D" |
| 2  | "A,B" | "X,Y" |
| 3  | "B,D" | "B,D" |

I need the output to be:

| id | merged (col 1 and col 2) |
|----|--------------------------|
| 1  | "A,B,D"                  |
| 2  | "A,B,X,Y"                |
| 3  | "B,D"                    |
CodePudding user response:
Split the columns `col1` and `col2` to get arrays, union them with `array_union`, and finally use `array_join` to get a string back:
```python
import pyspark.sql.functions as F

df = spark.createDataFrame([
    (1, "A,B", "A,D"),
    (2, "A,B", "X,Y"),
    (3, "B,D", "B,D")
], ["id", "col1", "col2"])

df1 = df.select(
    "id",
    F.array_join(
        F.array_union(F.split("col1", ","), F.split("col2", ",")),
        ","
    ).alias("merged_col1_col2")
)

df1.show()
# +---+----------------+
# | id|merged_col1_col2|
# +---+----------------+
# |  1|           A,B,D|
# |  2|         A,B,X,Y|
# |  3|             B,D|
# +---+----------------+
```
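If you want to sanity-check the logic without a Spark session, the same split/union/join pipeline can be sketched in plain Python. Note that `array_union` deduplicates while keeping elements in first-appearance order rather than sorting them (`merge_csv` below is an illustrative helper, not part of PySpark):

```python
def merge_csv(col1: str, col2: str) -> str:
    # Split both comma-separated strings, then union with
    # first-appearance order preserved (mirrors array_union's dedup).
    items = col1.split(",") + col2.split(",")
    seen = []
    for x in items:
        if x not in seen:
            seen.append(x)
    return ",".join(seen)

print(merge_csv("A,B", "A,D"))  # A,B,D
print(merge_csv("A,B", "X,Y"))  # A,B,X,Y
print(merge_csv("B,D", "B,D"))  # B,D
```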
CodePudding user response:
You can define a helper function to handle the merging, then use `apply` to run it over the two columns row by row, like this:
```python
import pandas as pd

def h(a, b):
    a = a.split(",")
    b = b.split(",")
    u = a + b
    return ",".join(sorted(set(u)))

df = pd.DataFrame({
    "col 1": ["A,B", "A,B", "B,D"],
    "col 2": ["A,D", "X,Y", "B,D"]
})
df["merged"] = df.apply(lambda row: h(row["col 1"], row["col 2"]), axis=1)
df.head()
```
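As an alternative to a row-wise `apply`, the same merge can be written as a list comprehension over the zipped columns, which is often faster on large frames (a sketch, assuming the same comma-separated string format as above):

```python
import pandas as pd

df = pd.DataFrame({
    "col 1": ["A,B", "A,B", "B,D"],
    "col 2": ["A,D", "X,Y", "B,D"]
})

# Zip the two columns and merge each pair of comma-separated strings;
# sorted(set(...)) deduplicates, matching the apply-based answer above.
df["merged"] = [
    ",".join(sorted(set(a.split(",") + b.split(","))))
    for a, b in zip(df["col 1"], df["col 2"])
]
print(df["merged"].tolist())  # ['A,B,D', 'A,B,X,Y', 'B,D']
```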