I have a data_frame as below,
Id | Col1 |
---|---|
1 | [["A", "B", "E", "F"]] |
2 | [["A", "D", "E"]] |
I have a list, ["A", "B", "C"].
I would like to add each element of the list as a column and check whether it exists in Col1 or not. So my expected output would be:
Id | Col1 | A | B | C |
---|---|---|---|---|
1 | [["A", "B", "E", "F"]] | 1 | 1 | 0 |
2 | [["A", "D", "E"]] | 1 | 0 | 0 |
I tried the code below, which checks whether any one of the values in the list exists in Col1, but I'm not sure how to do that for each value in the list separately.
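from pyspark.sql import functions as F

# Flags rows where Col1 shares at least one element with the list
list_exist = data_frame.withColumn("list", F.array([F.lit(i) for i in list]))\
    .withColumn("list_exist", F.arrays_overlap(F.col("Col1"), F.col("list")))\
    .drop("list")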
I'm new to PySpark so any help is much appreciated. Thanks!
Answer:
This can be achieved using a list comprehension that builds one column per list element:
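from pyspark.sql import functions as F

ls = ["A", "B", "C"]
...
# One 0/1 column per element of ls: 1 if Col1 contains it, else 0
df = df.select('*', *[F.array_contains('Col1', c).cast('int').alias(c) for c in ls])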
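To make the per-row logic concrete, here is a minimal plain-Python sketch (no Spark required) of what that comprehension computes for each row. It assumes Col1 holds a flat list of strings; the `rows` data and the `add_flags` helper are hypothetical names used only for illustration, not part of PySpark.

```python
# Plain-Python sketch of the per-row membership check.
# Assumption: each row's Col1 is a flat list of strings.
rows = [
    {"Id": 1, "Col1": ["A", "B", "E", "F"]},
    {"Id": 2, "Col1": ["A", "D", "E"]},
]
ls = ["A", "B", "C"]


def add_flags(row, candidates):
    """Return a copy of row with one 0/1 entry per candidate value."""
    # int(True) == 1 and int(False) == 0, mirroring .cast('int') on a boolean
    flags = {c: int(c in row["Col1"]) for c in candidates}
    return {**row, **flags}


result = [add_flags(r, ls) for r in rows]
for r in result:
    print(r)
```

In Spark, `F.array_contains` plays the role of the `in` test and `.alias(c)` names the new column, so each element of `ls` becomes its own 0/1 column rather than a single combined flag as with `arrays_overlap`.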