Convert a column with array to separate columns using pyspark-CodePudding

I have a data frame like this

 ------------ ----------------- ------------------------------- 
| Name       |   Age           | Answers                       |
 ------------ ----------------- ------------------------------- 
| Maria      | 23              | [apple, mango, orange, banana]| 
| John       | 55              | [apple, orange, banana]       |
| Brad       | 44              | [banana]                      |
 ------------ ----------------- -------------------------------

The answers column contains an array of elements

Expected Output

 ------------ ----------------- ------------------------------- 
| Name       |   Age           | apple | mango |orange| banana |
 ------------ ----------------- ------------------------------- 
| Maria      | 23              |  True |  True | True | True   |
| John       | 55              |  True |  False| True | True   |
| Brad       | 44              | False | False | False| True   |
 ------------ ----------------- -------------------------------

Is there a way where I can convert the array column into True and False columns?

Thanks in advance.

CodePudding user response：

If you don't know in advance all the possible values of the Answers array, you can resort to the following solution that uses explode pivot.

df \
    .withColumn("answer", F.explode("Answers")) \
    .drop("Answers") \
    .groupBy("Name", "Age") \
    .pivot("answer") \
    .agg(F.first("answer").isNotNull()) \
    .na \
    .fill(False)

It is heavy, as is the pivot if you don't know in advantage the possible values.

Solution explained

1. Array explode

.withColumn("answer", F.explode("Answers")) \
.drop("Answers")

The explode function will create N different rows for each value in the Answers array.

 ----- --- --------                                                               
| Name|Age|  answer|
 ----- --- -------- 
|Maria| 23|   apple|
|Maria| 23|   mango|
|Maria| 23|  orange|
|Maria| 23|  banana|
| John| 55|   apple|
| John| 55|  orange|
| John| 55|  banana|
| Brad| 44|  banana|
 ----- --- --------

2. Pivot

.groupBy("Name", "Age") \
.pivot("answer") \
.agg(F.first("answer").isNotNull())

Transpose rows in columns by creating a column for each distinct value in the answer column.

It would be safer if you had a column that uniquely identifies each row (i.e. an id) and you used that in the groupBy.

 ----- --- ----- ------ ----- ------ 
| Name|Age|apple|banana|mango|orange|
 ----- --- ----- ------ ----- ------ 
| John| 55| true|  true| null|  true|
|Maria| 23| true|  true| true|  true|
| Brad| 44| null|  true| null|  null|
 ----- --- ----- ------ ----- ------

3. Fix missing values

Missing values are reported as null during the pivot. Replace all nulls with false as required.

.na \
.fill(False)

 ----- --- ----- ------ ----- ------ 
| Name|Age|apple|banana|mango|orange|
 ----- --- ----- ------ ----- ------ 
| John| 55| true|  true|false|  true|
|Maria| 23| true|  true| true|  true|
| Brad| 44|false|  true|false| false|
 ----- --- ----- ------ ----- ------

CodePudding user response：

A possible solution, knowing the list of all the possible answers, is to create a column for each of them, stating if the column 'Answers' contains that particular answer for that row.

Suppose that the list of possible answers is called possible_answers (so, in your case is ['orange', 'apple', 'mango', 'banana']), then the following code produces the DataFrame you want (suppose that df is your input DataFrame):

import re
from pyspark.sql import functions as F

def normalize_column_name(name):
    """Normalize column name backticking names with invalid characters"""
    return (f'`{name}`' if re.search(r'[_|\.|\(|\/]', name) else name)
    # if you prefer, you can replace the invalid characters with a valid one, e.g., '-'
    # return re.sub(r"(_|\.|\(|\/)", "_", name)

for c in sorted(possible_answers):  # sorted is optional, but guarantees the order of the columns
    df = df.withColumn(normalize_column_name(c), F.array_contains('Answers', c))
df = df.drop('Answers')

If you do not know beforehand all the possible answers, you can infer them from the input DataFrame:

possible_answers = [r[0] for r in df.select(F.explode('Answers')).distinct().collect()]

Here an example with a DataFrame containing an answer with invalid characters for a column name:

 ----- --- ---------------------------------------- 
|Name |Age|Answers                                 |
 ----- --- ---------------------------------------- 
|Maria|23 |[apple, mango, orange, banana]          |
|John |55 |[apple, orange, banana, lime.strawberry]|
|Brad |44 |[banana, nut/ pear]                     |
 ----- --- ----------------------------------------

and here the result:

 ----- --- ----- ------ ----------------- ----- ----------- ------ 
|Name |Age|apple|banana|`lime.strawberry`|mango|`nut/ pear`|orange|
 ----- --- ----- ------ ----------------- ----- ----------- ------ 
|Maria|23 |true |true  |false            |true |false      |true  |
|John |55 |true |true  |true             |false|false      |true  |
|Brad |44 |false|true  |false            |false|true       |false |
 ----- --- ----- ------ ----------------- ----- ----------- ------