Convert a column with array to separate columns using pyspark


I have a data frame like this

+------------+-----------------+-------------------------------+
| Name       |   Age           | Answers                       |
+------------+-----------------+-------------------------------+
| Maria      | 23              | [apple, mango, orange, banana]|
| John       | 55              | [apple, orange, banana]       |
| Brad       | 44              | [banana]                      |
+------------+-----------------+-------------------------------+

The Answers column contains an array of elements.

Expected Output

+------------+-----------------+-------+-------+------+--------+
| Name       |   Age           | apple | mango |orange| banana |
+------------+-----------------+-------+-------+------+--------+
| Maria      | 23              |  True |  True | True | True   |
| John       | 55              |  True |  False| True | True   |
| Brad       | 44              | False | False | False| True   |
+------------+-----------------+-------+-------+------+--------+

Is there a way to convert the array column into True and False columns?
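
For reproducibility, a frame like the one above could be built as follows (a sketch, assuming a running SparkSession named spark; the rows are copied from the table):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Maria", 23, ["apple", "mango", "orange", "banana"]),
     ("John", 55, ["apple", "orange", "banana"]),
     ("Brad", 44, ["banana"])],
    ["Name", "Age", "Answers"])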

Thanks in advance.

CodePudding user response:

If you don't know in advance all the possible values of the Answers array, you can resort to the following solution, which uses explode and pivot.

from pyspark.sql import functions as F

result = df \
    .withColumn("answer", F.explode("Answers")) \
    .drop("Answers") \
    .groupBy("Name", "Age") \
    .pivot("answer") \
    .agg(F.first("answer").isNotNull()) \
    .na \
    .fill(False)

It is heavy, as any pivot is when you don't know the possible values in advance.
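
If you can enumerate the possible values up front, passing them to pivot lightens this considerably, since Spark skips the extra job it otherwise runs to discover the distinct values. A sketch, assuming a Python list named possible_answers:

from pyspark.sql import functions as F

possible_answers = ["apple", "banana", "mango", "orange"]  # assumed known in advance

result = df \
    .withColumn("answer", F.explode("Answers")) \
    .groupBy("Name", "Age") \
    .pivot("answer", possible_answers) \
    .agg(F.first("answer").isNotNull()) \
    .na \
    .fill(False)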


Solution explained

1. Array explode

.withColumn("answer", F.explode("Answers")) \
.drop("Answers")

The explode function creates one row per value in the Answers array (N rows for an N-element array).

+-----+---+--------+
| Name|Age|  answer|
+-----+---+--------+
|Maria| 23|   apple|
|Maria| 23|   mango|
|Maria| 23|  orange|
|Maria| 23|  banana|
| John| 55|   apple|
| John| 55|  orange|
| John| 55|  banana|
| Brad| 44|  banana|
+-----+---+--------+
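
Note that explode drops rows whose Answers array is empty or null, so such a person would disappear from the final result. If those rows must survive, explode_outer (also in pyspark.sql.functions) emits them with a null answer instead, e.g.:

.withColumn("answer", F.explode_outer("Answers"))

The null answer then flows into the pivot, so it may need its own handling there.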

2. Pivot

.groupBy("Name", "Age") \
.pivot("answer") \
.agg(F.first("answer").isNotNull())

Transposes rows into columns, creating one column for each distinct value in the answer column.

It would be safer if you had a column that uniquely identifies each row (i.e. an id) and used that in the groupBy; a sketch of that idea follows the table below.

+-----+---+-----+------+-----+------+
| Name|Age|apple|banana|mango|orange|
+-----+---+-----+------+-----+------+
| John| 55| true|  true| null|  true|
|Maria| 23| true|  true| true|  true|
| Brad| 44| null|  true| null|  null|
+-----+---+-----+------+-----+------+
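
A minimal sketch of that idea, using monotonically_increasing_id to fabricate a surrogate key (the id column here is an assumption, not part of the original data):

from pyspark.sql import functions as F

result = (df
    .withColumn("id", F.monotonically_increasing_id())  # unique value per input row
    .withColumn("answer", F.explode("Answers"))
    .groupBy("id", "Name", "Age")  # the id keeps two people with the same Name/Age apart
    .pivot("answer")
    .agg(F.first("answer").isNotNull())
    .na.fill(False)
    .drop("id"))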

3. Fix missing values

Missing values are reported as null during the pivot. Replace all nulls with false as required.

.na \
.fill(False)

+-----+---+-----+------+-----+------+
| Name|Age|apple|banana|mango|orange|
+-----+---+-----+------+-----+------+
| John| 55| true|  true|false|  true|
|Maria| 23| true|  true| true|  true|
| Brad| 44|false|  true|false| false|
+-----+---+-----+------+-----+------+
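
As a side note, a bool passed to fill only replaces nulls in columns of matching (boolean) type, so Name and Age are untouched here. If the frame also carried other boolean columns that should stay null, the fill could be restricted via the subset parameter (pivoted_columns below is an assumed list of the pivoted column names):

result = result.na.fill(False, subset=pivoted_columns)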

CodePudding user response:

A possible solution, when the list of all possible answers is known, is to create a column for each of them, indicating whether the 'Answers' column contains that particular answer for that row.

Suppose that the list of possible answers is called possible_answers (so, in your case, ['orange', 'apple', 'mango', 'banana']); then the following code produces the DataFrame you want (assuming df is your input DataFrame):

import re
from pyspark.sql import functions as F

def normalize_column_name(name):
    """Normalize a column name, backticking names with invalid characters."""
    return f'`{name}`' if re.search(r'[_.(/]', name) else name
    # if you prefer, you can replace the invalid characters with a valid one, e.g., '_'
    # return re.sub(r'[_.(/]', '_', name)

for c in sorted(possible_answers):  # sorted is optional, but guarantees the order of the columns
    df = df.withColumn(normalize_column_name(c), F.array_contains('Answers', c))
df = df.drop('Answers')

If you do not know beforehand all the possible answers, you can infer them from the input DataFrame:

possible_answers = [r[0] for r in df.select(F.explode('Answers')).distinct().collect()]
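
One caveat with array_contains: it returns null, not False, when the Answers array itself is null, so rows with a missing array would still need a fill afterwards:

df = df.na.fill(False)  # turn null flags (from null Answers arrays) into False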

Here is an example with a DataFrame containing an answer with invalid characters for a column name:

+-----+---+----------------------------------------+
|Name |Age|Answers                                 |
+-----+---+----------------------------------------+
|Maria|23 |[apple, mango, orange, banana]          |
|John |55 |[apple, orange, banana, lime.strawberry]|
|Brad |44 |[banana, nut/ pear]                     |
+-----+---+----------------------------------------+

and here is the result:

+-----+---+-----+------+-----------------+-----+-----------+------+
|Name |Age|apple|banana|`lime.strawberry`|mango|`nut/ pear`|orange|
+-----+---+-----+------+-----------------+-----+-----------+------+
|Maria|23 |true |true  |false            |true |false      |true  |
|John |55 |true |true  |true             |false|false      |true  |
|Brad |44 |false|true  |false            |false|true       |false |
+-----+---+-----+------+-----------------+-----+-----------+------+