I have a data frame like this
------------ ----------------- -------------------------------
| Name | Age | Answers |
------------ ----------------- -------------------------------
| Maria | 23 | [apple, mango, orange, banana]|
| John | 55 | [apple, orange, banana] |
| Brad | 44 | [banana] |
------------ ----------------- -------------------------------
The answers column contains an array of elements
Expected Output
------------ ----------------- -------------------------------
| Name | Age | apple | mango |orange| banana |
------------ ----------------- -------------------------------
| Maria | 23 | True | True | True | True |
| John | 55 | True | False| True | True |
| Brad | 44 | False | False | False| True |
------------ ----------------- -------------------------------
Is there a way where I can convert the array column into True and False columns?
Thanks in advance.
CodePudding user response:
If you don't know in advance all the possible values of the Answers
array, you can resort to the following solution that uses explode
pivot
.
df \
.withColumn("answer", F.explode("Answers")) \
.drop("Answers") \
.groupBy("Name", "Age") \
.pivot("answer") \
.agg(F.first("answer").isNotNull()) \
.na \
.fill(False)
It is heavy, as is the pivot if you don't know in advantage the possible values.
Solution explained
1. Array explode
.withColumn("answer", F.explode("Answers")) \
.drop("Answers")
The explode
function will create N different rows for each value in the Answers
array.
----- --- --------
| Name|Age| answer|
----- --- --------
|Maria| 23| apple|
|Maria| 23| mango|
|Maria| 23| orange|
|Maria| 23| banana|
| John| 55| apple|
| John| 55| orange|
| John| 55| banana|
| Brad| 44| banana|
----- --- --------
2. Pivot
.groupBy("Name", "Age") \
.pivot("answer") \
.agg(F.first("answer").isNotNull())
Transpose rows in columns by creating a column for each distinct value in the answer
column.
It would be safer if you had a column that uniquely identifies each row (i.e. an id) and you used that in the groupBy
.
----- --- ----- ------ ----- ------
| Name|Age|apple|banana|mango|orange|
----- --- ----- ------ ----- ------
| John| 55| true| true| null| true|
|Maria| 23| true| true| true| true|
| Brad| 44| null| true| null| null|
----- --- ----- ------ ----- ------
3. Fix missing values
Missing values are reported as null
during the pivot.
Replace all null
s with false
as required.
.na \
.fill(False)
----- --- ----- ------ ----- ------
| Name|Age|apple|banana|mango|orange|
----- --- ----- ------ ----- ------
| John| 55| true| true|false| true|
|Maria| 23| true| true| true| true|
| Brad| 44|false| true|false| false|
----- --- ----- ------ ----- ------
CodePudding user response:
A possible solution, knowing the list of all the possible answers, is to create a column for each of them, stating if the column 'Answers' contains that particular answer for that row.
Suppose that the list of possible answers is called possible_answers
(so, in your case is ['orange', 'apple', 'mango', 'banana']
), then the following code produces the DataFrame
you want (suppose that df
is your input DataFrame
):
import re
from pyspark.sql import functions as F
def normalize_column_name(name):
"""Normalize column name backticking names with invalid characters"""
return (f'`{name}`' if re.search(r'[_|\.|\(|\/]', name) else name)
# if you prefer, you can replace the invalid characters with a valid one, e.g., '-'
# return re.sub(r"(_|\.|\(|\/)", "_", name)
for c in sorted(possible_answers): # sorted is optional, but guarantees the order of the columns
df = df.withColumn(normalize_column_name(c), F.array_contains('Answers', c))
df = df.drop('Answers')
If you do not know beforehand all the possible answers, you can infer them from the input DataFrame
:
possible_answers = [r[0] for r in df.select(F.explode('Answers')).distinct().collect()]
Here an example with a DataFrame
containing an answer with invalid characters for a column name:
----- --- ----------------------------------------
|Name |Age|Answers |
----- --- ----------------------------------------
|Maria|23 |[apple, mango, orange, banana] |
|John |55 |[apple, orange, banana, lime.strawberry]|
|Brad |44 |[banana, nut/ pear] |
----- --- ----------------------------------------
and here the result:
----- --- ----- ------ ----------------- ----- ----------- ------
|Name |Age|apple|banana|`lime.strawberry`|mango|`nut/ pear`|orange|
----- --- ----- ------ ----------------- ----- ----------- ------
|Maria|23 |true |true |false |true |false |true |
|John |55 |true |true |true |false|false |true |
|Brad |44 |false|true |false |false|true |false |
----- --- ----- ------ ----------------- ----- ----------- ------