I have a dataframe whose columns hold dictionaries containing multiple lists, and I would like to create a dataframe by extracting a certain element, 'Student'.
1 | 2 | 3 | 4 |
---|---|---|---|
{"Student":["Grad","School"], "Comments": "Finished Education"} | {"Student":["New"], "Comments": "Started Education", Location : ["USA", "China", "Australia"]} | {"Student": ["Middle", "School"], "ID" : ["1000", "2000"]}} | {"Student": ["Med","School"]} |
Expected output:
Student |
---|
Grad, School |
New |
Middle, School |
Med, School |
I have tried to read the dataframe into a dictionary, but was unable to retrieve only the 'Student' element from the dictionary.
data_dict = {}
df = df.toPandas()
for column in df.columns:
    data_dict[column] = df[column].values.tolist()
Student = [data for data in data_dict.values()]
CodePudding user response:
First of all, what you have are not dictionaries and lists. In a Spark dataframe, what you call a dictionary is a struct, and what you call a list is an array. Data types can be inspected with df.printSchema().
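For reference, on a dataframe with the schema used in the full example below, df.printSchema() prints a tree along these lines (only columns 1 and 2 shown; 3 and 4 are analogous):
df.printSchema()
# root
#  |-- 1: struct (nullable = true)
#  |    |-- Student: array (nullable = true)
#  |    |    |-- element: string (containsNull = true)
#  |    |-- Comments: string (nullable = true)
#  |-- 2: struct (nullable = true)
#  |    |-- Student: array (nullable = true)
#  |    |    |-- element: string (containsNull = true)
#  |    |-- Comments: string (nullable = true)
#  |    |-- Location: array (nullable = true)
#  |    |    |-- element: string (containsNull = true)
# ...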
You can extract the "Student" fields from the structs, then add everything to an array in order to finally explode it:
arr = F.array([F.col(f'{c}.Student') for c in df.columns])
df = df.select(F.explode(arr).alias('Student'))
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[((["Grad","School"], "Finished Education"),(['New'], "Started Education", ["USA", "China", "Australia"]),(["Middle", "School"], ["1000", "2000"]),(["Med","School"],))],
'`1` struct<Student:array<string>,Comments:string>, `2` struct<Student:array<string>,Comments:string,Location:array<string>>, `3` struct<Student:array<string>,ID:array<string>>, `4` struct<Student:array<string>>')
arr = F.array([F.col(f'{c}.Student') for c in df.columns])
df = df.select(F.explode(arr).alias('Student'))
df.show()
# +----------------+
# |         Student|
# +----------------+
# |  [Grad, School]|
# |           [New]|
# |[Middle, School]|
# |   [Med, School]|
# +----------------+
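Side note: the expected output in the question shows comma-separated strings rather than arrays. If that exact formatting is wanted (my assumption, not part of the original answer), the arrays can be joined with F.array_join as a final step:
df = df.select(F.array_join('Student', ', ').alias('Student'))
df.show()
# +--------------+
# |       Student|
# +--------------+
# |  Grad, School|
# |           New|
# |Middle, School|
# |   Med, School|
# +--------------+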
Another working option:
to_melt = [f"`{c}`.Student" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) Student")
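For the four columns above, the f-string builds the expression stack(4, `1`.Student, `2`.Student, `3`.Student, `4`.Student) Student, which unpivots the four Student arrays into one row each. Run against the original four-column df (i.e. before the explode version reassigned it), it gives the same result:
df.selectExpr("stack(4, `1`.Student, `2`.Student, `3`.Student, `4`.Student) Student").show()
# +----------------+
# |         Student|
# +----------------+
# |  [Grad, School]|
# |           [New]|
# |[Middle, School]|
# |   [Med, School]|
# +----------------+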