I want to create a simple PySpark DataFrame with one column that holds an array of dictionaries. I created the schema for the groups column and created one row:
from pyspark.sql import types as T

schema = T.StructType([
    T.StructField(
        'groups', T.ArrayType(
            T.StructType([
                T.StructField("types", T.ArrayType(T.StringType(), False)),
                T.StructField("label", T.StringType())
            ]),
        )
    )
])
groups_rows = [{
    "groups": [
        {
            "types": ["baseball", "basketball"],
            "label": "Label 1"
        },
        {
            "types": ["football"],
            "label": "Label 2"
        }
    ]
}]
data = [groups_rows]
sections_df = spark.createDataFrame(data=data, schema=schema)
When I initialize the dataframe I get a type error:
TypeError: field groups: ArrayType(StructType(List(StructField(types,ArrayType(StringType,false),true),StructField(label,StringType,true))),true) can not accept object {'groups': [{'types': ['baseball', 'basketball'], 'label': 'Label 1'}, {'types': ['football'], 'label': 'Label 2'}]} in type <class 'dict'>
What is the cause of this error? What should I be doing differently in terms of setting up this DataFrame? Should I use a MapType?
CodePudding user response:
This worked for me:
from pyspark.sql import types as T
schema = T.StructType([
    T.StructField('groups',
        T.ArrayType(
            T.StructType([
                T.StructField('label', T.StringType(), True),
                T.StructField('types',
                    T.ArrayType(T.StringType(), True),
                    True)
            ]), True),
        True)
])
groups_rows = [{
    "groups": [
        {
            "types": ["baseball", "basketball"],
            "label": "Label 1"
        },
        {
            "types": ["football"],
            "label": "Label 2"
        }
    ]
}]
data = groups_rows
sections_df = spark.createDataFrame(data=data, schema=schema)
It looks like wrapping the rows in another list with data = [groups_rows] was causing the issue: createDataFrame treats each element of data as one row, so the extra nesting made Spark try to fit the entire dict into the groups field, which expects an array. Why were you doing that?
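As a quick sanity check (a sketch, assuming the schema and groups_rows defined above), you can verify that the result is a single row containing the array of structs:

# Verify the structure and row count of the DataFrame built above
sections_df.printSchema()
sections_df.show(truncate=False)
sections_df.count()  # should be 1: one element of `data` per row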
CodePudding user response:
This is JSON, and Spark has built-in methods to read it, so defining a schema may not be necessary. Have you tried:
# Parse the list of dicts as JSON and replace nulls in string columns with ''
df = spark.read.json(sc.parallelize(groups_rows)).na.fill('')
df.printSchema()
root
 |-- groups: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- label: string (nullable = true)
 |    |    |-- types: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
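If you prefer not to rely on Spark stringifying the Python dicts, a more explicit variant (a sketch, using the standard json module) is to serialize them to JSON strings first:

import json
# Serialize each dict to a proper JSON string before handing it to Spark
df = spark.read.json(sc.parallelize([json.dumps(r) for r in groups_rows]))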
You can then proceed and select fields as required:
df.selectExpr('inline(groups)').show(truncate=False)
+-------+----------------------+
|label  |types                 |
+-------+----------------------+
|Label 1|[baseball, basketball]|
|Label 2|[football]            |
+-------+----------------------+
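If you would rather stay in the DataFrame API than use the SQL inline() function, an equivalent sketch with explode looks like this:

from pyspark.sql import functions as F
# explode turns each struct in the array into its own row,
# then the struct fields are selected out as columns
df.select(F.explode('groups').alias('g')).select('g.label', 'g.types').show(truncate=False)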
Alternatively, write the JSON to a file and read it back using the Databricks utilities. Code below:
dbutils.fs.put("/tmp/groupd.json", """
{
"groups": [
{
"types": ["baseball", "basketball"],
"label": "Label 1"
},
{
"types": ["football"],
"label": "Label 2"
}
]
}
""", True)
spark.read.option('multiline',True).option('mode','permissive').json('/tmp/groupd.json').show()
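The same inline() trick from earlier also works on the file-based read (a sketch, reusing the path written above):

df2 = spark.read.option('multiline', True).json('/tmp/groupd.json')
# Flatten the array of structs into one row per group, as before
df2.selectExpr('inline(groups)').show(truncate=False)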