I want to create a simple PySpark DataFrame with one column that holds an array of dictionaries. I created the schema for the groups column and created one row:
from pyspark.sql import types as T

schema = T.StructType([
    T.StructField(
        'groups', T.ArrayType(
            T.StructType([
                T.StructField("types", T.ArrayType(T.StringType(), False)),
                T.StructField("label", T.StringType())
            ]),
        )
    )
])
groups_rows = [{
    "groups": [
        {
            "types": ["baseball", "basketball"],
            "label": "Label 1"
        },
        {
            "types": ["football"],
            "label": "Label 2"
        }
    ]
}]
data = [groups_rows]
sections_df = spark.createDataFrame(data=data, schema=schema)
When I initialize the dataframe I get a type error:
TypeError: field groups: ArrayType(StructType(List(StructField(types,ArrayType(StringType,false),true),StructField(label,StringType,true))),true) can not accept object {'groups': [{'types': ['baseball', 'basketball'], 'label': 'Label 1'}, {'types': ['football'], 'label': 'Label 2'}]} in type <class 'dict'>
What is the cause of this error? What should I be doing differently in terms of setting up this DataFrame? Should I use a MapType?
CodePudding user response:
This worked for me:
from pyspark.sql import types as T
schema = T.StructType([
    T.StructField('groups',
        T.ArrayType(
            T.StructType([
                T.StructField('label', T.StringType(), True),
                T.StructField('types',
                    T.ArrayType(T.StringType(), True),
                    True)
            ]), True),
        True)
])
groups_rows = [{
    "groups": [
        {
            "types": ["baseball", "basketball"],
            "label": "Label 1"
        },
        {
            "types": ["football"],
            "label": "Label 2"
        }
    ]
}]
data = groups_rows
sections_df = spark.createDataFrame(data=data, schema=schema)
It looks like wrapping the rows in another list with data = [groups_rows] was causing the issue: createDataFrame treats each element of data as one row, so the extra nesting made Spark try to fit the entire dict into the groups field, which expects an array. Why were you doing that?
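As a quick sanity check (a sketch, assuming the schema and groups_rows defined above), you can verify that the result is a single row containing the array of structs:

# Verify the structure and row count of the DataFrame built above
sections_df.printSchema()
sections_df.show(truncate=False)
sections_df.count()  # should be 1: one element of `data` per row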
CodePudding user response:
This is JSON, and Spark has built-in methods to read it, so defining a schema may not be necessary. Have you tried:
# Parse the list of dicts as JSON and replace nulls in string columns with ''
df = spark.read.json(sc.parallelize(groups_rows)).na.fill('')
df.printSchema()
root
 |-- groups: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- label: string (nullable = true)
 |    |    |-- types: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
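If you prefer not to rely on Spark stringifying the Python dicts, a more explicit variant (a sketch, using the standard json module) is to serialize them to JSON strings first:

import json
# Serialize each dict to a proper JSON string before handing it to Spark
df = spark.read.json(sc.parallelize([json.dumps(r) for r in groups_rows]))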
You can then proceed and select fields as required:
df.selectExpr('inline(groups)').show(truncate=False)
+-------+----------------------+
|label  |types                 |
+-------+----------------------+
|Label 1|[baseball, basketball]|
|Label 2|[football]            |
+-------+----------------------+
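If you would rather stay in the DataFrame API than use the SQL inline() function, an equivalent sketch with explode looks like this:

from pyspark.sql import functions as F
# explode turns each struct in the array into its own row,
# then the struct fields are selected out as columns
df.select(F.explode('groups').alias('g')).select('g.label', 'g.types').show(truncate=False)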
Alternatively, write the JSON to a file and read it back using the Databricks utilities. Code below:
dbutils.fs.put("/tmp/groupd.json", """
{
"groups": [
{
"types": ["baseball", "basketball"],
"label": "Label 1"
},
{
"types": ["football"],
"label": "Label 2"
}
]
}
""", True)
spark.read.option('multiline',True).option('mode','permissive').json('/tmp/groupd.json').show()
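The same inline() trick from earlier also works on the file-based read (a sketch, reusing the path written above):

df2 = spark.read.option('multiline', True).json('/tmp/groupd.json')
# Flatten the array of structs into one row per group, as before
df2.selectExpr('inline(groups)').show(truncate=False)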