Home > Back-end >  How to create dataframe with struct column in PySpark without specifying a schema?
How to create dataframe with struct column in PySpark without specifying a schema?

Time:05-02

I am learning PySpark and it is convenient to be able to quickly create example dataframes to try the functionality of the PySpark API.

The following code (where spark is a spark session):

import pyspark.sql.types as T
df = [{'id': 1, 'data': {'x': 'mplah', 'y': [10,20,30]}},
      {'id': 2, 'data': {'x': 'mplah2', 'y': [100,200,300]}},
]
df = spark.createDataFrame(df)
df.printSchema()

gives a map (and does not interpret the array correctly):

root
 |-- data: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- id: long (nullable = true)

I needed a struct. I can force a struct if I give a schema:

import pyspark.sql.types as T
df = [{'id': 1, 'data': {'x': 'mplah', 'y': [10,20,30]}},
      {'id': 2, 'data': {'x': 'mplah2', 'y': [100,200,300]}},
]
schema = T.StructType([
    T.StructField('id', LongType()),
    T.StructField('data', StructType([
        StructField('x', T.StringType()),
        StructField('y', T.ArrayType(T.LongType())),
    ]) )
])
df = spark.createDataFrame(df, schema=schema)
df.printSchema()

That indeed gives:

root
 |-- id: long (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- x: string (nullable = true)
 |    |-- y: array (nullable = true)
 |    |    |-- element: long (containsNull = true)

But this is too much typing.

Is there any other quick way to create the dataframe so that the data column is a struct without specifying the schema?

CodePudding user response:

I personally don't know if you can create structs implicitly like you want. But you can go without providing the schema by first creating the columns which would go into struct and then providing them to struct:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'mplah', [10,20,30]),
     (2, 'mplah2', [100,200,300])],
    ['id', 'x', 'y']
)

df = df.select('id', F.struct('x', 'y').alias('data'))

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- data: struct (nullable = false)
#  |    |-- x: string (nullable = true)
#  |    |-- y: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)
  • Related