Why isn't PySpark JSON writing all keys to df when a custom schema is defined?


I'm trying to read approximately 2,000 files from an S3 bucket, parse the data in each file, and then write the parsed output to another bucket.

Each file is made up of an array of dictionaries, e.g.

[
    {'region_code': "UK", 'city': "London", 'country_name': "England", ...etc.}
]

I have the following schema for my input data. These are the keys I want to grab out of each file:

    from pyspark.sql.types import (
        StructType, StructField, StringType, ArrayType, IntegerType
    )

    schema = StructType([
        StructField('region_code', StringType(), True),
        StructField('country_code', StringType(), True),
        StructField('city', StringType(), True),
        StructField('last_update', StringType(), True),
        StructField('latitude', StringType(), True),
        StructField('tags', ArrayType(StringType(), True), True),
        StructField('area_code', StringType(), True),
        StructField('country_name', StringType(), True),
        StructField('hostnames', ArrayType(StringType(), True), True),
        StructField('org', StringType(), True),
        StructField('asn', StringType(), True),
        StructField('isp', StringType(), True),
        StructField('longitude', StringType(), True),
        StructField('domains', ArrayType(StringType(), True), True),
        StructField('ip_str', StringType(), True),
        StructField('os', StringType(), True),
        StructField('ports', ArrayType(IntegerType(), True), True)
    ])

I read all the files in the input S3 bucket folder:

    df = spark.read.option("multiLine", "true").schema(schema).json(
        "s3a://bucket-name/test-folder/*"
    )

and then write the output to another S3 bucket:

df.write.format('json').save("s3a://bucket-name/test/outputdata")

However, when I look at the output data, some keys are sometimes missing and I'm not sure why, e.g.

[
    {'city': "London", 'country_name': "England", ...etc.}, # No 'region_code' key
    {'region_code': "UK", 'country_name': "England", ...etc.} # No 'city' key
]

I thought the schema defined the structure of the DataFrame. I'm assuming that when a key is missing it means the key wasn't present, or was null, in the input data. But in that case, wouldn't the schema define the key anyway and just return null?
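
For reference, the DataFrame itself can be inspected before writing; with the schema applied, a missing key should show up as a null value rather than a missing column (a minimal sketch using the df read above):

    # The schema fixes the set of columns, so keys absent from a record
    # should appear as nulls in the DataFrame rather than as missing columns.
    df.printSchema()
    df.select('region_code', 'city', 'country_name').show(5)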

CodePudding user response:

Spark is most likely dropping the null fields when it writes the JSON files; by default the JSON writer omits keys whose values are null. You can force Spark to write all fields, even null ones, by setting the ignoreNullFields option to False. See the JSON data source documentation for the other available options.

spark_df.write.option('ignoreNullFields', False) \
    .json('path/to/folder/')
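
Applied to the setup in the question, the whole read/write step would look something like this (a minimal sketch; it reuses the schema and bucket paths from the question above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same read as in the question; `schema` is the custom schema defined above.
df = spark.read.option("multiLine", "true").schema(schema).json(
    "s3a://bucket-name/test-folder/*"
)

# ignoreNullFields=False keeps null columns as explicit "key": null pairs
# in the output JSON instead of omitting them.
df.write \
    .format('json') \
    .option('ignoreNullFields', False) \
    .save("s3a://bucket-name/test/outputdata")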