Check for a column name in PySpark dataframe when schema is given


I have a schema structure as below:

StructField('results', ArrayType(MapType(StringType(), StringType()), True), True), 
StructField('search_information', MapType(StringType(), StringType()), True), 
StructField('metadata', MapType(StringType(), StringType()), True), 
StructField('parameters', MapType(StringType(), StringType()), True), 
StructField('results_2', MapType(StringType(), StringType()), True),

I have the above columns in a file, and each file may or may not contain all of these columns. I am reading the JSON file as:

df = spark.read.schema(schema).json(path)
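For reference, the fields above assembled into a complete StructType with the required imports (a sketch; nullability copied from the snippet):

from pyspark.sql.types import StructType, StructField, ArrayType, MapType, StringType

schema = StructType([
    StructField('results', ArrayType(MapType(StringType(), StringType()), True), True),
    StructField('search_information', MapType(StringType(), StringType()), True),
    StructField('metadata', MapType(StringType(), StringType()), True),
    StructField('parameters', MapType(StringType(), StringType()), True),
    StructField('results_2', MapType(StringType(), StringType()), True),
])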

I need to check whether certain columns exist and make the necessary transformations. I am checking for column existence as:

if "metadata:" in df.schema.simpleString(): 

The above always returns True because I have defined the schema. How can I check the raw file data for column existence?

CodePudding user response:

You can read the file without specifying the schema:

df = spark.read.option('multiline', 'true').json('file_name.json')

Then, if you want to check for column existence, you can use one of the following:

if 'metadata' in df.columns:
if 'metadata' in df.schema.names:
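Building on these checks, a common pattern is to branch on column presence and add a missing column as a null map so downstream transformations can be applied uniformly. A minimal sketch (the transformation shown is only a placeholder):

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

if 'metadata' in df.columns:
    # the column came from the file: apply your actual transformation here
    df = df.withColumn('metadata_copy', F.col('metadata'))
else:
    # the column is absent in this file: add it as a null map so later code is uniform
    df = df.withColumn('metadata', F.lit(None).cast(MapType(StringType(), StringType())))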

Another way is to use Python tools to check for existence of keys inside JSON:

import json

# parse the raw file contents (same file you pass to Spark)
with open('file_name.json') as f:
    j = json.load(f)

if 'metadata' in j:
    # the key exists in the raw data
    ...