Why does schema_of_json fail?


Why does the second transformation fail?

import org.apache.spark.sql.functions.{col, lit, schema_of_json}

df
  .withColumn("schemaDetected", schema_of_json(lit("""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}""")))
  .show(false)
// shows: struct<City:string,State:string,ZipCodeType:string,Zipcode:bigint>

df
  .withColumn("literal", lit("""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""))
  .withColumn("schemaDetected", schema_of_json(col("literal")))
  .show(false)
// it fails with:
// cannot resolve 'schema_of_json(`literal`)' due to data type mismatch: The input json should be a string literal and not null; however, got `literal`.;;

I need to get a column's schema, but schema_of_json only accepts a lit(...) written directly inside the call; when I pass a Column instead, it fails. Why?

CodePudding user response:

The second transformation fails because, when you pass a column containing a JSON string from a DataFrame to the schema_of_json function, Spark can't be sure that the JSON string in every row of that column will evaluate to the same schema.

To understand why having the same schema for all rows matters, you need to know that the main use case for schema_of_json is to infer a schema for the from_json function.

from_json transforms a JSON string into a struct, essentially several new columns. Of course, all rows in a DataFrame must have the same columns, so when you use from_json on a DataFrame, you have to apply the same schema to every row.
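
As a minimal sketch of that workflow (assuming a DataFrame df with a JSON string column named "value", a hypothetical name): infer the schema once from a sample literal, then apply it uniformly with from_json.

import org.apache.spark.sql.functions.{col, from_json, lit, schema_of_json}

val sample = """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""

df
  // one schema, evaluated once, applied identically to every row
  .withColumn("parsed", from_json(col("value"), schema_of_json(lit(sample))))
  .select("parsed.*") // expands into the same columns for all rows
  .show(false)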

Originally, you would have been able to pass a non-literal column to schema_of_json, as you can see in ticket SPARK-24642. However, that would have meant merging the inferred schema of each row. For instance, suppose you have the following DataFrame:

+------------------+
|json_string       |
+------------------+
|{"a": 1}          |
|{"b": 2}          |
|{"a": 3, "c": 4}  |
+------------------+

If you then want to pass a schema returned by schema_of_json to the from_json function, you have to infer the schema STRUCT<a INT, b INT, c INT> to get a consistent schema for all rows. To do so, schema_of_json would either have to become an aggregate function that builds the schema by iterating over every value of the argument column, or Spark would need a separate schema-inference pass (like when you read a CSV). It was considered not worth it.
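
If you actually need such a merged schema, a common workaround (a sketch, assuming the DataFrame above is called df with a string column "json_string", and a SparkSession named spark) is to let the JSON data source do the per-row inference and merging for you:

import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._

// spark.read.json infers one schema covering all rows (here fields a, b and c)
val mergedSchema = spark.read
  .json(df.select("json_string").as[String])
  .schema

val parsed = df.withColumn("parsed", from_json(col("json_string"), mergedSchema))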

So, instead of being able to pass a DataFrame column as the argument, you can only pass a literal, which you build with the lit function, as implemented in SPARK-24709.

schema_of_json was later upgraded in Spark 3.1 (see SPARK-31044) to accept foldable columns, meaning columns that can be evaluated statically before query execution. An example of such a foldable column is regexp_replace(lit("""{"test":1}"""), "1", "2"), which is not directly a literal but a transformation of a literal.
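
For instance, on Spark 3.1+ the following should work (a sketch; the exact printed schema string may vary by version):

import org.apache.spark.sql.functions.{lit, regexp_replace, schema_of_json}

df
  // the whole argument folds to the constant {"test":2} at analysis time
  .withColumn("schemaDetected",
    schema_of_json(regexp_replace(lit("""{"test":1}"""), "1", "2")))
  .show(false)
// shows something like: struct<test:bigint>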

So, to come back to your case: when you create a column from your lit(...), Spark sees a column but not its content. Since Spark doesn't see the column's content, it can't be sure that all values in this column are the same. But Spark has to ensure that the schemas generated from those values are identical, which means it has to be sure the input values themselves are identical. So it throws an error that basically says: "I can't be sure that the JSON strings contained in this column are the same for all rows."
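
If you know all rows carry the same JSON shape, a practical way around the restriction (a sketch, reusing your "literal" column) is to pull one sample string out of the DataFrame on the driver and hand it back to schema_of_json as a genuine literal:

import org.apache.spark.sql.functions.{col, lit, schema_of_json}

// driver-side: grab one sample JSON string (assumes all rows share its shape)
val sampleJson = df.select(col("literal")).head.getString(0)

df
  .withColumn("schemaDetected", schema_of_json(lit(sampleJson)))
  .show(false)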
