How to set default value in Dataset parsed from Dataframe if column is missing


I'm trying to create a Dataset from a DataFrame using a case class.

case class test(language: String, users_count: String = "100")

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+

df.as[test]

How can I handle the scenario where a column is missing from the DataFrame? The expectation is that the Dataset is populated with the default value provided in the case class.

If the DataFrame only has one column, the conversion throws an exception:

org.apache.spark.sql.AnalysisException: cannot resolve 'users_count' given input columns: [language];

Expected result, given this input:

+--------+
|language|
+--------+
|    Java|
|  Python|
|   Scala|
+--------+

df.as[test].collect()(0)
// test(Java,100) -- where 100 is the default value

CodePudding user response:

You could use the map function and call the constructor explicitly, like this:

df
  .map(row => test(row.getAs[String]("language")))
  .show()
+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|        100|
|  Python|        100|
|   Scala|        100|
+--------+-----------+
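This works because each row object is built through the ordinary case-class constructor, so Scala's default arguments fill in any omitted fields. A minimal plain-Scala sketch of the same mechanism (no Spark required):

```scala
// Same case class as in the question: users_count defaults to "100".
case class test(language: String, users_count: String = "100")

// Mimics what map(row => test(...)) does per row: only language is
// supplied, so the default argument provides users_count.
val rows = Seq("Java", "Python", "Scala").map(lang => test(lang))

rows.foreach(println)
// prints test(Java,100), test(Python,100), test(Scala,100)
```

If you would rather keep `df.as[test]` untouched, another option is to add the missing column first, e.g. `df.withColumn("users_count", lit("100")).as[test]` (with `import org.apache.spark.sql.functions.lit`), though that duplicates the default value outside the case class.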