How to validate particular column in a Dataframe without troubling other columns using spark-sql?

Time:04-06

set.createOrReplaceTempView("input1");

String look = "select case when length(date)>0 then 'Y' else 'N' end as date from input1";
Dataset<Row> Dataset_op = spark.sql(look);
Dataset_op.show();

In the code above, the dataframe 'set' has 10 columns and I've validated one of them, the 'date' column. It returns the date column alone.

My question is how to return all the columns with the validated date column in a single dataframe?

Is there any way to get all the columns in the dataframe without manually selecting each one in the select statement? Please share your suggestions. TIA.

CodePudding user response:

Data

df= spark.createDataFrame([
  (1,'2022-03-01'),
  (2,'2022-04-17'),
  (3,None)
],('id','date'))

df.show()

+---+----------+
| id|      date|
+---+----------+
|  1|2022-03-01|
|  2|2022-04-17|
|  3|      null|
+---+----------+

You have two options

Option 1: select the valid rows without projecting a new 'Y'/'N' column

df.createOrReplaceTempView("input1")

String_look = "select id, date from input1 where length(date)>0"
spark.sql(String_look).show()

+---+----------+
| id|      date|
+---+----------+
|  1|2022-03-01|
|  2|2022-04-17|
+---+----------+

Option 2: project 'Y' and 'N' into a new column. Remember that the where clause is applied before column projection, so you can't reference the newly created column in the where clause.

String_look = "select id, date, case when length(date)>0 then 'Y' else 'N' end as status from input1 where length(date)>0";

+---+----------+------+
| id|      date|status|
+---+----------+------+
|  1|2022-03-01|     Y|
|  2|2022-04-17|     Y|
+---+----------+------+