I have a DataFrame with some columns, let's say they are called:
|State|Color|Count|
I want to check whether a column exists in that DataFrame and create it if it does not. I know this is hardly worth doing for such a small example, since I only have 3 columns and could do it manually, but I want to know how to do it with bigger DataFrames. I first thought of this:
var cols = df.columns
df.withColumn("x", when(col("x").between(cols(0), cols(cols.length-1)), 5).otherwise(null))
My intention was to check whether the column "x" is in the DataFrame (in the collection of its columns) and, if it is not, to create it with the withColumn method, filled with null values. I don't know if that is correct. Is there another way to do it? My other idea is a foreach loop with ifs, but I don't think that would be efficient.
CodePudding user response:
Given the following example DataFrame:
val df = Seq(
("a", "blue", 2),
("b", "red", 1),
("c", "yellow", 3),
("d", "blue", 4),
).toDF("state", "colour", "count")
You can check for missing columns and add them in with the following:
val expectedColumns = Set("state", "colour", "count", "x")
val actualColumns = df.columns
val missingColumns = (expectedColumns -- actualColumns.toSet).map(lit(null).as(_))
df.select(actualColumns.map(col) ++ missingColumns: _*)
You specify the columns you expect to be present (expectedColumns), and then compare that to what is in df. If any are missing you create a column of null values with the appropriate name and then use a select statement to add them back in.
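One thing to be aware of: a bare lit(null) produces a column of Spark's NullType, which some sinks (Parquet, for instance) may refuse to write. Below is a minimal sketch of a typed variant of the same select approach, assuming you know which type each expected column should have. The expectedTypes list and the withExpectedColumns name are hypothetical and only here for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

// Hypothetical mapping of expected column name -> type (an assumption for illustration).
val expectedTypes: Seq[(String, DataType)] = Seq(
  "state"  -> StringType,
  "colour" -> StringType,
  "count"  -> IntegerType,
  "x"      -> IntegerType
)

// Keep the existing columns untouched and append a typed null column
// for every expected name that is not already present.
def withExpectedColumns(df: DataFrame): DataFrame = {
  val present = df.columns.toSet
  val missing: Seq[Column] = expectedTypes.collect {
    case (name, dataType) if !present(name) => lit(null).cast(dataType).as(name)
  }
  df.select(df.columns.map(col) ++ missing: _*)
}
```

Using a Seq rather than a Set for the expected names also keeps the order of the added columns predictable.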
CodePudding user response:
Given your intention, I am not sure why you want to do the whole thing inside withColumn.
I think an easier, more efficient, and more readable approach would be:
val df2 = if (!df.columns.contains("x")) {
df.withColumn("x", lit(null))
} else {
df
}
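If there are several columns to check, the same if/else can be folded over a list of names. This is only a sketch: the addMissing helper and its parameter names are hypothetical, and it fills the new columns with untyped nulls just like the snippet above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Hypothetical helper: walks over `names` and adds each one that the
// DataFrame does not already have, leaving existing columns as they are.
def addMissing(df: DataFrame, names: Seq[String]): DataFrame =
  names.foldLeft(df) { (acc, name) =>
    if (acc.columns.contains(name)) acc else acc.withColumn(name, lit(null))
  }
```

For example, addMissing(df, Seq("x", "y")) would return df with both columns added if neither exists. Each withColumn call adds a projection to the query plan, so for a long list of missing columns a single select is likely the lighter option.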