I am a little confused about the .as[] function on Spark's DataFrame. The documentation says that it
returns a new Dataset where each record has been mapped to the specified type.
But, for example, if I do:
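// Needed for toDF() and as[...]; assumes a SparkSession named spark (automatic in spark-shell)
import spark.implicits._

case class Person(id: Int, name: String)
case class NewPerson(id: Int)
val person1 = Person(1, "a")
val df = Seq(person1).toDF()
val ds = df.as[NewPerson]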
the ds dataset I get still has the two columns id and name of the class Person. I would expect it to have only the id column of the class NewPerson.
What did the function do here?
CodePudding user response:
Actually, the as method only changes the view of the data, not the data itself, as explained in the documentation:
Note that as[] only changes the view of the data that is passed into typed operations, such as map(), and does not eagerly project away any columns that are not present in the specified class.
So as does not remove columns that are not present in your case class; it just creates a typed view of your rows that you can use in typed operations.
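A minimal sketch of this behavior, reusing the case classes from the question. The local SparkSession (the spark value, the local[*] master, and the app name) and the names columnsAfterAs and ids are assumptions made for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Case classes from the question.
case class Person(id: Int, name: String)
case class NewPerson(id: Int)

// Assumption: a local session, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("as-demo").getOrCreate()
import spark.implicits._

val df = Seq(Person(1, "a")).toDF()

// as[NewPerson] only changes the static type; the underlying columns are untouched.
val ds = df.as[NewPerson]
val columnsAfterAs = ds.columns.toSeq // still contains both "id" and "name"

// A typed operation such as map sees each row as a NewPerson,
// so only the id field is available inside the function.
val ids = ds.map(p => p.id).collect().toSeq
```

Here the extra name column is still part of ds, but it is simply not visible from inside the typed function passed to map.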
CodePudding user response:
Adding to Vincent Doba's answer: if you have used a case class A to create a value ds of type Dataset[A], then you can truncate it to just the fields of A with the following:
val ds_clean: Dataset[A] = ds.map(identity)
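A sketch of two ways to drop the extra columns, again using the question's case classes and assuming a local SparkSession (spark, viaMap, viaSelect, mapColumns, and selectColumns are illustrative names). The second option, selecting the columns before calling as, is not mentioned in the answers above; it is included as an assumption-labeled alternative that projects in the untyped API and may avoid passing each row through the object encoder.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Case classes from the question.
case class Person(id: Int, name: String)
case class NewPerson(id: Int)

// Assumption: a local session, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("projection-demo").getOrCreate()
import spark.implicits._

val ds: Dataset[NewPerson] = Seq(Person(1, "a")).toDF().as[NewPerson]

// Option 1 (from the answer): a typed map re-encodes each row as NewPerson,
// so the result carries only NewPerson's fields.
val viaMap: Dataset[NewPerson] = ds.map(identity)

// Option 2 (alternative, not from the answers): project explicitly, then apply the type.
val viaSelect: Dataset[NewPerson] = Seq(Person(1, "a")).toDF().select("id").as[NewPerson]

val mapColumns = viaMap.columns.toSeq
val selectColumns = viaSelect.columns.toSeq
```

Both results should end up with a single id column; which one to prefer depends on whether you would rather keep the projection implicit in the case class or spell out the selected columns.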