My understanding: If I have a model class that extends a second model class, I shouldn't be able to access the private members of the parent class in the child class (unless I use reflection).
Extending this, I expect that when a Spark dataframe is encoded as a dataset of the child model class, it shouldn't have columns for the private members of the parent model class. (But this is not what I observe.)
More concretely, my parent class:
public class Foo {
    private int one;
    protected String two;
    protected double three;
    // public getters and setters for all three fields (omitted here for brevity)
}
The child class:
public class Bar extends Foo {
    private int four;
    protected String five;
    // public getters and setters for both fields (omitted here for brevity)
}
I have a couple of Bar objects that I use to create a Spark dataframe, i.e., a Dataset<Row>, like so:
Dataset<Row> barDF = session.createDataFrame(barList, Bar.class);
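(For completeness, session and barList are nothing special; a minimal sketch of how they could be set up, where the builder options and the placeholder list contents are assumptions rather than my actual values:)

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Local session, purely for illustration.
SparkSession session = SparkSession.builder()
        .appName("bean-inheritance-example")
        .master("local[*]")
        .getOrCreate();

// Placeholder objects; in my real code the Bar fields are populated.
List<Bar> barList = Arrays.asList(new Bar(), new Bar());

Dataset<Row> barDF = session.createDataFrame(barList, Bar.class);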
When, at a later point, I want to encode this as a dataset,
Dataset<Bar> barDS = barDF.as(Encoders.bean(Bar.class));
I expect barDS to have four columns (excluding one, the private member of Foo). But the result of barDS.show() is instead:
+----+----+---+-----+---+
|five|four|one|three|two|
+----+----+---+-----+---+
|   9|   9|  0|  3.0|  3|
|  16|  16|  0|  4.0|  4|
+----+----+---+-----+---+
What am I missing in expecting one not to be present in the dataset? Also, what encoding can I use instead of bean encoding so that Java's rules of inheritance are obeyed?
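For reference, one way to list which bean properties are visible on Bar is plain JavaBeans introspection (I'm assuming, rather than asserting, that this is essentially the mechanism the bean encoder relies on):

import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

// Prints the JavaBean property names discovered on Bar,
// including those inherited from Foo.
// Note: getBeanInfo throws the checked IntrospectionException.
BeanInfo info = Introspector.getBeanInfo(Bar.class);
for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
    // Every class inherits a "class" property from Object#getClass(); skip it.
    if (!"class".equals(pd.getName())) {
        System.out.println(pd.getName());
    }
}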
CodePudding user response:
As the comment from @Ivo Beckers helped clarify, Spark must be using reflection, but it appears to go through the getters and setters rather than the fields themselves. So if I declare the getter/setter for one as protected, the one column no longer shows up and everything works as expected.
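For concreteness, this is roughly what the adjusted parent class looks like (the accessor names are assumed here, since the originals aren't shown above); only the visibility of one's accessors changes:

public class Foo {
    private int one;
    protected String two;
    protected double three;

    // No longer public, so bean introspection no longer reports "one"
    // as a property and no column is generated for it.
    protected int getOne() { return one; }
    protected void setOne(int one) { this.one = one; }

    public String getTwo() { return two; }
    public void setTwo(String two) { this.two = two; }

    public double getThree() { return three; }
    public void setThree(double three) { this.three = three; }
}

With that change, barDS.show() should list only the two, three, four and five columns.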