Convert Spark DF to a DS with different field names


I want to convert a Spark dataframe to a dataset of a POJO with different field names. I have a dataframe with the columns name and date_of_birth, whose types are IntegerType and DateType.

And a POJO of:

import java.io.Serializable;
import java.util.Date;

public class Person implements Serializable {
    private Integer name;
    private Date dateOfBirth;
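    // Encoders.bean maps columns via JavaBean getters/setters, assumed present but not shown here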
}

I can convert it to a dataset successfully with the following code:

Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> personDS = result.as(personEncoder);
List<Person> personList = personDS.collectAsList();

However, this only works if I first rename the dataframe's columns to match those of the Person POJO. Is there any way to tell Spark how to map the fields from the POJO side?

I thought about Gson's @SerializedName("date_of_birth"), but it had no effect.

CodePudding user response:

If you have a name mapping, say in a Map, you could use it to rename the columns before converting the dataframe into a dataset.

It could be written like this:

// Requires: import static org.apache.spark.sql.functions.col;
//           import java.util.stream.Collectors;
//           import org.apache.spark.sql.Column;

// I create the map here, but it could be read from a config file, for instance.
// Keys are the dataframe's column names, values are the POJO's field names.
Map<String, String> nameMapping = new java.util.HashMap<>();
nameMapping.put("name", "name");
nameMapping.put("date_of_birth", "dateOfBirth");

Column[] renamedColumns = nameMapping
        .entrySet()
        .stream()
        .map(x -> col(x.getKey()).alias(x.getValue()))
        .collect(Collectors.toList())
        .toArray(new Column[0]);

Dataset<Person> personDS = result.select(renamedColumns).as(personEncoder);
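If you'd rather not build the Column array, the same renaming could also be done with withColumnRenamed; a rough equivalent sketch, reusing the same nameMapping:

Dataset<Row> renamed = result;
for (Map.Entry<String, String> e : nameMapping.entrySet()) {
    // rename each dataframe column to its POJO counterpart
    renamed = renamed.withColumnRenamed(e.getKey(), e.getValue());
}
Dataset<Person> personDS = renamed.as(personEncoder);
List<Person> personList = personDS.collectAsList();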

CodePudding user response:

I am not aware of specific annotations. However, here is how I'd solve it.

I would create a dataframe with exactly the shape I want, then convert it.

It would look like:

// Requires: import static org.apache.spark.sql.functions.col;
//           import org.apache.spark.sql.types.DataTypes;
Dataset<Row> exportDf = df
    .withColumn("dateOfBirth",
        col("date_of_birth").cast(DataTypes.StringType))
    .drop("date_of_birth");

The full example I wrote can be found here: https://github.com/jgperrin/net.jgp.labs.spark/tree/master/src/main/java/net/jgp/labs/spark/l999_scrapbook/l002.
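From there, the conversion itself works just like in your question. A minimal sketch, assuming Person is adjusted so that dateOfBirth is a String to match the cast above (see the notes below):

Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> personDS = exportDf.as(personEncoder);
List<Person> personList = personDS.collectAsList();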

Notes:

  • I am assuming that result in your code is a Dataset<Row> (I called it df in my snippet).
  • I used String for your date because Spark was a little touchy about converting the Date column into a Date field in the POJO. If you need help specifically with this issue, create another SO question and I'll happily look at it.