Home > Enterprise >  How do I define a schema for a dataframe by deriving it from a table where the custom schema is prov
How do I define a schema for a dataframe by deriving it from a table where the custom schema is prov

Time:09-17

I have a table in RDBMS which I'm taking into a dataframe(DF1):

1               employee_id    
2               employee_name  
3               salary
4               designation

And I have a dataframe(DF2) with the following:

_c0   _c1       _c2        _c3
101   monali    70000      developer
102   Amy       70000      developer
103   neha      65000      tester

How do I define the schema for DF2 from DF1. I want DF2 to have the schema that is defined in the above table. expected output:

employee_id   employee_name  salary     designation  
101           monali         70000      developer
102           Amy            70000      developer
103           neha           65000      tester

I want to make it parameterized.

CodePudding user response:

You can create a function mapColumnNames that takes two parameters, dataframe containing the columns (which I call columns dataframe) and dataframe you want to change columns' name (which I call data dataframe).

This function first retrieves name and id of each column in columns dataframe as a list of tuples. Then it iterates over this list of tuples, applying method withColumnRenamed on data dataframe on each iteration.

Then you can call this function mapColumnNames with DF1 as columns dataframe and DF2 as data dataframe.

Below the complete code:

def mapColumnNames(columns: DataFrame, data: DataFrame): DataFrame = {
  val columnNames = columns.collect().map(x => (x.getInt(0) - 1, x.getString(1)))
  columnNames.foldLeft(data)((data, columnName) => {
    data.withColumnRenamed(s"_c${columnName._1}", columnName._2)
  })
}

val output = mapColumnNames(DF1, DF2)

CodePudding user response:

It wasn't clear what schema does your df1 holds, so used index 1 reference to fetch columns

val columns = df1.select($"1").collect()

Otherwise, we can get all the columns associated with the first dataframe

val columns = df1.schema.fieldNames.map(col(_))

and then use select with columns fetched for our new dataframe

val newDF = df2.select(columns :_*)
  • Related