Rename Duplicate Columns of a Spark DataFrame?-CodePudding

There are several good answers about managing duplicate columns from joined dataframes, eg (How to avoid duplicate columns after join?), but what if I'm simply presented a DataFrame with duplicate columns that I have to deal with. I have no control over the processes leading up to this point.

What I have:

val data = Seq((1,2),(3,4)).toDF("a","a")
data.show

 --- --- 
|  a|  a|
 --- --- 
|  1|  2|
|  3|  4|
 --- ---

What I want:

 --- --- 
|  a|a_2|
 --- --- 
|  1|  2|
|  3|  4|
 --- ---

withColumnRenamed("a","a_2") does not work, for obvious reasons.

CodePudding user response：

The simplest way I found to do this is:

val data = Seq((1,2),(3,4)).toDF("a","a")
val deduped = data.toDF("a","a_2")
deduped.show

 --- --- 
|  a|a_2|
 --- --- 
|  1|  2|
|  3|  4|
 --- ---

For a more general solution:

val data = Seq(
  (1,2,3,4,5,6,7,8),
  (9,0,1,2,3,4,5,6)
).toDF("a","b","c","a","d","b","e","b")
data.show

 --- --- --- --- --- --- --- --- 
|  a|  b|  c|  a|  d|  b|  e|  b|
 --- --- --- --- --- --- --- --- 
|  1|  2|  3|  4|  5|  6|  7|  8|
|  9|  0|  1|  2|  3|  4|  5|  6|
 --- --- --- --- --- --- --- --- 

import scala.annotation.tailrec

def dedupeColumnNames(df: DataFrame): DataFrame = {
  
  @tailrec
  def dedupe(fixed_columns: List[String], columns: List[String]): List[String] = {
    if (columns.isEmpty) fixed_columns
    else {
      val count = columns.groupBy(identity).mapValues(_.size)(columns.head)
      if (count == 1) dedupe(columns.head :: fixed_columns, columns.tail)
      else dedupe(s"${columns.head}_${count}":: fixed_columns, columns.tail)
    }
  }
  
  val new_columns = dedupe(List.empty[String], df.columns.reverse.toList).toArray
  df.toDF(new_columns:_*)
}

data
  .transform(dedupeColumnNames)
  .show

 --- --- --- --- --- --- --- --- 
|  a|  b|  c|a_2|  d|b_2|  e|b_3|
 --- --- --- --- --- --- --- --- 
|  1|  2|  3|  4|  5|  6|  7|  8|
|  9|  0|  1|  2|  3|  4|  5|  6|
 --- --- --- --- --- --- --- ---