There are several good answers about managing duplicate columns from joined dataframes, eg (How to avoid duplicate columns after join?), but what if I'm simply presented a DataFrame with duplicate columns that I have to deal with. I have no control over the processes leading up to this point.
What I have:
val data = Seq((1,2),(3,4)).toDF("a","a")
data.show
--- ---
| a| a|
--- ---
| 1| 2|
| 3| 4|
--- ---
What I want:
--- ---
| a|a_2|
--- ---
| 1| 2|
| 3| 4|
--- ---
withColumnRenamed("a","a_2")
does not work, for obvious reasons.
CodePudding user response:
The simplest way I found to do this is:
val data = Seq((1,2),(3,4)).toDF("a","a")
val deduped = data.toDF("a","a_2")
deduped.show
--- ---
| a|a_2|
--- ---
| 1| 2|
| 3| 4|
--- ---
For a more general solution:
val data = Seq(
(1,2,3,4,5,6,7,8),
(9,0,1,2,3,4,5,6)
).toDF("a","b","c","a","d","b","e","b")
data.show
--- --- --- --- --- --- --- ---
| a| b| c| a| d| b| e| b|
--- --- --- --- --- --- --- ---
| 1| 2| 3| 4| 5| 6| 7| 8|
| 9| 0| 1| 2| 3| 4| 5| 6|
--- --- --- --- --- --- --- ---
import scala.annotation.tailrec
def dedupeColumnNames(df: DataFrame): DataFrame = {
@tailrec
def dedupe(fixed_columns: List[String], columns: List[String]): List[String] = {
if (columns.isEmpty) fixed_columns
else {
val count = columns.groupBy(identity).mapValues(_.size)(columns.head)
if (count == 1) dedupe(columns.head :: fixed_columns, columns.tail)
else dedupe(s"${columns.head}_${count}":: fixed_columns, columns.tail)
}
}
val new_columns = dedupe(List.empty[String], df.columns.reverse.toList).toArray
df.toDF(new_columns:_*)
}
data
.transform(dedupeColumnNames)
.show
--- --- --- --- --- --- --- ---
| a| b| c|a_2| d|b_2| e|b_3|
--- --- --- --- --- --- --- ---
| 1| 2| 3| 4| 5| 6| 7| 8|
| 9| 0| 1| 2| 3| 4| 5| 6|
--- --- --- --- --- --- --- ---