Append/Union multiple dataframes in Scala


I'm coming from a Python background and trying to convert a function over to Scala.

In this dummy example, I have multiple (an unknown number of) dataframes that I need to union together.

%python

list_of_dfs = [
    spark.createDataFrame(
         [('A', 'C'),    
          ('B', 'E')
         ], ['dummy1','dummy2']),
    spark.createDataFrame(
             [('F', 'G'),    
              ('H', 'I')
             ], ['dummy1','dummy2'])]

for i, df in enumerate(list_of_dfs):
    if i == 0:
        union_df = df
    else:
        union_df = union_df.unionAll(df)
        
union_df.display()

Works just how I want it to. The line `union_df = union_df.unionAll(df)` is specifically what I'm having trouble reproducing in Scala.

    %scala
    ... outer loop creates each iteration's dataframe
    if(i==0) {
      val union_df=df 
    } else{
      val union_df=union_df.union(df)
    }  

I get the error "error: recursive value union_df needs type". I'm having trouble translating the documentation into a solution, because the type is a DataFrame. Obviously I need to actually learn some Scala, but this is the bridge I'm trying to cross right now. Appreciate any help.

CodePudding user response:

You don't need to manually manage a loop to go through the collection in Scala. Since you're trying to go from many values to one, we can use the reduce method:

  val dfs: Iterable[DataFrame] = ???
  val union_df = dfs.reduce(_ union _)
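The same reduce pattern works on any Scala collection. As a minimal runnable sketch (plain `Seq`s stand in for Spark DataFrames here, and `++` plays the role of `union`, since a standalone example can't spin up a SparkSession):

```scala
// Sketch: reduce combines a collection pairwise, left to right.
// Seq stands in for DataFrame, ++ for union.
object ReduceUnionSketch extends App {
  val dfs: List[Seq[(String, String)]] = List(
    Seq(("A", "C"), ("B", "E")),
    Seq(("F", "G"), ("H", "I"))
  )
  // Equivalent in shape to dfs.reduce(_ union _) on DataFrames
  val combined = dfs.reduce(_ ++ _)
  println(combined)
}
```

Note that `reduce` throws on an empty collection, which is fine here since you always have at least one dataframe; otherwise you'd reach for `reduceOption` or `foldLeft`.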

CodePudding user response:

In your Scala code you have `val union_df = union_df.union(df)` -> you are defining a value and referring to it inside its own definition. A `val` is also immutable, so it can't be reassigned on later iterations.

It should be something like this, with `union_df` declared as a mutable `var` before the loop:

if(i==0) {
   union_df = df
} else{
   union_df = union_df.union(df)
}

The previous answer is better though: use reduce, or a foldLeft/foldRight, instead.
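For comparison, foldLeft mirrors the "start empty, append each" pattern from the Python loop, threading an accumulator through the collection. A minimal sketch, again using plain `Seq`s and `++` as stand-ins for DataFrames and `union`:

```scala
// Sketch: foldLeft threads an accumulator through the collection.
// With Spark the equivalent shape would be dfs.foldLeft(emptyDf)(_ union _),
// given an empty DataFrame with a matching schema as the starting value.
object FoldLeftUnionSketch extends App {
  val dfs = List(Seq("A", "B"), Seq("C"), Seq("D", "E"))
  // Start from an empty Seq and append each collection in turn.
  val combined = dfs.foldLeft(Seq.empty[String])(_ ++ _)
  println(combined)
}
```

Unlike `reduce`, `foldLeft` handles an empty input safely: it simply returns the starting value.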

CodePudding user response:

I'll accept Jarrod Baker's answer since I'm sure it's more appropriate.

But what ended up working for me was instantiating `union_df` as an empty DataFrame before the loop and then doing the appends:

%scala
var union_df = spark.emptyDataFrame  // declared before the loop so it survives iterations
... outer loop creates each iteration's dataframe
if(i==0) {
  union_df = df
} else{
  union_df = union_df.union(df)
}