I'm coming from a Python background, trying to convert a function over into Scala.
In this dummy example, I have an unknown number of DataFrames that I need to union together.
%python
list_of_dfs = [
    spark.createDataFrame(
        [('A', 'C'),
         ('B', 'E')
        ], ['dummy1', 'dummy2']),
    spark.createDataFrame(
        [('F', 'G'),
         ('H', 'I')
        ], ['dummy1', 'dummy2'])]
for i, df in enumerate(list_of_dfs):
    if i == 0:
        union_df = df
    else:
        union_df = union_df.unionAll(df)

union_df.display()
Works just how I want it to. The "union_df = union_df.unionAll(df)" line is specifically what I'm having trouble reproducing in Scala.
%scala
// ... outer loop creates each iteration's DataFrame
if (i == 0) {
  val union_df = df
} else {
  val union_df = union_df.union(df)
}
I get this "error: recursive value union_df needs type". Which I'm having trouble translating the documentation in to my solution, because the type is a dataframe. Obviously I need to actually learn something about scala, but this is the bridge I'm trying to cross right now. Appreciate any help.
CodePudding user response:
You don't need to manually manage a loop to go through the collection in Scala. Since you're trying to go from many values to one, we can use the reduce method:
import org.apache.spark.sql.DataFrame

val dfs: Iterable[DataFrame] = ???
val union_df = dfs.reduce(_ union _)
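For example, applied to the two sample DataFrames from the question (a minimal sketch, assuming this runs in a notebook where spark is an active SparkSession, so spark.implicits is available for toDF):

import org.apache.spark.sql.DataFrame
import spark.implicits._  // brings the .toDF convenience into scope

// Recreate the question's two sample DataFrames in Scala.
val dfs: Seq[DataFrame] = Seq(
  Seq(("A", "C"), ("B", "E")).toDF("dummy1", "dummy2"),
  Seq(("F", "G"), ("H", "I")).toDF("dummy1", "dummy2")
)

// reduce repeatedly unions pairs of DataFrames until one remains.
val union_df = dfs.reduce(_ union _)
union_df.show()

One caveat: reduce throws on an empty collection, so if the list can be empty, check nonEmpty first or use dfs.reduceOption(_ union _).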
CodePudding user response:
In your Scala code you have val union_df = union_df.union(df)
-> you are defining a new value and referencing it inside its own definition, which is why the compiler complains about a recursive value.
It should be something like this, with union_df declared as a var before the loop so it can be reassigned:
var union_df: DataFrame = null  // declared once, outside the loop
// ... inside the loop:
if (i == 0) {
  union_df = df
} else {
  union_df = union_df.union(df)
}
The previous answer is better; use the reduce or foldLeft/foldRight functions instead, as sketched below.
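A foldLeft version might look like this (a sketch, assuming dfs is a non-empty collection of DataFrames; seeding the fold with the first element avoids unioning against an incompatible empty DataFrame):

// Seed with the first DataFrame, then fold the rest in via union.
val union_df = dfs.tail.foldLeft(dfs.head)((acc, df) => acc.union(df))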
CodePudding user response:
I'll accept Jarrod Baker's answer since I'm sure it's more appropriate.
But what ended up working for me was instantiating it as an empty DataFrame and then doing the appends in the loop.
%scala
var union_df = spark.emptyDataFrame  // placeholder; replaced on the first iteration
// ... outer loop creates each iteration's DataFrame
if (i == 0) {
  union_df = df
} else {
  union_df = union_df.union(df)
}
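Pieced together, a complete version of that pattern might look like this (a sketch; list_of_dfs is a hypothetical stand-in for whatever the elided outer loop actually produces):

import org.apache.spark.sql.DataFrame

// Hypothetical collection standing in for the elided outer-loop source.
val list_of_dfs: Seq[DataFrame] = ???

var union_df = spark.emptyDataFrame   // placeholder only; replaced outright below
for ((df, i) <- list_of_dfs.zipWithIndex) {
  if (i == 0) {
    union_df = df                     // first DataFrame seeds the result
  } else {
    union_df = union_df.union(df)
  }
}
union_df.show()

Note that spark.emptyDataFrame has zero columns, so this only works because the i == 0 branch replaces it outright instead of unioning with it.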