PySpark: How to concatenate two distinct dataframes?


I have multiple dataframes that I need to concatenate side by side, column-wise, aligning rows by position. In pandas, we would typically write: pd.concat([df1, df2], axis=1).
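For reference, a minimal pandas sketch of the desired operation (the data here mirrors the Spark example below):

    import pandas as pd

    df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["sammy", "jill", "john"]})
    df2 = pd.DataFrame({"secNo": [101, 102, 103], "city": ["LA", "CA", "DC"]})

    # axis=1 places the frames side by side, aligning rows by positional index.
    print(pd.concat([df1, df2], axis=1))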

This thread, How to concatenate/append multiple Spark dataframes column wise in Pyspark?, looks close, but its answer:

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    df1_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
    df1 = spark.sparkContext.parallelize([(1, "sammy"), (2, "jill"), (3, "john")])
    df1 = spark.createDataFrame(df1, schema=df1_schema)

    df2_schema = StructType([StructField("secNo", IntegerType()), StructField("city", StringType())])
    df2 = spark.sparkContext.parallelize([(101, "LA"), (102, "CA"), (103, "DC")])
    df2 = spark.createDataFrame(df2, schema=df2_schema)

    # Combine the two schemas, zip the RDDs row by row, and merge each pair of rows.
    schema = StructType(df1.schema.fields + df2.schema.fields)
    df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0] + x[1])
    spark.createDataFrame(df1df2, schema).show()

fails with the following error when run on my data at scale: Can only zip RDDs with same number of elements in each partition
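For context, RDD.zip requires both RDDs to have the same number of partitions and the same number of elements in each partition, which createDataFrame does not guarantee on real data. A quick way to check, using df1 and df2 from above:

    # Both of these must match between the two RDDs for zip() to succeed:
    print(df1.rdd.getNumPartitions(), df2.rdd.getNumPartitions())
    print(df1.rdd.glom().map(len).collect())  # element counts per partition
    print(df2.rdd.glom().map(len).collect())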

How can I join two or more dataframes that have the same number of rows but are otherwise independent in content (they share the same row order but no common key columns)?

Example expected data looks like:

    +---+-----+        +-----+----+       +---+-----+-----+----+
    | id| name|        |secNo|city|       | id| name|secNo|city|
    +---+-----+        +-----+----+       +---+-----+-----+----+
    |  1|sammy|        |  101|  LA|   =>  |  1|sammy|  101|  LA|
    |  2| jill|        |  102|  CA|       |  2| jill|  102|  CA|
    |  3| john|        |  103|  DC|       |  3| john|  103|  DC|
    +---+-----+        +-----+----+       +---+-----+-----+----+

CodePudding user response:

You can create unique IDs with:

    from pyspark.sql.functions import expr

    # row_number() over an arbitrary ordering tags each row with 1, 2, 3, ...
    df1 = df1.withColumn("unique_id", expr("row_number() over (order by (select null))"))
    df2 = df2.withColumn("unique_id", expr("row_number() over (order by (select null))"))

Then you can left-join them on that ID and drop it (note the join keys are a Python list, not Scala's Seq):

    df1.join(df2, ["unique_id"], "left").drop("unique_id")

With the question's example data, the final output looks like:

    +---+-----+-----+----+
    | id| name|secNo|city|
    +---+-----+-----+----+
    |  1|sammy|  101|  LA|
    |  2| jill|  102|  CA|
    |  3| john|  103|  DC|
    +---+-----+-----+----+
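The same idea can also be written with the DataFrame window API instead of a SQL string. A self-contained sketch, assuming the question's df1 and df2:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(1, "sammy"), (2, "jill"), (3, "john")], ["id", "name"])
    df2 = spark.createDataFrame([(101, "LA"), (102, "CA"), (103, "DC")], ["secNo", "city"])

    # row_number() needs an ordering; monotonically_increasing_id() provides one
    # that follows the rows' current partition order.
    w = Window.orderBy(F.monotonically_increasing_id())
    df1 = df1.withColumn("unique_id", F.row_number().over(w))
    df2 = df2.withColumn("unique_id", F.row_number().over(w))

    df1.join(df2, ["unique_id"], "left").drop("unique_id").show()

Be aware that an unpartitioned window pulls all rows into a single partition, and neither variant guarantees the original row order on large, already-partitioned data.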