I have multiple dataframes that I need to concatenate together side by side (column-wise), matching them row for row. In pandas, we would typically write: pd.concat([df1, df2], axis=1)
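For reference, a minimal pandas sketch of the side-by-side concatenation I am after (assuming both frames have the same length and a default integer index):

import pandas as pd
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["sammy", "jill", "john"]})
df2 = pd.DataFrame({"secNo": [101, 102, 103], "city": ["LA", "CA", "DC"]})
# axis=1 concatenates column-wise, matching rows by index position
combined = pd.concat([df1, df2], axis=1)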
This thread: How to concatenate/append multiple Spark dataframes column wise in Pyspark? appears close, but its answer:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

df1_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
df1 = spark.sparkContext.parallelize([(1, "sammy"), (2, "jill"), (3, "john")])
df1 = spark.createDataFrame(df1, schema=df1_schema)

df2_schema = StructType([StructField("secNo", IntegerType()), StructField("city", StringType())])
df2 = spark.sparkContext.parallelize([(101, "LA"), (102, "CA"), (103, "DC")])
df2 = spark.createDataFrame(df2, schema=df2_schema)

# Merge the two schemas and zip the underlying RDDs row by row
schema = StructType(df1.schema.fields + df2.schema.fields)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0] + x[1])
spark.createDataFrame(df1df2, schema).show()
This yields the following error when run on my data at scale: Can only zip RDDs with same number of elements in each partition
How can I join two or more dataframes that have the same number of rows but are otherwise independent in content (they share a similar repeating structure/order but contain no shared data)?
Example expected data looks like:
+---+-----+   +-----+----+      +---+-----+-----+----+
| id| name|   |secNo|city|      | id| name|secNo|city|
+---+-----+   +-----+----+      +---+-----+-----+----+
|  1|sammy|   |  101|  LA|  =>  |  1|sammy|  101|  LA|
|  2| jill|   |  102|  CA|      |  2| jill|  102|  CA|
|  3| john|   |  103|  DC|      |  3| john|  103|  DC|
+---+-----+   +-----+----+      +---+-----+-----+----+
CodePudding user response:
You can create unique row IDs on both dataframes with row_number():

from pyspark.sql.functions import expr

df1 = df1.withColumn("unique_id", expr("row_number() over (order by (select null))"))
df2 = df2.withColumn("unique_id", expr("row_number() over (order by (select null))"))

then you can left join them on that ID and drop it afterwards:

df1.join(df2, ["unique_id"], "left").drop("unique_id")
The final output, shown here for a different pair of example dataframes (columns id/name and age/address), looks like:
+---+----+---+-------+
| id|name|age|address|
+---+----+---+-------+
|  1|   a|  7|      x|
|  2|   b|  8|      y|
|  3|   c|  9|      z|
+---+----+---+-------+
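Putting the answer together with the question's df1/df2, here is a minimal end-to-end sketch. It is one way to do it, not the only one: it swaps the SQL-style row_number() expression for a DataFrame-API window (Window.orderBy(monotonically_increasing_id())), and the join-key column name idx is my own choice:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "sammy"), (2, "jill"), (3, "john")], ["id", "name"])
df2 = spark.createDataFrame([(101, "LA"), (102, "CA"), (103, "DC")], ["secNo", "city"])

# Number the rows of each dataframe in their current order.
# (Spark warns that a window without PARTITION BY moves all data to a single
# partition; acceptable for a sketch, but worth knowing about at scale.)
w = Window.orderBy(F.monotonically_increasing_id())
df1_idx = df1.withColumn("idx", F.row_number().over(w))
df2_idx = df2.withColumn("idx", F.row_number().over(w))

# Join on the positional index and drop it to get the side-by-side result:
# id, name, secNo, city
result = df1_idx.join(df2_idx, on="idx", how="inner").drop("idx")
result.show()

As with the row_number() trick above, this assumes Spark hands out the row numbers in each dataframe's current order; that generally holds for small or freshly created data, but there is no hard ordering guarantee, so if the rows share a real key it is safer to join on that instead.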