I have two PySpark DataFrame objects that I wish to concatenate. One of the DataFrames, df_a, has a column unique_id derived using pyspark.sql.functions.monotonically_increasing_id(). The other DataFrame, df_b, does not. I want to append the rows of df_b to df_a, but I need to generate values for the unique_id column that do not coincide with any of the values in df_a.unique_id.
import pyspark.sql.functions as F

# df_a already carries a unique_id column.
df_a = spark.createDataFrame(
    [
        (1, "a", 42949672960),
        (2, "b", 85899345920),
        (3, "c", 128849018880)
    ],
    ["number", "letter", "unique_id"]
)

# df_b has no unique_id yet.
df_b = spark.createDataFrame(
    [
        (3, "c"),
        (4, "c"),
        (5, "d")
    ],
    ["number", "letter"]
)

# Naive approach: generate fresh ids for df_b and union.
df_b = df_b.withColumn("unique_id", F.monotonically_increasing_id())
df = df_a.union(df_b)
df.show()
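
For what it's worth, a quick way to confirm that this naive union risks collisions is to count duplicated ids after the append; this is a minimal check sketch using the df built above:

# Count ids that appear more than once after the naive union.
# monotonically_increasing_id() restarts at partition_id * 2**32 for each
# new DataFrame, so nothing stops df_b's fresh ids from landing on values
# already present in df_a.
collisions = df.groupBy("unique_id").count().filter("count > 1")
collisions.show()  # any rows here are id collisions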
I looked to see if pyspark.sql.functions.monotonically_increasing_id() took a parameter enforcing a minimum value, but it does not.
One final thing to note: df_a is a massive DataFrame that needs to be appended to regularly. If I needed to reassign unique ids to df_a using a function other than pyspark.sql.functions.monotonically_increasing_id() to make a potential solution work long-term, I could do so once, but not every time I append new data.
Any direction would be appreciated. Thank you!
Answer:
You can always add a constant to monotonically_increasing_id():

# Find the largest id already in use, then shift df_b's generated ids past it.
n = df_a.select(F.max('unique_id').alias('max_n')).first().max_n
df_b = df_b.withColumn("unique_id", F.monotonically_increasing_id() + F.lit(n + 1))
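
To spell out why this stays collision-free: monotonically_increasing_id() only produces non-negative values, so every id generated for df_b is at least n + 1, strictly greater than anything already in df_a. A minimal end-to-end sketch of the append, recomputing the offset on every append as your use case requires (it also assumes df_a might occasionally be empty, in which case max returns None):

import pyspark.sql.functions as F

# One pass over df_a to find the current maximum id.
row = df_a.select(F.max("unique_id").alias("max_n")).first()
n = row.max_n if row.max_n is not None else -1  # empty df_a -> new ids start at 0

# Shift df_b's generated ids past everything already in df_a.
df_b_with_id = df_b.withColumn(
    "unique_id", F.monotonically_increasing_id() + F.lit(n + 1)
)

df = df_a.union(df_b_with_id)

Note that the resulting ids are unique but not consecutive (monotonically_increasing_id() leaves large gaps across partitions), which is fine here since only uniqueness is required.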