Spark Dataframe - Create 12 rows for each cell of a master table


I have a table containing Employee IDs, and I'd like to add a Month column containing 12 values (one for each month). In other words, I'd like to create a new table where there are 12 rows for each ID in my list.

Take the following example:

+-----+
|GFCID|
+-----+
|    1|
|    2|
|    3|
+-----+

+---------+
|Yearmonth|
+---------+
|   202101|
|   202102|
|   202203|
|   202204|
|   202205|
+---------+

My desired output is something along the lines of

ID      Month
1       Jan
1       Feb
1       Mar
2       Jan
2       Mar

and so on. I am using PySpark, and my current attempt is as follows:

data = [["1"], ["2"], ["3"]]
df = spark.createDataFrame(data, ["GFCID"])
df.show()

data2 = [["202101"], ["202102"], ["202203"], ["202204"], ["202205"]]
df2 = spark.createDataFrame(data2, ["Yearmonth"])
df2.show()

df3 = df.join(df2, df.GFCID == df2.Yearmonth, "outer")
df3.show()

And the output is

+-----+---------+
|GFCID|Yearmonth|
+-----+---------+
| null|   202101|
|    3|     null|
| null|   202205|
| null|   202102|
| null|   202204|
|    1|     null|
| null|   202203|
|    2|     null|
+-----+---------+

I understand this is wrong because there is no common key for the dataframes to join on. I would appreciate your help on this.
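(A classic workaround when two dataframes share no common key is to join them on a constant helper column, which makes every row of one side match every row of the other. A minimal sketch of that idea; the helper column name k is arbitrary:

from pyspark.sql import functions as F

# Add the same literal value to both sides, join on it, then drop it;
# this is equivalent to a Cartesian product of the two dataframes
df3 = df.withColumn("k", F.lit(1)) \
    .join(df2.withColumn("k", F.lit(1)), on="k") \
    .drop("k")
df3.show()

The answer below shows the more direct way to express this.)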

CodePudding user response:

Here is your code, modified to use the appropriate join for this case, crossJoin:

data = [["1"], ["2"], ["3"]]
df = spark.createDataFrame(data, ["GFCID"])
df.show()

data2 = [["202101"], ["202102"], ["202203"], ["202204"], ["202205"]]
df2 = spark.createDataFrame(data2, ["Yearmonth"])
df2.show()

# crossJoin produces the Cartesian product: every GFCID paired with every Yearmonth
df3 = df.crossJoin(df2)
df3.show()
+-----+---------+
|GFCID|Yearmonth|
+-----+---------+
|    1|   202101|
|    1|   202102|
|    1|   202203|
|    1|   202204|
|    1|   202205|
|    2|   202101|
|    2|   202102|
|    2|   202203|
|    2|   202204|
|    2|   202205|
|    3|   202101|
|    3|   202102|
|    3|   202203|
|    3|   202204|
|    3|   202205|
+-----+---------+
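Since the desired output shows month names rather than Yearmonth values, the cross-joined result can be converted afterwards. A minimal sketch, assuming the Yearmonth strings are always in yyyyMM format:

from pyspark.sql import functions as F

# Parse the yyyyMM string into a date, then render it as an abbreviated
# month name ("Jan", "Feb", ...)
df3.withColumn("Month", F.date_format(F.to_date("Yearmonth", "yyyyMM"), "MMM")) \
    .select("GFCID", "Month") \
    .show()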

Another way of doing it, without using a join:

from pyspark.sql import functions as F

# Build a literal array [1, ..., 12] and explode it into 12 rows per Yearmonth
df2.withColumn("GFCID", F.explode(F.array([F.lit(i) for i in range(1, 13)]))).show()
+---------+-----+
|Yearmonth|GFCID|
+---------+-----+
|   202101|    1|
|   202101|    2|
|   202101|    3|
|   202101|    4|
|   202101|    5|
|   202101|    6|
|   202101|    7|
|   202101|    8|
|   202101|    9|
|   202101|   10|
|   202101|   11|
|   202101|   12|
|   202102|    1|
|   202102|    2|
|   202102|    3|
|   202102|    4|
...
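On Spark 2.4+, the same idea can be written with F.sequence instead of building the array by hand, and applied to the GFCID table to get exactly the 12 month rows per ID that the question asks for. A minimal sketch:

from pyspark.sql import functions as F

# sequence(1, 12) builds the array [1, 2, ..., 12]; exploding it yields
# one row per month number for every GFCID
df.withColumn("Month", F.explode(F.sequence(F.lit(1), F.lit(12)))).show()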