Create Spark dataframe from groupby and known sequence vector

I have the following dataframe

| Col1 | Col2 | Value  |
|------|------|--------|
| Key1 | Key2 | 45.0   |
| Key2 | Key2 | 34.0   |
| Key2 | Key3 | 121.0  |

and I have the sequence of values [1,2,3]. I would like to generate the following dataframe

| Col1 | Col2 | Value  |
|------|------|--------|
| Key1 | Key2 |  1     |
| Key1 | Key2 |  2     |
| Key1 | Key2 |  3     |
| Key2 | Key2 |  1     |
| Key2 | Key2 |  2     |
| Key2 | Key2 |  3     |
| Key2 | Key3 |  1     |
| Key2 | Key3 |  2     |
| Key2 | Key3 |  3     |

The (Col1, Col2) pairs are generated from a groupBy and, in the new dataframe, I would like each pair returned by the groupBy to be repeated three times, once for each value of the sequence. I will then use the combination of (Col1, Col2, Value) to left join that dataframe with another one.

One solution I have thought of is to take the union of several dataframes, each holding the distinct (Col1, Col2) pairs together with one value of the sequence, that is:

```python
df.groupBy(['Col1', 'Col2']).agg(F.lit(1).alias('Value')) \
    .union(df.groupBy(['Col1', 'Col2']).agg(F.lit(2).alias('Value')))
```

I would like to find a more elegant solution, as this one looks very similar to what I would do in SQL.
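Generalized over the whole sequence, that union approach would look something like the following sketch, which folds the per-value dataframes together with functools.reduce:

```python
from functools import reduce

from pyspark.sql import functions as F

values = [1, 2, 3]

# One dataframe of distinct (Col1, Col2) pairs per sequence value,
# folded together with union.
parts = [df.groupBy('Col1', 'Col2').agg(F.lit(v).alias('Value')) for v in values]
df1 = reduce(lambda a, b: a.union(b), parts)
```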

CodePudding user response:

Simply create a literal array column and explode it:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([
    ("Key1", "Key2", 45.0), ("Key2", "Key2", 34.0),
    ("Key2", "Key3", 121.0)], ("Col1", "Col2", "Value"))

values = [1, 2, 3]

# Build a literal array [1, 2, 3] and explode it into one row per value,
# overwriting the original Value column.
df1 = df.withColumn("Value", F.explode(F.array(*[F.lit(x) for x in values])))

df1.show()
# +----+----+-----+
# |Col1|Col2|Value|
# +----+----+-----+
# |Key1|Key2|    1|
# |Key1|Key2|    2|
# |Key1|Key2|    3|
# |Key2|Key2|    1|
# |Key2|Key2|    2|
# |Key2|Key2|    3|
# |Key2|Key3|    1|
# |Key2|Key3|    2|
# |Key2|Key3|    3|
# +----+----+-----+
```
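Note that the sample dataframe already has one row per (Col1, Col2) pair; if the pairs can repeat in the real data, deduplicate first, e.g. with df.select('Col1', 'Col2').distinct(), before exploding. Since the stated goal is to left join on (Col1, Col2, Value), a minimal sketch of that step follows, where other_df is a hypothetical dataframe assumed to carry those three key columns plus a payload column:

```python
# Hypothetical second dataframe; the name, schema and data are
# assumptions for illustration only.
other_df = spark.createDataFrame(
    [("Key1", "Key2", 1, 10.5), ("Key2", "Key3", 3, 7.2)],
    ("Col1", "Col2", "Value", "Payload"))

# Left join on the full (Col1, Col2, Value) key; unmatched rows of
# df1 keep a null Payload.
joined = df1.join(other_df, on=["Col1", "Col2", "Value"], how="left")
```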