I'm trying to convert a Python notebook into a PySpark pipeline and I'm blocked by what seems like a simple problem.
I have this dataframe after a count aggregation grouped by Id:
| Id | count |
|----|-------|
| 0  | 5     |
| 1  | 3     |
| 4  | 6     |

And I want this:

| Id | count |
|----|-------|
| 0  | 5     |
| 1  | 3     |
| 2  | 0     |
| 3  | 0     |
| 4  | 6     |
| 5  | 0     |
I have tried adding a [0,1,2,3,4,5] array to each row, then doing an explode_outer on that array, then looking for a way to keep only the rows I need, but that seems overly complicated for such a simple case.
Do you have any tips? Thanks in advance.
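Roughly, the idea I tried looks like this (just a sketch, assuming my aggregated dataframe is called df and the target range is 0..5):

from pyspark.sql import functions as F

# Attach the full Id range as an array, explode it into one row per Id,
# deduplicate, then left-join the counts back and fill the gaps with 0.
all_ids = (df.select(F.explode(F.array(*[F.lit(i) for i in range(6)])).alias('Id'))
             .distinct())
result = all_ids.join(df, on='Id', how='left').fillna(0, subset=['count']).orderBy('Id')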
CodePudding user response:
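Assuming, for a runnable example, that original was built like this (hypothetical data matching the output below):

original = spark.createDataFrame([(1, 12), (3, 15)], ['id', 'count'])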
original.show()
+---+-----+
| id|count|
+---+-----+
|  1|   12|
|  3|   15|
+---+-----+
df = spark.createDataFrame([(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)], ['id', 'default_count'])
df.show()
+---+-------------+
| id|default_count|
+---+-------------+
|  0|            0|
|  1|            0|
|  2|            0|
|  3|            0|
|  4|            0|
|  5|            0|
+---+-------------+
from pyspark.sql import functions as F

result = (original.join(df, on='id', how='right')
          .withColumn('count', F.coalesce(F.col('count'), F.col('default_count')))
          .orderBy('id')
          .drop('default_count'))
result.show()
+---+-----+
| id|count|
+---+-----+
|  0|    0|
|  1|   12|
|  2|    0|
|  3|   15|
|  4|    0|
|  5|    0|
+---+-----+
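As an aside, for a large or dynamic Id range the defaults dataframe can be generated instead of typed out (a sketch, assuming the 0..5 range of the example):

from pyspark.sql import functions as F

# spark.range(6) yields a single bigint column named 'id' with values 0..5
df = spark.range(6).withColumn('default_count', F.lit(0))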
CodePudding user response:
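Setting up the asker's dataframe first (recreated here so the snippet runs end to end):

df = spark.createDataFrame([(0, 5), (1, 3), (4, 6)], ['Id', 'count'])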
df.show()
+---+-----+
| Id|count|
+---+-----+
|  0|    5|
|  1|    3|
|  4|    6|
+---+-----+
extra_rows = spark.createDataFrame([(2, 0), (3, 0), (5, 0)], ['Id', 'count'])
df.unionByName(extra_rows).orderBy('Id').show()
+---+-----+
| Id|count|
+---+-----+
|  0|    5|
|  1|    3|
|  2|    0|
|  3|    0|
|  4|    6|
|  5|    0|
+---+-----+
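If you'd rather not list the missing Ids by hand, they can be derived with an anti-join (a sketch, assuming the target range 0..5 from the example):

from pyspark.sql import functions as F

# Build the full Id range, keep only the Ids absent from df, and give them count 0.
extra_rows = (spark.range(6).withColumnRenamed('id', 'Id')
              .join(df, on='Id', how='left_anti')
              .withColumn('count', F.lit(0).cast('long')))  # cast to match df's long count column
df.unionByName(extra_rows).orderBy('Id').show()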