add specific rows on pyspark

Posted: 03-05

I'm trying to turn a Python notebook into a PySpark pipeline and I'm blocked by what seems like a simple problem.

I have this dataframe after a count aggregation grouped by Id:

| Id |  count |
| 0  |   5    |
| 1  |   3    |
| 4  |   6    |

And I want this :

| Id |  count |
| 0  |   5    |
| 1  |   3    |
| 2  |   0    |
| 3  |   0    |
| 4  |   6    |
| 5  |   0    |

I have tried adding a [0,1,3,4,5] array to each row, then exploding it with `explode_outer`, then looking for a way to keep only the rows I need, but that seems overly complicated for such a simple case.

Do you have any tips? Thanks in advance.

CodePudding user response:

import pyspark.sql.functions as F

original.show()
+---+-----+
| id|count|
+---+-----+
|  1|   12|
|  3|   15|
+---+-----+


df = spark.createDataFrame([(0,0),(1,0),(2,0),(3,0),(4,0),(5,0)],['id', 'default_count'])
df.show()
+---+-------------+
| id|default_count|
+---+-------------+
|  0|            0|
|  1|            0|
|  2|            0|
|  3|            0|
|  4|            0|
|  5|            0|
+---+-------------+

result = (original.join(df, on='id', how='right')
                  .withColumn('count', F.coalesce(F.col('count'), F.col('default_count')))
                  .orderBy('id')
                  .drop('default_count'))
result.show()

+---+-----+
| id|count|
+---+-----+
|  0|    0|
|  1|   12|
|  2|    0|
|  3|   15|
|  4|    0|
|  5|    0|
+---+-----+

CodePudding user response:

df.show()

+---+-----+
| Id|count|
+---+-----+
|  0|    5|
|  1|    3|
|  4|    6|
+---+-----+

extra_rows = spark.createDataFrame([(2, 0),
                                    (3, 0),
                                    (5, 0)],
                                   ['Id', 'count'])

df.unionByName(extra_rows).orderBy('Id').show()

+---+-----+
| Id|count|
+---+-----+
|  0|    5|
|  1|    3|
|  2|    0|
|  3|    0|
|  4|    6|
|  5|    0|
+---+-----+