Fill rows with previous value when value = 0


I have a spark dataframe similar to the one below:

OLD DATAFRAME
+----+-------+
| ID | count |
+----+-------+
| 0  | 12000 |
| 1  | 12000 |
| 2  | 12000 |
| 3  | 12000 |
| 4  | 0     |
| 5  | 0     |
| 6  | 0     |
| 7  | 0     |
| 8  | 1400  |
| 9  | 1400  |
+----+-------+

I need to fill the 0 values with the most recent non-zero value in the column. Example:

TARGET DATAFRAME
+----+-------+
| ID | count |
+----+-------+
| 0  | 12000 |
| 1  | 12000 |
| 2  | 12000 |
| 3  | 12000 |
| 4  | 12000 |
| 5  | 12000 |
| 6  | 12000 |
| 7  | 12000 |
| 8  | 1400  |
| 9  | 1400  |
+----+-------+

How do I go about doing this?

CodePudding user response:

By using `last` we can achieve the required result:

from io import StringIO

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.window import Window

s2 = StringIO("""
ID|count
0|13000
1|11000
2|14000
3|12000
4|0
5|0
6|0
7|0
8|1400
9|1400""")

dfp2 = pd.read_csv(s2, sep='|')
dfs2 = spark.createDataFrame(dfp2)

# Step 1: turn 0 into null; step 2: fill each null with the
# last non-null value when ordered by ID.
dfs2.withColumn(
    "count", F.when(F.col("count") == 0, None).otherwise(F.col("count"))
).withColumn(
    "count",
    F.when(
        F.col("count").isNull(),
        F.last(F.col("count"), True).over(Window.orderBy("ID")),
    ).otherwise(F.col("count")),
).show()

#output
+---+-----+
| ID|count|
+---+-----+
|  0|13000|
|  1|11000|
|  2|14000|
|  3|12000|
|  4|12000|
|  5|12000|
|  6|12000|
|  7|12000|
|  8| 1400|
|  9| 1400|
+---+-----+

I changed some values in the input to show that only the most recent non-zero value gets copied forward.

First I changed the 0s to nulls; then, wherever the column was null, I picked the last non-null value ordered by ID.

CodePudding user response:

This is the easiest way I can think of.

from pyspark.sql import functions as F, Window as W

# when() turns 0 into null; last(..., True) ignores nulls, so each row
# gets the most recent non-zero value ordered by ID.
df = df.withColumn(
    'count',
    F.last(F.when(F.col('count') != 0, F.col('count')), True).over(W.orderBy('ID'))
)
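For a quick sanity check of the same forward-fill logic without a Spark session, here is a minimal pandas sketch (an illustration only, assuming the data fits in driver memory): mask the zeros to NaN, then forward-fill with `ffill`.

```python
import pandas as pd

# Reproduce the question's data: zeros at IDs 4-7 should become 12000.
df = pd.DataFrame({
    "ID": range(10),
    "count": [12000] * 4 + [0] * 4 + [1400] * 2,
})

# mask(count == 0) sets zeros to NaN; ffill copies the most recent
# non-null value downward, mirroring last(..., ignorenulls=True).
df["count"] = df["count"].mask(df["count"] == 0).ffill().astype(int)

print(df["count"].tolist())  # [12000, ..., 12000, 1400, 1400]
```

Unlike the window-based Spark version, this runs on a single machine, but it makes the intended result easy to verify.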