I have a spark dataframe similar to the one below:
OLD DATAFRAME
+----+-------+
| ID | count |
+----+-------+
| 0 | 12000 |
| 1 | 12000 |
| 2 | 12000 |
| 3 | 12000 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
| 7 | 0 |
| 8 | 1400 |
| 9 | 1400 |
+----+-------+
I need to fill the 0 values with the most recent non-zero value in the column. Example:
TARGET DATAFRAME
+----+-------+
| ID | count |
+----+-------+
| 0 | 12000 |
| 1 | 12000 |
| 2 | 12000 |
| 3 | 12000 |
| 4 | 12000 |
| 5 | 12000 |
| 6 | 12000 |
| 7 | 12000 |
| 8 | 1400 |
| 9 | 1400 |
+----+-------+
How do I go about doing this?
CodePudding user response:
By using last we can achieve the required result:
from io import StringIO

import pandas as pd
from pyspark.sql import functions as F, Window

s2 = StringIO("""
ID|count
0|13000
1|11000
2|14000
3|12000
4|0
5|0
6|0
7|0
8|1400
9|1400 """)
dfp2 = pd.read_csv(s2, sep='|')
dfs2 = spark.createDataFrame(dfp2)  # assumes an active SparkSession bound to `spark`

# Step 1: turn the 0s into nulls.
# Step 2: where the column is null, take the last non-null value ordered by ID.
dfs2.withColumn(
    "count", F.when(F.col("count") == 0, None).otherwise(F.col("count"))
).withColumn(
    "count",
    F.when(
        F.col("count").isNull(),
        F.last(F.col("count"), True).over(Window.orderBy("ID")),
    ).otherwise(F.col("count")),
).show()
#output
+---+-----+
| ID|count|
+---+-----+
| 0|13000|
| 1|11000|
| 2|14000|
| 3|12000|
| 4|12000|
| 5|12000|
| 6|12000|
| 7|12000|
| 8| 1400|
| 9| 1400|
+---+-----+
I changed some values in the input to show that only the most recent non-zero value is carried forward, not a constant. First I replaced the 0s with nulls; then, wherever the column was null, I picked the last non-null value ordered by ID.
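The same replace-then-fill logic can be sketched in plain pandas (a single-machine equivalent, shown only to illustrate the two steps; not suitable for data that doesn't fit in memory):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": range(10),
    "count": [13000, 11000, 14000, 12000, 0, 0, 0, 0, 1400, 1400],
})
# Step 1: treat 0 as missing; Step 2: forward-fill with the last valid value.
df["count"] = df["count"].replace(0, pd.NA).ffill()
print(df["count"].tolist())
# → [13000, 11000, 14000, 12000, 12000, 12000, 12000, 12000, 1400, 1400]
```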
CodePudding user response:
This is the easiest way I can think of.
from pyspark.sql import functions as F, Window as W
# Wrap non-zero values in when() so zeros become null, then let
# last(..., ignorenulls=True) carry the previous non-null value forward.
df = df.withColumn(
    'count',
    F.last(F.when(F.col('count') != 0, F.col('count')), True).over(W.orderBy('ID'))
)
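Semantically, last-non-null over an ordered window is just a running "last seen non-zero" scan. A plain-Python sketch of what the window computes row by row (the function name is mine, for illustration):

```python
def forward_fill_zeros(counts):
    """Replace each 0 with the most recent non-zero value seen so far."""
    last_seen = None
    out = []
    for c in counts:
        if c != 0:
            last_seen = c  # remember the latest non-zero value
        out.append(last_seen if c == 0 else c)
    return out

print(forward_fill_zeros([12000, 12000, 0, 0, 1400]))
# → [12000, 12000, 12000, 12000, 1400]
```

One Spark-specific caveat: an ordered window with no partitionBy (like `W.orderBy('ID')` above) pulls every row into a single partition, so it can be slow on large data.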