I have the following dataframe:
| Timestamp | info |
------------------- ----------
|2016-01-01 17:54:30| 8 |
|2016-02-01 12:16:18| 2 |
|2016-03-01 12:17:57| 1 |
|2016-04-01 10:05:21| 2 |
|2016-05-11 18:58:25| 7 |
|2016-06-11 11:18:29| 6 |
|2016-07-01 12:05:21| 3 |
|2016-08-11 11:58:25| 2 |
|2016-09-11 15:18:29| 9 |
I would like to create a new column named count which counts in a window(-2, 0) (current row and previous two) how many values are > 5 (in the first two rows where I cannot perform the operation I would put 0).
The resulting table should be:
| Timestamp | info | count |
------------------- ---------- ----------
|2016-01-01 17:54:30| 8 | 0 |
|2016-02-01 12:16:18| 2 | 0 |
|2016-03-01 12:17:57| 1 | 1 |
|2016-04-01 10:05:21| 2 | 0 |
|2016-05-11 18:58:25| 7 | 1 |
|2016-06-11 11:18:29| 6 | 2 |
|2016-07-01 12:05:21| 3 | 2 |
|2016-08-11 11:58:25| 2 | 1 |
|2016-09-11 15:18:29| 9 | 1 |
I tried to do this but it didn't work:
w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn("count", F.when((F.count("info").over(w) > 5), F.count("info").over(w) > 5).otherwise(0))
CodePudding user response:
The following would work if you don't mind calculations performed for the first 2 rows.
w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn('count', F.count(F.when(F.col('info') > 5, 1)).over(w))
df_input.show()
# ------------------- ---- -----
# | Timestamp|info|count|
# ------------------- ---- -----
# |2016-01-01 17:54:30| 8| 1|
# |2016-02-01 12:16:18| 2| 1|
# |2016-03-01 12:17:57| 1| 1|
# |2016-04-01 10:05:21| 2| 0|
# |2016-05-11 18:58:25| 7| 1|
# |2016-06-11 11:18:29| 6| 2|
# |2016-07-01 12:05:21| 3| 2|
# |2016-08-11 11:58:25| 2| 1|
# |2016-09-11 15:18:29| 9| 1|
# ------------------- ---- -----
If you need 2 first rows to be 0, without changing the window, you can use this when
condition:
w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn(
'count',
F.when(F.size(F.collect_list('info').over(w)) == 3, F.count(F.when(F.col('info') > 5, 1)).over(w))
.otherwise(0)
)
df_input.show()
# ------------------- ---- -----
# | Timestamp|info|count|
# ------------------- ---- -----
# |2016-01-01 17:54:30| 8| 0|
# |2016-02-01 12:16:18| 2| 0|
# |2016-03-01 12:17:57| 1| 1|
# |2016-04-01 10:05:21| 2| 0|
# |2016-05-11 18:58:25| 7| 1|
# |2016-06-11 11:18:29| 6| 2|
# |2016-07-01 12:05:21| 3| 2|
# |2016-08-11 11:58:25| 2| 1|
# |2016-09-11 15:18:29| 9| 1|
# ------------------- ---- -----