Home > Blockchain >  Count occurrences of a filtered value in a window
Count occurrences of a filtered value in a window

Time:06-29

I have the following dataframe:

|     Timestamp     |   info   |
 ------------------- ---------- 
|2016-01-01 17:54:30|     8    |
|2016-02-01 12:16:18|     2    |
|2016-03-01 12:17:57|     1    |
|2016-04-01 10:05:21|     2    |
|2016-05-11 18:58:25|     7    |
|2016-06-11 11:18:29|     6    |
|2016-07-01 12:05:21|     3    |
|2016-08-11 11:58:25|     2    |
|2016-09-11 15:18:29|     9    |

I would like to create a new column named count which counts in a window(-2, 0) (current row and previous two) how many values are > 5 (in the first two rows where I cannot perform the operation I would put 0).

The resulting table should be:

|     Timestamp     |   info   |   count  |
 ------------------- ---------- ---------- 
|2016-01-01 17:54:30|     8    |     0    |
|2016-02-01 12:16:18|     2    |     0    |
|2016-03-01 12:17:57|     1    |     1    |
|2016-04-01 10:05:21|     2    |     0    |
|2016-05-11 18:58:25|     7    |     1    |
|2016-06-11 11:18:29|     6    |     2    |
|2016-07-01 12:05:21|     3    |     2    |
|2016-08-11 11:58:25|     2    |     1    |
|2016-09-11 15:18:29|     9    |     1    |

I tried to do this but it didn't work:

w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn("count", F.when((F.count("info").over(w) > 5), F.count("info").over(w) > 5).otherwise(0))

CodePudding user response:

The following would work if you don't mind calculations performed for the first 2 rows.

w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn('count', F.count(F.when(F.col('info') > 5, 1)).over(w))

df_input.show()
#  ------------------- ---- ----- 
# |          Timestamp|info|count|
#  ------------------- ---- ----- 
# |2016-01-01 17:54:30|   8|    1|
# |2016-02-01 12:16:18|   2|    1|
# |2016-03-01 12:17:57|   1|    1|
# |2016-04-01 10:05:21|   2|    0|
# |2016-05-11 18:58:25|   7|    1|
# |2016-06-11 11:18:29|   6|    2|
# |2016-07-01 12:05:21|   3|    2|
# |2016-08-11 11:58:25|   2|    1|
# |2016-09-11 15:18:29|   9|    1|
#  ------------------- ---- ----- 

If you need 2 first rows to be 0, without changing the window, you can use this when condition:

w = Window.orderBy('Timestamp').rowsBetween(-2, 0)
df_input = df_input.withColumn(
    'count',
    F.when(F.size(F.collect_list('info').over(w)) == 3, F.count(F.when(F.col('info') > 5, 1)).over(w))
     .otherwise(0)
)
df_input.show()
#  ------------------- ---- ----- 
# |          Timestamp|info|count|
#  ------------------- ---- ----- 
# |2016-01-01 17:54:30|   8|    0|
# |2016-02-01 12:16:18|   2|    0|
# |2016-03-01 12:17:57|   1|    1|
# |2016-04-01 10:05:21|   2|    0|
# |2016-05-11 18:58:25|   7|    1|
# |2016-06-11 11:18:29|   6|    2|
# |2016-07-01 12:05:21|   3|    2|
# |2016-08-11 11:58:25|   2|    1|
# |2016-09-11 15:18:29|   9|    1|
#  ------------------- ---- ----- 
  • Related