I have a dataframe and a list of variables called customerids:
customerid | createdate | personid | birthday | genders | lastupdated | locale | parentid | profiles | providers |
---|---|---|---|---|---|---|---|---|---|
1028598965607080 | 2022-06-20 15:03: | 2bfe51b1 | null | [] | 2022-06-20 15:03: | {} | 00000000-0000-000 | 74c302ef | ba337d36 |
1020304050607099 | 2022-06-20 15:03: | 04919240 | 1991-09-08 | [] | 2022-06-20 15:03: | {} | 00000000-0000-000 | 2ee40b26 | 71024633 |
8423413884965465 | 2022-06-20 15:03: | 3a97d280 | null | [] | 2022-06-20 15:03: | {} | 00000000-0000-000 | 29d3a7b0 | 77973479 |
8948423132187895 | 2022-06-20 15:03: | 4d6122b9 | 1991-12-23 | [] | 2022-06-20 15:03: | {} | 00000000-0000-000 | 39f0ec37 | c2774f60 |
1028598965607080 | 2022-06-20 15:03: | 6e55e9f8 | 1995-09-15 | [] | 2022-06-20 15:03: | {} | 00000000-0000-000 | 01d9ba4e | cf6b749a |
1028598965607080 | 2022-06-20 15:03: | be76de55 | 1965-08-05 | [] | 2022-06-20 15:03: | {} | 00000000-0000-000 | 896a97e6 | d441f25f |
8798432154654555 | 2022-06-20 15:03: | c9118c73 | null | [] | 2022-06-20 15:03: | {} | 00000000-0000-000 | 01598fd4 | 648d5915 |
1231897897421258 | 2022-06-20 15:03: | e4893ee2 | 1995-04-12 | [] | 2022-06-20 15:03: | {} | 00000000-0000-000 | cf31c1a3 | f0290168 |
List of variables:
customerids = [1028598965607080, 1020304050607099, 8423413884965465, 8948423132187895, 8798432154654555, 1231897897421258]
Other variable:
date = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S') - timedelta(days=1)
Basically, I need to run the select below on the dataframe, using the date and customerids variables, to generate a new dataframe with a count column:
SELECT customerid, count(1) FROM dataframe WHERE createdate < '{date} 21:00:00' AND customerid IN ({customerids}) GROUP BY customerid
How can I do this?
Expected output:
customerid | count |
---|---|
1028598965607080 | 3 |
1020304050607099 | 1 |
8423413884965465 | 1 |
8948423132187895 | 1 |
8798432154654555 | 1 |
1231897897421258 | 1 |
CodePudding user response:
Dataframe:
from datetime import datetime, timedelta
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(1028598965607080, '2022-06-20 15:03:01'),
(1020304050607099, '2022-06-20 15:03:01'),
(8423413884965465, '2022-06-20 15:03:01'),
(8948423132187895, '2022-06-20 15:03:01'),
(1028598965607080, '2022-06-20 15:03:01'),
(1028598965607080, '2022-06-20 15:03:01'),
(8798432154654555, '2022-06-20 15:03:01'),
(1231897897421258, '2022-06-20 15:03:01')],
['customerid', 'createdate'])
Your date variable depends on another variable (date_str), so I created one myself:
date_str = '2022-07-01 12:58:58'
customerids = [1028598965607080, 1020304050607099, 8423413884965465, 8948423132187895, 8798432154654555, 1231897897421258]
date = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S') - timedelta(days=1)
ts = f"{date.year}, {date.month}, {date.day}, 21, 0, 0"
df = (df.filter(F.col('customerid').isin(customerids) &
(F.col('createdate') < F.expr(f"make_timestamp({ts})")))
.groupBy('customerid')
.count()
)
df.show()
# +----------------+-----+
# |      customerid|count|
# +----------------+-----+
# |1028598965607080|    3|
# |1020304050607099|    1|
# |8423413884965465|    1|
# |8948423132187895|    1|
# |8798432154654555|    1|
# |1231897897421258|    1|
# +----------------+-----+
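As a sketch of an alternative, assuming createdate is stored as a 'YYYY-MM-DD HH:MM:SS' string (as in the sample data), you can build the cutoff as a plain string with datetime.replace and skip make_timestamp entirely, since lexicographic comparison in this fixed-width format orders the same way as timestamps:

```python
from datetime import datetime, timedelta

date_str = '2022-07-01 12:58:58'  # sample value, same as above

# Go back one day, then shift the cutoff to 21:00:00 on that day and
# format it back into the same string layout used by createdate.
date = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S') - timedelta(days=1)
cutoff = date.replace(hour=21, minute=0, second=0).strftime('%Y-%m-%d %H:%M:%S')
print(cutoff)  # 2022-06-30 21:00:00
```

The filter then becomes df.filter(F.col('customerid').isin(customerids) & (F.col('createdate') < cutoff)), with no F.expr needed. If createdate is a real timestamp column instead of a string, stick with the make_timestamp approach above.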