Use python variables aggregating Spark dataframe

Time:07-02

I have a dataframe and a list of variables which is called customerids:

customerid        createdate         personid  birthday    genders  lastupdated        locale  parentid           profiles  providers
1028598965607080  2022-06-20 15:03:  2bfe51b1  null        []       2022-06-20 15:03:  {}      00000000-0000-000  74c302ef  ba337d36
1020304050607099  2022-06-20 15:03:  04919240  1991-09-08  []       2022-06-20 15:03:  {}      00000000-0000-000  2ee40b26  71024633
8423413884965465  2022-06-20 15:03:  3a97d280  null        []       2022-06-20 15:03:  {}      00000000-0000-000  29d3a7b0  77973479
8948423132187895  2022-06-20 15:03:  4d6122b9  1991-12-23  []       2022-06-20 15:03:  {}      00000000-0000-000  39f0ec37  c2774f60
1028598965607080  2022-06-20 15:03:  6e55e9f8  1995-09-15  []       2022-06-20 15:03:  {}      00000000-0000-000  01d9ba4e  cf6b749a
1028598965607080  2022-06-20 15:03:  be76de55  1965-08-05  []       2022-06-20 15:03:  {}      00000000-0000-000  896a97e6  d441f25f
8798432154654555  2022-06-20 15:03:  c9118c73  null        []       2022-06-20 15:03:  {}      00000000-0000-000  01598fd4  648d5915
1231897897421258  2022-06-20 15:03:  e4893ee2  1995-04-12  []       2022-06-20 15:03:  {}      00000000-0000-000  cf31c1a3  f0290168

List of variables:

customerids = [1028598965607080, 1020304050607099, 8423413884965465, 8948423132187895, 8798432154654555, 1231897897421258]

Other variable: date = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S') - timedelta(days=1)
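For reference, the `date` expression can be evaluated on its own like this (the real `date_str` value is not shown in the question, so the one below is a made-up sample):

```python
from datetime import datetime, timedelta

# Hypothetical input; the question does not show the real date_str.
date_str = '2022-06-21 12:00:00'

# Parse the string and subtract one day, exactly as in the question.
date = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S') - timedelta(days=1)
print(date)  # 2022-06-20 12:00:00
```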

Basically, I need to run the query below against the dataframe, using the date and customerids variables, to generate a new dataframe with a count column:

SELECT customerid, count(1) FROM dataframe WHERE createdate < '{date} 21:00:00' AND customerid IN {customerids} GROUP BY customerid

How can I do this?

Expected output:

customerid count
1028598965607080 3
1020304050607099 1
8423413884965465 1
8948423132187895 1
8798432154654555 1
1231897897421258 1
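The counting logic behind the expected output can be checked in plain Python with collections.Counter, assuming every createdate in the sample rows falls before the cutoff (as it does in the data above):

```python
from collections import Counter

customerids = [1028598965607080, 1020304050607099, 8423413884965465,
               8948423132187895, 8798432154654555, 1231897897421258]

# The customerid column of the eight sample rows; with the cutoff in the
# future, all of them pass the createdate filter.
rows = [1028598965607080, 1020304050607099, 8423413884965465,
        8948423132187895, 1028598965607080, 1028598965607080,
        8798432154654555, 1231897897421258]

# Keep only rows whose customerid is in the list, then count per id.
counts = Counter(cid for cid in rows if cid in set(customerids))
print(counts[1028598965607080])  # 3
```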

CodePudding user response:

Dataframe:

from datetime import datetime, timedelta
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1028598965607080, '2022-06-20 15:03:01'),
     (1020304050607099, '2022-06-20 15:03:01'),
     (8423413884965465, '2022-06-20 15:03:01'),
     (8948423132187895, '2022-06-20 15:03:01'),
     (1028598965607080, '2022-06-20 15:03:01'),
     (1028598965607080, '2022-06-20 15:03:01'),
     (8798432154654555, '2022-06-20 15:03:01'),
     (1231897897421258, '2022-06-20 15:03:01')],
    ['customerid', 'createdate'])

Your date variable depends on another variable, date_str, which you didn't show. So I created a sample value myself.

date_str = '2022-07-01 12:58:58'

customerids = [1028598965607080, 1020304050607099, 8423413884965465, 8948423132187895, 8798432154654555, 1231897897421258]
date = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S') - timedelta(days=1)

ts = f"{date.year}, {date.month}, {date.day}, 21, 0, 0"
df = (df.filter(F.col('customerid').isin(customerids) &
                (F.col('createdate') < F.expr(f"make_timestamp({ts})")))
        .groupBy('customerid')
        .count()
)
df.show()
#  ---------------- ----- 
# |      customerid|count|
#  ---------------- ----- 
# |1028598965607080|    3|
# |1020304050607099|    1|
# |8423413884965465|    1|
# |8948423132187895|    1|
# |8798432154654555|    1|
# |1231897897421258|    1|
#  ---------------- ----- 
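An equivalent filter can also be written without make_timestamp by formatting the cutoff as a plain string: comparing 'yyyy-MM-dd HH:mm:ss' strings lexicographically gives the same ordering as comparing the timestamps they represent. A minimal sketch, reusing the sample date_str from above:

```python
from datetime import datetime, timedelta

date_str = '2022-07-01 12:58:58'  # sample value, as in the answer above
date = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S') - timedelta(days=1)

# Format the cutoff once as a string; for 'yyyy-MM-dd HH:mm:ss' values,
# string comparison is equivalent to timestamp comparison.
cutoff = f"{date:%Y-%m-%d} 21:00:00"
print(cutoff)  # 2022-06-30 21:00:00

# With the cutoff string, the filter in the answer could be written as
# (untested sketch; requires an active SparkSession named `spark` and the
# df/customerids defined above):
# df.filter(F.col('customerid').isin(customerids) &
#           (F.col('createdate') < cutoff)) \
#   .groupBy('customerid').count()
```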