Use python variables aggregating Spark dataframe

Time:07-02

I have a dataframe and a list of variables which is called customerids:

customerid        createdate         personid  birthday    genders  lastupdated        locale  parentid           profiles  providers
1028598965607080  2022-06-20 15:03:  2bfe51b1  null        []       2022-06-20 15:03:  {}      00000000-0000-000  74c302ef  ba337d36
1020304050607099  2022-06-20 15:03:  04919240  1991-09-08  []       2022-06-20 15:03:  {}      00000000-0000-000  2ee40b26  71024633
8423413884965465  2022-06-20 15:03:  3a97d280  null        []       2022-06-20 15:03:  {}      00000000-0000-000  29d3a7b0  77973479
8948423132187895  2022-06-20 15:03:  4d6122b9  1991-12-23  []       2022-06-20 15:03:  {}      00000000-0000-000  39f0ec37  c2774f60
1028598965607080  2022-06-20 15:03:  6e55e9f8  1995-09-15  []       2022-06-20 15:03:  {}      00000000-0000-000  01d9ba4e  cf6b749a
1028598965607080  2022-06-20 15:03:  be76de55  1965-08-05  []       2022-06-20 15:03:  {}      00000000-0000-000  896a97e6  d441f25f
8798432154654555  2022-06-20 15:03:  c9118c73  null        []       2022-06-20 15:03:  {}      00000000-0000-000  01598fd4  648d5915
1231897897421258  2022-06-20 15:03:  e4893ee2  1995-04-12  []       2022-06-20 15:03:  {}      00000000-0000-000  cf31c1a3  f0290168

List of variables:

customerids = [1028598965607080, 1020304050607099, 8423413884965465, 8948423132187895, 8798432154654555, 1231897897421258]

Other variable: date = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S') - timedelta(days=1)
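For reference, the `date` expression can be evaluated on its own like this (the real `date_str` value is not shown in the question, so the one below is a made-up sample):

```python
from datetime import datetime, timedelta

# Hypothetical input; the question does not show the real date_str.
date_str = '2022-06-21 12:00:00'

# Parse the string and subtract one day, exactly as in the question.
date = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S') - timedelta(days=1)
print(date)  # 2022-06-20 12:00:00
```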

Basically, I need to run the query below against the dataframe, using the date and customerids variables, to generate a new dataframe with a count column:

SELECT customerid, count(1) FROM dataframe WHERE createdate < '{date} 21:00:00' AND customerid IN {customerids} GROUP BY customerid

How can I do this?

Expected output:

customerid count
1028598965607080 3
1020304050607099 1
8423413884965465 1
8948423132187895 1
8798432154654555 1
1231897897421258 1
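The counting logic behind the expected output can be checked in plain Python with collections.Counter, assuming every createdate in the sample rows falls before the cutoff (as it does in the data above):

```python
from collections import Counter

customerids = [1028598965607080, 1020304050607099, 8423413884965465,
               8948423132187895, 8798432154654555, 1231897897421258]

# The customerid column of the eight sample rows; with the cutoff in the
# future, all of them pass the createdate filter.
rows = [1028598965607080, 1020304050607099, 8423413884965465,
        8948423132187895, 1028598965607080, 1028598965607080,
        8798432154654555, 1231897897421258]

# Keep only rows whose customerid is in the list, then count per id.
counts = Counter(cid for cid in rows if cid in set(customerids))
print(counts[1028598965607080])  # 3
```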

CodePudding user response:

Dataframe:

from datetime import datetime, timedelta
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1028598965607080, '2022-06-20 15:03:01'),
     (1020304050607099, '2022-06-20 15:03:01'),
     (8423413884965465, '2022-06-20 15:03:01'),
     (8948423132187895, '2022-06-20 15:03:01'),
     (1028598965607080, '2022-06-20 15:03:01'),
     (1028598965607080, '2022-06-20 15:03:01'),
     (8798432154654555, '2022-06-20 15:03:01'),
     (1231897897421258, '2022-06-20 15:03:01')],
    ['customerid', 'createdate'])

Your date variable depends on another variable, date_str, which you didn't show. So I created a sample value myself.

date_str = '2022-07-01 12:58:58'

customerids = [1028598965607080, 1020304050607099, 8423413884965465, 8948423132187895, 8798432154654555, 1231897897421258]
date = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S') - timedelta(days=1)

ts = f"{date.year}, {date.month}, {date.day}, 21, 0, 0"
df = (df.filter(F.col('customerid').isin(customerids) &
                (F.col('createdate') < F.expr(f"make_timestamp({ts})")))
        .groupBy('customerid')
        .count()
)
df.show()
#  ---------------- ----- 
# |      customerid|count|
#  ---------------- ----- 
# |1028598965607080|    3|
# |1020304050607099|    1|
# |8423413884965465|    1|
# |8948423132187895|    1|
# |8798432154654555|    1|
# |1231897897421258|    1|
#  ---------------- ----- 
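An equivalent filter can also be written without make_timestamp by formatting the cutoff as a plain string: comparing 'yyyy-MM-dd HH:mm:ss' strings lexicographically gives the same ordering as comparing the timestamps they represent. A minimal sketch, reusing the sample date_str from above:

```python
from datetime import datetime, timedelta

date_str = '2022-07-01 12:58:58'  # sample value, as in the answer above
date = datetime.strptime(date_str, '%Y-%m-%d %H:%M:%S') - timedelta(days=1)

# Format the cutoff once as a string; for 'yyyy-MM-dd HH:mm:ss' values,
# string comparison is equivalent to timestamp comparison.
cutoff = f"{date:%Y-%m-%d} 21:00:00"
print(cutoff)  # 2022-06-30 21:00:00

# With the cutoff string, the filter in the answer could be written as
# (untested sketch; requires an active SparkSession named `spark` and the
# df/customerids defined above):
# df.filter(F.col('customerid').isin(customerids) &
#           (F.col('createdate') < cutoff)) \
#   .groupBy('customerid').count()
```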