I am new to PySpark and need help with a lookup between two DataFrames.
I have df1 with student data as follows:
+---------+--------+-------------------+
|studentid| course | registration_date |
+---------+--------+-------------------+
|      348|       2|        15-11-2021 |
|      567|       1|        05-11-2021 |
|      595|       3|        15-10-2021 |
|      580|       2|        06-11-2021 |
|      448|       4|        15-09-2021 |
+---------+--------+-------------------+
df2 has info about the registration periods as follows:
+--------+------------+------------+
| period | start_date |  end_date  |
+--------+------------+------------+
|       1| 01-09-2021 | 15-09-2021 |
|       2| 16-09-2021 | 30-09-2021 |
|       3| 01-10-2021 | 15-10-2021 |
|       4| 16-10-2021 | 31-10-2021 |
|       5| 01-11-2021 | 15-11-2021 |
|       6| 16-11-2021 | 30-11-2021 |
+--------+------------+------------+
I need to iterate df1 row by row, take each student's registration_date, and look up the matching period in df2 with the condition df2.start_date <= df1.registration_date <= df2.end_date.
The result should be a new DataFrame as follows:
+---------+--------+-------------------+--------+------------+------------+
|studentid| course | registration_date | period | start_date |  end_date  |
+---------+--------+-------------------+--------+------------+------------+
|      348|       2|        15-11-2021 |       5| 01-11-2021 | 15-11-2021 |
|      567|       1|        05-11-2021 |       5| 01-11-2021 | 15-11-2021 |
|      595|       3|        15-10-2021 |       3| 01-10-2021 | 15-10-2021 |
|      580|       2|        06-11-2021 |       5| 01-11-2021 | 15-11-2021 |
|      448|       4|        15-09-2021 |       1| 01-09-2021 | 15-09-2021 |
+---------+--------+-------------------+--------+------------+------------+
CodePudding user response:
You can specify the join criteria as a complex condition, so there is no need to iterate df1 row by row.
Working Example
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) a SparkSession; already defined in a pyspark shell or notebook
spark = SparkSession.builder.getOrCreate()

# Student data; cast the parsed timestamps down to DateType
df = spark.createDataFrame([
    (348, 2, datetime.strptime("15-11-2021", "%d-%m-%Y")),
    (567, 1, datetime.strptime("05-11-2021", "%d-%m-%Y")),
    (595, 3, datetime.strptime("15-10-2021", "%d-%m-%Y")),
    (580, 2, datetime.strptime("06-11-2021", "%d-%m-%Y")),
    (448, 4, datetime.strptime("15-09-2021", "%d-%m-%Y")),
], ("studentid", "course", "registration_date"))
df = df.withColumn("registration_date", F.to_date(F.col("registration_date")))
# Registration periods, one row per date range
df2 = spark.createDataFrame([
    (1, datetime.strptime("01-09-2021", "%d-%m-%Y"), datetime.strptime("15-09-2021", "%d-%m-%Y")),
    (2, datetime.strptime("16-09-2021", "%d-%m-%Y"), datetime.strptime("30-09-2021", "%d-%m-%Y")),
    (3, datetime.strptime("01-10-2021", "%d-%m-%Y"), datetime.strptime("15-10-2021", "%d-%m-%Y")),
    (4, datetime.strptime("16-10-2021", "%d-%m-%Y"), datetime.strptime("31-10-2021", "%d-%m-%Y")),
    (5, datetime.strptime("01-11-2021", "%d-%m-%Y"), datetime.strptime("15-11-2021", "%d-%m-%Y")),
    (6, datetime.strptime("16-11-2021", "%d-%m-%Y"), datetime.strptime("30-11-2021", "%d-%m-%Y")),
], ("period", "start_date", "end_date"))
df2 = df2.withColumn("start_date", F.to_date(F.col("start_date"))) \
         .withColumn("end_date", F.to_date(F.col("end_date")))
# Non-equi (range) join: each registration_date must fall inside a period
df.join(
    df2,
    (df2["start_date"] <= df["registration_date"])
    & (df["registration_date"] <= df2["end_date"]),
).show()
Output
+---------+------+-----------------+------+----------+----------+
|studentid|course|registration_date|period|start_date|  end_date|
+---------+------+-----------------+------+----------+----------+
|      348|     2|       2021-11-15|     5|2021-11-01|2021-11-15|
|      567|     1|       2021-11-05|     5|2021-11-01|2021-11-15|
|      595|     3|       2021-10-15|     3|2021-10-01|2021-10-15|
|      448|     4|       2021-09-15|     1|2021-09-01|2021-09-15|
|      580|     2|       2021-11-06|     5|2021-11-01|2021-11-15|
+---------+------+-----------------+------+----------+----------+
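As an aside: df2 only has a handful of rows, so you can hint Spark to broadcast it and avoid shuffling the larger student table. This is a minimal optional variant of the same join, reusing the df and df2 built above:
from pyspark.sql import functions as F

# Broadcast the small periods table; the range condition is unchanged
result = df.join(
    F.broadcast(df2),
    (df2["start_date"] <= df["registration_date"])
    & (df["registration_date"] <= df2["end_date"]),
)
result.show()
If a registration_date might fall outside every period, pass how="left" to join() so those students are kept with null period columns instead of being dropped.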