I am very impressed by the Polars library and am trying to learn it better. :)
Now I am trying to calculate the number of days between two dates in Polars for millions of rows, with the condition that for some rows I need to exclude certain weekdays. In Pandas/NumPy I have used np.busday_count, where I can define a weekmask of which weekdays to count per condition and also exclude holidays when needed.
I'm having difficulty counting the days with conditions in a fast way, as I can't figure out how to do this as an expression.
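For anyone unfamiliar, here is a minimal illustration of the NumPy function I mean (dates are arbitrary; each weekmask character enables one weekday, Monday first):

```python
import numpy as np

# Count Mon-Fri between Mon 2022-01-03 (inclusive) and Mon 2022-01-10 (exclusive)
n = np.busday_count("2022-01-03", "2022-01-10", weekmask="1111100")
print(n)  # 5

# The same span counting only Mon/Tue/Wed
m = np.busday_count("2022-01-03", "2022-01-10", weekmask="1110000")
print(m)  # 3
```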
Example dataframe:
import numpy as np
import polars as pl

df = (
    pl.DataFrame({
        "Market": ["AT", "DE", "AT", "CZ", "GB", "CZ"],
        "Service": ["Standard", "Express", "Standard", "Standard", "Standard", "Standard"],
        "Day1": ["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07"],
        "Day2": ["2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07", "2022-01-08"],
    })
    .with_columns(pl.col(["Day1", "Day2"]).str.strptime(pl.Date, "%Y-%m-%d"))
)
I was able to pass the data to np.busday_count through a struct and the apply method. However, execution is much slower on the real dataset (34.4 seconds) compared to Pandas assign (262 ms).
Below is the code I was able to come up with in Polars. I'm looking for an optimized way of doing this more quickly.
(df
.with_column(
pl.struct([pl.col("Day1"), pl.col("Day2")])
.apply(lambda x: np.busday_count(x["Day1"], x["Day2"], weekmask='1110000'))
.alias("Result"))
)
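For reference, the Pandas version I'm comparing against looks roughly like this (a sketch rather than my exact code; column names match the example above):

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({
    "Day1": pd.to_datetime(["2022-01-02", "2022-01-03"]),
    "Day2": pd.to_datetime(["2022-01-03", "2022-01-04"]),
})

# np.busday_count needs datetime64[D]; pandas datetimes are datetime64[ns]
pdf = pdf.assign(
    Result=np.busday_count(
        pdf["Day1"].values.astype("datetime64[D]"),
        pdf["Day2"].values.astype("datetime64[D]"),
        weekmask="1110000",
    )
)
print(pdf["Result"].tolist())  # [0, 1]
```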
EDIT, expected output:
┌────────┬──────────┬────────────┬────────────┬────────┐
│ Market ┆ Service ┆ Day1 ┆ Day2 ┆ Result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ date ┆ date ┆ i64 │
╞════════╪══════════╪════════════╪════════════╪════════╡
│ AT ┆ Standard ┆ 2022-01-02 ┆ 2022-01-03 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ DE ┆ Express ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ AT ┆ Standard ┆ 2022-01-04 ┆ 2022-01-05 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ CZ ┆ Standard ┆ 2022-01-05 ┆ 2022-01-06 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ GB ┆ Standard ┆ 2022-01-06 ┆ 2022-01-07 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ CZ ┆ Standard ┆ 2022-01-07 ┆ 2022-01-08 ┆ 0 │
└────────┴──────────┴────────────┴────────────┴────────┘
CodePudding user response:
I think you should be able to:
>>> df.select(
... np.busday_count(
... pl.col("Day1"),
... pl.col("Day2"),
... weekmask="1110000"
... )
... )
It currently raises an exception though, which is perhaps a bug:
ValueError: Could not convert object to NumPy datetime
You can pass pl.Series objects directly to numpy:
>>> np.busday_count(
... df.get_column("Day1"),
... df.get_column("Day2"),
... weekmask="1110000"
... )
array([0, 1, 1, ..., 1, 0, 0])
pl.from_numpy() creates a DataFrame, so I think creating a Series is the simplest way to store the result as a column:
%%time
result = pl.Series(
np.busday_count(
df.get_column("Day1"),
df.get_column("Day2"),
weekmask="1110000"
)
)
>>> df.with_column(result.alias("Result"))
CPU times: user 22.9 ms, sys: 10.7 ms, total: 33.6 ms
Wall time: 32.8 ms
shape: (1200000, 5)
┌────────┬──────────┬────────────┬────────────┬────────┐
│ Market ┆ Service ┆ Day1 ┆ Day2 ┆ Result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ date ┆ date ┆ i64 │
╞════════╪══════════╪════════════╪════════════╪════════╡
│ AT ┆ Standard ┆ 2022-01-02 ┆ 2022-01-03 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ DE ┆ Express ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ AT ┆ Standard ┆ 2022-01-04 ┆ 2022-01-05 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ CZ ┆ Standard ┆ 2022-01-05 ┆ 2022-01-06 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ GB ┆ Standard ┆ 2022-01-06 ┆ 2022-01-07 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
...
CodePudding user response:
When you use apply in the select context, you will be creating a Python dictionary and feeding that to your lambda for every element in the list. This is expensive.
You can utilize vectorization by using map instead of apply. That way we can send whole columns at a time to NumPy's busday_count.
(df
.with_column(
pl.struct([pl.col("Day1"), pl.col("Day2")])
.map(lambda x: np.busday_count(x.struct.field("Day1"), x.struct.field("Day2"), weekmask='1110000'))
.alias("Result"))
)
CodePudding user response:
I ended up doing the following for this scenario, inspired by jqurious' answer.
First I created a function that takes a weekday mask and a holiday country code, so it knows which bank holidays to exclude for each row (the holidays come from another dataframe with country code and bank holiday columns).
def calculate_transit_days(holiday="None", transit_days="1111100"):
# If no market input, then holiday exceptions are not included and delivery days counted Mon - Fri by default
if holiday == "None":
return pl.Series(np.busday_count(df.get_column("Shipped Time").cast(pl.Date),
df.get_column("First Attempt").cast(pl.Date),
weekmask=transit_days))
else:
return pl.Series(np.busday_count(df.get_column("Shipped Time").cast(pl.Date),
df.get_column("First Attempt").cast(pl.Date),
weekmask=transit_days,
holidays=(holidays
.filter((pl.col("Market") == holiday) & (pl.col("Applies To Delivery") == True))
.select(pl.col("Holiday Date"))
.collect()
.to_series()
.to_numpy())))
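The holidays argument this relies on simply removes the listed dates from the count; a standalone illustration (the holiday date here is just an example):

```python
import numpy as np

# Mon 2022-04-11 .. Fri 2022-04-15 is 5 business days with a Mon-Fri weekmask;
# listing Good Friday 2022-04-15 as a holiday drops it from the count
n = np.busday_count("2022-04-11", "2022-04-18",
                    weekmask="1111100",
                    holidays=["2022-04-15"])
print(n)  # 4
```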
Then I used a nested when/then with the function to do the mapping as needed.
(df
# Below defines which weekdays are being calculated for each market as transit days. E.g. "1111110" would mean Mon - Sat are counted.
.with_columns(pl.when((pl.col("Market") == "UK") & (pl.col("Shipped Time").dt.weekday() == 7)).then(calculate_transit_days(holiday="UK", transit_days="1111110") + 1)
.when(pl.col("Market") == "UK").then(calculate_transit_days(holiday="UK", transit_days="1111110"))
.when(pl.col("Market") == "AT").then(calculate_transit_days(holiday="AT", transit_days="1111100"))
.when(pl.col("Market") == "DE").then(calculate_transit_days(holiday="DE", transit_days="1111100"))
.when(pl.col("Market") == "DK").then(calculate_transit_days(holiday="DK", transit_days="1111100"))
.when(pl.col("Market") == "ES").then(calculate_transit_days(holiday="ES", transit_days="1111100"))
.when(pl.col("Market") == "FR").then(calculate_transit_days(holiday="FR", transit_days="1111100"))
.when(pl.col("Market") == "IT").then(calculate_transit_days(holiday="IT", transit_days="1111100"))
.when(pl.col("Market") == "NL").then(calculate_transit_days(holiday="NL", transit_days="1111100"))
.when(pl.col("Market") == "PL").then(calculate_transit_days(holiday="PL", transit_days="1111100"))
.when(pl.col("Market") == "SE").then(calculate_transit_days(holiday="SE", transit_days="1111100"))
.when(pl.col("Market") == "US").then(calculate_transit_days(holiday="US", transit_days="1111100"))
.when(pl.col("Market") == "CA").then(calculate_transit_days(holiday="CA", transit_days="1111100"))
.otherwise(calculate_transit_days()).alias("Transit Days"))
)
Maybe not the prettiest solution, but it seemed to run fast enough for me (3 seconds for 5 million rows). :) The only problem with the code above is that it doesn't work in lazy mode due to the get_column calls in the function. I hope someday it works with pl.col("ColName") as well.
Anyway, an excellent library that I will focus on in the future!