I am very impressed by the Polars library and am trying to learn it better. :)
Now I am trying to calculate the number of days between two dates in Polars for millions of rows, with the condition that for some rows I need to exclude certain weekdays. In Pandas/NumPy I have used np.busday_count, where I can define a weekmask of which weekdays to count per condition and also exclude holidays when needed.
I'm having difficulty counting the days with conditions in a fast way, as I can't figure out how to do this as an expression.
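For anyone unfamiliar, here is a minimal illustration of the NumPy function I mean (dates are arbitrary; each weekmask character enables one weekday, Monday first):

```python
import numpy as np

# Count Mon-Fri between Mon 2022-01-03 (inclusive) and Mon 2022-01-10 (exclusive)
n = np.busday_count("2022-01-03", "2022-01-10", weekmask="1111100")
print(n)  # 5

# The same span counting only Mon/Tue/Wed
m = np.busday_count("2022-01-03", "2022-01-10", weekmask="1110000")
print(m)  # 3
```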
Example dataframe:
import numpy as np
import polars as pl

df = (
    pl.DataFrame({
        "Market": ["AT", "DE", "AT", "CZ", "GB", "CZ"],
        "Service": ["Standard", "Express", "Standard", "Standard", "Standard", "Standard"],
        "Day1": ["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07"],
        "Day2": ["2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07", "2022-01-08"],
    })
    .with_columns(pl.col(["Day1", "Day2"]).str.strptime(pl.Date, "%Y-%m-%d"))
)
I was able to pass the data to np.busday_count through a struct and the apply method. However, execution is much slower on the real dataset (34.4 seconds) compared to Pandas assign (262 ms).
Below is the code I was able to come up with in Polars. I'm looking for an optimized way of doing this more quickly.
(df
.with_column(
pl.struct([pl.col("Day1"), pl.col("Day2")])
.apply(lambda x: np.busday_count(x["Day1"], x["Day2"], weekmask='1110000'))
.alias("Result"))
)
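For reference, the Pandas version I'm comparing against looks roughly like this (a sketch rather than my exact code; column names match the example above):

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({
    "Day1": pd.to_datetime(["2022-01-02", "2022-01-03"]),
    "Day2": pd.to_datetime(["2022-01-03", "2022-01-04"]),
})

# np.busday_count needs datetime64[D]; pandas datetimes are datetime64[ns]
pdf = pdf.assign(
    Result=np.busday_count(
        pdf["Day1"].values.astype("datetime64[D]"),
        pdf["Day2"].values.astype("datetime64[D]"),
        weekmask="1110000",
    )
)
print(pdf["Result"].tolist())  # [0, 1]
```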
EDIT, expected output:
┌────────┬──────────┬────────────┬────────────┬────────┐
│ Market ┆ Service ┆ Day1 ┆ Day2 ┆ Result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ date ┆ date ┆ i64 │
╞════════╪══════════╪════════════╪════════════╪════════╡
│ AT ┆ Standard ┆ 2022-01-02 ┆ 2022-01-03 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ DE ┆ Express ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ AT ┆ Standard ┆ 2022-01-04 ┆ 2022-01-05 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ CZ ┆ Standard ┆ 2022-01-05 ┆ 2022-01-06 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ GB ┆ Standard ┆ 2022-01-06 ┆ 2022-01-07 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ CZ ┆ Standard ┆ 2022-01-07 ┆ 2022-01-08 ┆ 0 │
└────────┴──────────┴────────────┴────────────┴────────┘
CodePudding user response:
I think you should be able to:
>>> df.select(
... np.busday_count(
... pl.col("Day1"),
... pl.col("Day2"),
... weekmask="1110000"
... )
... )
It currently raises an exception though, which is perhaps a bug:
ValueError: Could not convert object to NumPy datetime
You can pass pl.Series objects directly to numpy:
>>> np.busday_count(
... df.get_column("Day1"),
... df.get_column("Day2"),
... weekmask="1110000"
... )
array([0, 1, 1, ..., 1, 0, 0])
pl.from_numpy() creates a DataFrame, so I think creating a Series is the simplest way to store the result as a column:
%%time
result = pl.Series(
np.busday_count(
df.get_column("Day1"),
df.get_column("Day2"),
weekmask="1110000"
)
)
>>> df.with_column(result.alias("Result"))
CPU times: user 22.9 ms, sys: 10.7 ms, total: 33.6 ms
Wall time: 32.8 ms
shape: (1200000, 5)
┌────────┬──────────┬────────────┬────────────┬────────┐
│ Market ┆ Service ┆ Day1 ┆ Day2 ┆ Result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ date ┆ date ┆ i64 │
╞════════╪══════════╪════════════╪════════════╪════════╡
│ AT ┆ Standard ┆ 2022-01-02 ┆ 2022-01-03 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ DE ┆ Express ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ AT ┆ Standard ┆ 2022-01-04 ┆ 2022-01-05 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ CZ ┆ Standard ┆ 2022-01-05 ┆ 2022-01-06 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ GB ┆ Standard ┆ 2022-01-06 ┆ 2022-01-07 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
...
CodePudding user response:
When you use apply in the select context, you will be creating a Python dictionary and feeding that to your lambda for every element in the list. This is expensive.
You can utilize vectorization by using map instead of apply. That way we can send whole columns at a time to NumPy's busday_count.
(df
.with_column(
pl.struct([pl.col("Day1"), pl.col("Day2")])
.map(lambda x: np.busday_count(x.struct.field("Day1"), x.struct.field("Day2"), weekmask='1110000'))
.alias("Result"))
)
CodePudding user response:
I ended up doing the following for this scenario, inspired by jqurious' answer.
First I created a function that takes a weekday mask and a holiday country code, so it knows which bank holidays to exclude for each row (the holidays come from another dataframe with country code and bank holiday columns).
def calculate_transit_days(holiday="None", transit_days="1111100"):
# If no market input, then holiday exceptions are not included and delivery days counted Mon - Fri by default
if holiday == "None":
return pl.Series(np.busday_count(df.get_column("Shipped Time").cast(pl.Date),
df.get_column("First Attempt").cast(pl.Date),
weekmask=transit_days))
else:
return pl.Series(np.busday_count(df.get_column("Shipped Time").cast(pl.Date),
df.get_column("First Attempt").cast(pl.Date),
weekmask=transit_days,
holidays=(holidays
.filter((pl.col("Market") == holiday) & (pl.col("Applies To Delivery") == True))
.select(pl.col("Holiday Date"))
.collect()
.to_series()
.to_numpy())))
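The holidays argument this relies on simply removes the listed dates from the count; a standalone illustration (the holiday date here is just an example):

```python
import numpy as np

# Mon 2022-04-11 .. Fri 2022-04-15 is 5 business days with a Mon-Fri weekmask;
# listing Good Friday 2022-04-15 as a holiday drops it from the count
n = np.busday_count("2022-04-11", "2022-04-18",
                    weekmask="1111100",
                    holidays=["2022-04-15"])
print(n)  # 4
```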
Then I used a nested when/then with the function to do the mapping as needed.
(df
# Below defines which weekdays are being calculated for each market as transit days. E.g. "1111110" would mean Mon - Sat are counted.
.with_columns(pl.when((pl.col("Market") == "UK") & (pl.col("Shipped Time").dt.weekday() == 7)).then(calculate_transit_days(holiday="UK", transit_days="1111110") + 1)
.when(pl.col("Market") == "UK").then(calculate_transit_days(holiday="UK", transit_days="1111110"))
.when(pl.col("Market") == "AT").then(calculate_transit_days(holiday="AT", transit_days="1111100"))
.when(pl.col("Market") == "DE").then(calculate_transit_days(holiday="DE", transit_days="1111100"))
.when(pl.col("Market") == "DK").then(calculate_transit_days(holiday="DK", transit_days="1111100"))
.when(pl.col("Market") == "ES").then(calculate_transit_days(holiday="ES", transit_days="1111100"))
.when(pl.col("Market") == "FR").then(calculate_transit_days(holiday="FR", transit_days="1111100"))
.when(pl.col("Market") == "IT").then(calculate_transit_days(holiday="IT", transit_days="1111100"))
.when(pl.col("Market") == "NL").then(calculate_transit_days(holiday="NL", transit_days="1111100"))
.when(pl.col("Market") == "PL").then(calculate_transit_days(holiday="PL", transit_days="1111100"))
.when(pl.col("Market") == "SE").then(calculate_transit_days(holiday="SE", transit_days="1111100"))
.when(pl.col("Market") == "US").then(calculate_transit_days(holiday="US", transit_days="1111100"))
.when(pl.col("Market") == "CA").then(calculate_transit_days(holiday="CA", transit_days="1111100"))
.otherwise(calculate_transit_days()).alias("Transit Days"))
)
Maybe not the prettiest solution, but it seemed to run fast enough for me (3 seconds for 5 million rows). :) The only problem with the code above is that it doesn't work in lazy mode due to the get_column calls in the function. I hope someday it works with pl.col("ColName") as well.
Anyway, an excellent library that I will focus on in the future!