How to split data by date and predict using sklift (scikit-uplift)?

I am using sklift (the scikit-uplift package, which follows the scikit-learn API) to develop an uplift model (Solo Model). I want to split the data into training and validation sets, where each partition has X, y, and treatment columns. The treatment here is whether a user received a notification (boolean), and y is whether the user converted (boolean).

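import pandas as pd
from sklearn.model_selection import train_test_split

# stratify on the treatment/outcome pair so both splits keep the same mix
stratify_cols = pd.concat([df.notification_flag, df.converted_flag], axis=1)
df = df.drop(['notification_flag', 'converted_flag'], axis=1)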

X_train, X_val, trmnt_train, trmnt_val, y_train, y_val = train_test_split(
    df,
    stratify_cols.notification_flag,
    stratify_cols.converted_flag,
    stratify=stratify_cols,
    test_size=0.3,
    random_state=42
)

How do I use all the data up to 2022-01-01 in my data frame (df) for training and everything after that date for validation?
How do I predict on a new, unseen dataset and return three columns: the uplift, the conversion probability if treated, and the counterfactual conversion probability if untreated? For example: 3% uplift / 30% conversion probability if treated / 27% conversion probability if not treated.

Answer:

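Convert your date column to datetime, set it as the index (a pandas DatetimeIndex), and slice by date with .loc. Label-based slices include both endpoints, so df.loc[:'2022-01-01'] keeps 2022-01-01 itself in the training set.

To convert, you can use: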

df["date"] = pd.to_datetime(df["date"])
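If you would rather keep the date as an ordinary column, a boolean mask does the same job and gives you the X, treatment, and y pieces directly. This is only a minimal sketch on made-up data: the `date` column name and the two feature columns are assumptions (the question does not name them), while `notification_flag` and `converted_flag` come from the question.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real frame; "date", "feature_a" and
# "feature_b" are assumed names used only for illustration.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=n, freq="D"),
    "feature_a": rng.random(n),
    "feature_b": rng.random(n),
    "notification_flag": rng.integers(0, 2, n),
    "converted_flag": rng.integers(0, 2, n),
})

cutoff = pd.Timestamp("2022-01-01")
is_train = df["date"] <= cutoff  # training: everything up to and including the cutoff

feature_cols = ["feature_a", "feature_b"]  # the date itself is not used as a feature

X_train = df.loc[is_train, feature_cols]
X_val = df.loc[~is_train, feature_cols]
trmnt_train = df.loc[is_train, "notification_flag"]
trmnt_val = df.loc[~is_train, "notification_flag"]
y_train = df.loc[is_train, "converted_flag"]
y_val = df.loc[~is_train, "converted_flag"]

print(X_train.shape, X_val.shape)
```

If the column holds times as well as dates, comparing against `< pd.Timestamp("2022-01-02")` keeps the whole of 2022-01-01 in the training set. Note that a time-based split like this replaces the random, stratified `train_test_split` from the question, so the treatment/conversion mix may differ between the two sets.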

Example of slicing on the index:

import numpy as np
import pandas as pd

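# generate a random DataFrame
df = pd.DataFrame(np.random.random((500, 3)))

# generate 500 consecutive daily dates and use them as the index
df['date'] = pd.date_range('2021-1-1', periods=500, freq='D')
df = df.set_index(['date'])

# label slices are inclusive: training ends on 2022-01-01, validation starts the next day
train_df = df.loc[:'2022-01-01']
val_df = df.loc['2022-01-02':]

print(train_df.tail())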

                   0         1         2
date                                    
2021-12-28  0.027423  0.740380  0.606964
2021-12-29  0.609302  0.602346  0.812362
2021-12-30  0.171841  0.250788  0.182188
2021-12-31  0.322778  0.287429  0.585201
2022-01-01  0.014228  0.798382  0.769986
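
For the second question: a Solo Model is a single classifier trained with the treatment flag as an extra feature. The two probabilities therefore come from scoring the same rows twice, once with the flag forced to 1 and once with it forced to 0, and the uplift is their difference. Below is a sketch of that idea written with plain scikit-learn rather than sklift, on synthetic data. `LogisticRegression`, the feature names, and the `predict_uplift_table` helper are my own choices for illustration, not anything from the question. As far as I know, sklift's `SoloModel` stores the same two vectors as `trmnt_preds_` and `ctrl_preds_` after `predict()`, but check that against the docs for your installed version.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic training data; feature names are assumptions for illustration.
rng = np.random.default_rng(42)
n = 2000
X = pd.DataFrame({"feature_a": rng.random(n), "feature_b": rng.random(n)})
treatment = pd.Series(rng.integers(0, 2, n), name="notification_flag")

# Synthetic outcome: conversion is somewhat more likely when notified.
p = 0.2 + 0.3 * X["feature_a"] + 0.1 * treatment
y = pd.Series((rng.random(n) < p).astype(int), name="converted_flag")

# Solo Model idea: one classifier, treatment flag appended as a feature.
model = LogisticRegression()
model.fit(X.assign(notification_flag=treatment), y)


def predict_uplift_table(model, X_new):
    """Hypothetical helper: score each row as treated and as untreated."""
    p_treated = model.predict_proba(X_new.assign(notification_flag=1))[:, 1]
    p_control = model.predict_proba(X_new.assign(notification_flag=0))[:, 1]
    return pd.DataFrame(
        {
            "uplift": p_treated - p_control,
            "prob_if_treated": p_treated,
            "prob_if_untreated": p_control,
        },
        index=X_new.index,
    )


# New, unseen rows: only the features are needed, not the treatment or outcome.
X_new = pd.DataFrame({"feature_a": [0.1, 0.9], "feature_b": [0.5, 0.5]})
result = predict_uplift_table(model, X_new)
print(result)
```

Each row of `result` then reads like the example in the question: uplift, conversion probability if treated, conversion probability if untreated.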