I am using scikit-uplift (sklift), which builds on scikit-learn, to develop an uplift model (Solo Model). I am trying to split the data into train and validation sets, where each partition has X, y, and treatment
columns. The treatment here is whether a user got a notification (boolean), and y is whether the user converted (boolean).
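import pandas as pd
from sklearn.model_selection import train_test_split
# Keep treatment and outcome together so the split is stratified on both
stratify_cols = pd.concat([df.notification_flag, df.converted_flag], axis=1)
df = df.drop(['notification_flag', 'converted_flag'], axis=1)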
X_train, X_val, trmnt_train, trmnt_val, y_train, y_val = train_test_split(
df,
stratify_cols.notification_flag,
stratify_cols.converted_flag,
stratify=stratify_cols,
test_size=0.3,
random_state=42
)
How can I use all the data in my data frame (df) up to 2022-01-01 for training and everything after that date for validation?
How do I predict for a new, unseen dataset and return all three columns: the uplift, the conversion probability if treated, and the counterfactual conversion probability if untreated,
i.e. 3% uplift / 30% conversion prob if treated / 27% conversion prob if not treated?
CodePudding user response:
Convert your date column to datetime, set it as the index (which gives you a pandas DatetimeIndex), and slice by date.
To convert you can use:
df["date"] = pd.to_datetime(df["date"])
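As a small illustration of the conversion step, here is a sketch that assumes the dates arrive as strings in a column named date (the sample values and the cutoff variable are made up for this example). It also shows that once the column is a datetime, a plain boolean mask is enough to split on the cutoff, without setting an index:

```python
import pandas as pd

# Hypothetical frame with dates stored as strings, one of them invalid
df = pd.DataFrame({
    "date": ["2021-12-30", "2021-12-31", "2022-01-01", "2022-01-02", "not a date"],
    "converted_flag": [0, 1, 0, 1, 0],
})

# errors="coerce" turns unparseable values into NaT instead of raising
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df = df.dropna(subset=["date"])

cutoff = pd.Timestamp("2022-01-01")
train_df = df[df["date"] <= cutoff]  # up to and including the cutoff
val_df = df[df["date"] > cutoff]     # strictly after the cutoff

print(len(train_df), len(val_df))
```

Whether the cutoff day itself belongs to training or validation is a choice; here it goes to training, matching "up to 2022-01-01" in the question.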
Example:
import numpy as np
import pandas as pd
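# Generate a random df
df = pd.DataFrame(np.random.random((500, 3)))
# Add one date per day, starting 2021-01-01
df['date'] = pd.date_range('2021-1-1', periods=500, freq='D')
df = df.set_index('date')
# Label-based slicing on a DatetimeIndex includes the end date
train_df = df.loc[:'2022-01-01']
# Everything after that date goes to validation
val_df = df.loc['2022-01-02':]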
print(train_df.tail())
                   0         1         2
date
2021-12-28 0.027423 0.740380 0.606964
2021-12-29 0.609302 0.602346 0.812362
2021-12-30 0.171841 0.250788 0.182188
2021-12-31 0.322778 0.287429 0.585201
2022-01-01 0.014228 0.798382 0.769986
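For the second part of the question (returning uplift together with both probabilities), below is a minimal sketch of the solo-model idea written with plain scikit-learn rather than sklift: a single classifier is trained with the treatment flag as an extra feature, then scored twice on new data, once with the flag forced to 1 and once forced to 0. The synthetic data, the feature names (feature_a, feature_b), the choice of LogisticRegression, and the helper predict_uplift are all assumptions made for illustration. As far as I recall, sklift's SoloModel keeps the two per-group predictions in trmnt_preds_ and ctrl_preds_ after calling predict, but please verify that against the library documentation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in for the real data, indexed by date (assumption)
df = pd.DataFrame(
    {
        "feature_a": rng.random(n),
        "feature_b": rng.random(n),
        "notification_flag": rng.integers(0, 2, n),
    },
    index=pd.date_range("2021-01-01", periods=n, freq="D", name="date"),
)
# Conversion depends on one feature and, slightly, on the treatment
p = 0.2 + 0.3 * df["feature_a"] + 0.1 * df["notification_flag"]
df["converted_flag"] = (rng.random(n) < p).astype(int)

# Time-based split: label slicing on a DatetimeIndex includes both endpoints
train_df = df.loc[:"2022-01-01"]
val_df = df.loc["2022-01-02":]

feature_cols = ["feature_a", "feature_b"]
X_train = train_df[feature_cols]
trmnt_train = train_df["notification_flag"]
y_train = train_df["converted_flag"]
X_val = val_df[feature_cols]

# Solo-model idea: one classifier, treatment flag appended as a feature
model = LogisticRegression()
model.fit(X_train.assign(treatment=trmnt_train), y_train)


def predict_uplift(model, X):
    """Score X twice (treated / not treated) and return all three columns."""
    p_treated = model.predict_proba(X.assign(treatment=1))[:, 1]
    p_control = model.predict_proba(X.assign(treatment=0))[:, 1]
    return pd.DataFrame(
        {
            "uplift": p_treated - p_control,
            "prob_if_treated": p_treated,
            "prob_if_not_treated": p_control,
        },
        index=X.index,
    )


preds = predict_uplift(model, X_val)
print(preds.head())
```

The same predict_uplift call works on any new, unseen frame that has the same feature columns in the same order as the training data.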