I'm building a machine learning model and I need to populate a test dataframe with synthetic data. I have time series data that currently looks like this:
Date DayOfWeek Unit
2022-10-01 7 A
2022-10-02 1 A
2022-10-03 2 A
What I need is to duplicate all the date rows, but I need a row for each 'Unit' (A,B,C,D) like this:
Date DayOfWeek Unit
2022-10-01 7 A
2022-10-01 7 B
2022-10-01 7 C
2022-10-01 7 D
2022-10-02 1 A
2022-10-02 1 B
2022-10-02 1 C
2022-10-02 1 D
2022-10-03 2 A
2022-10-03 2 B
2022-10-03 2 C
2022-10-03 2 D
I found a previous answer that showed me how to repeat:
df.reindex(df.index.repeat(4)).reset_index(drop=True)
What's the best way to take that, but instead of repeating everything, only repeating 'Date' and "DayOfWeek' but populating A through D on 'Unit'?
CodePudding user response:
Suggest using itertools.product
for the purpose:
from itertools import product
df = pd.DataFrame(
data=product(
pd.Series(pd.date_range('2022-10-01', '2022-10-03', freq='D')),
"ABCD"
),
columns=("Date", "Unit"),
)
df["DayOfWeek"] = df["Date"].dt.dayofweek.add(1) # To Have Day of Week Starting with 1
df = df[["Date", "DayOfWeek", "Unit"]]
print(df)
Output:
Date DayOfWeek Unit
0 2022-10-01 6 A
1 2022-10-01 6 B
2 2022-10-01 6 C
3 2022-10-01 6 D
4 2022-10-02 7 A
5 2022-10-02 7 B
6 2022-10-02 7 C
7 2022-10-02 7 D
8 2022-10-03 1 A
9 2022-10-03 1 B
10 2022-10-03 1 C
11 2022-10-03 1 D
CodePudding user response:
here is one way to do it using merge
df: the dataframe from question
df2: created here
df2=pd.DataFrame({
'Unit': ['A','B','C','D']} )
df[['Date','DayOfWeek']].merge(df2, how='cross')
Date DayOfWeek Unit
0 2022-10-01 7 A
1 2022-10-01 7 B
2 2022-10-01 7 C
3 2022-10-01 7 D
4 2022-10-02 1 A
5 2022-10-02 1 B
6 2022-10-02 1 C
7 2022-10-02 1 D
8 2022-10-03 2 A
9 2022-10-03 2 B
10 2022-10-03 2 C
11 2022-10-03 2 D
CodePudding user response:
One option is with complete from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd
df.complete(('Date', 'DayOfWeek'), unit)
Date DayOfWeek Unit
0 2022-10-01 7 A
1 2022-10-01 7 B
2 2022-10-01 7 C
3 2022-10-01 7 D
4 2022-10-02 1 A
5 2022-10-02 1 B
6 2022-10-02 1 C
7 2022-10-02 1 D
8 2022-10-03 2 A
9 2022-10-03 2 B
10 2022-10-03 2 C
11 2022-10-03 2 D
Another option is with merge
:
ser = pd.Series(['A','B','C','D'], name='Unit')
df.drop(columns='Unit').merge(ser, how='cross')
Date DayOfWeek Unit
0 2022-10-01 7 A
1 2022-10-01 7 B
2 2022-10-01 7 C
3 2022-10-01 7 D
4 2022-10-02 1 A
5 2022-10-02 1 B
6 2022-10-02 1 C
7 2022-10-02 1 D
8 2022-10-03 2 A
9 2022-10-03 2 B
10 2022-10-03 2 C
11 2022-10-03 2 D