Home > Enterprise >  Pandas: Create multiple rows of 'dummy data' from one row
Pandas: Create multiple rows of 'dummy data' from one row

Time:09-30

I'm building a machine learning model and I need to populate a test dataframe with synthetic data. I have time series data that currently looks like this:

Date    DayOfWeek   Unit 
2022-10-01   7   A
2022-10-02   1   A
2022-10-03   2   A

What I need is to duplicate all the date rows, but I need a row for each 'Unit' (A,B,C,D) like this:

Date    DayOfWeek   Unit 
2022-10-01   7   A
2022-10-01   7   B
2022-10-01   7   C
2022-10-01   7   D
2022-10-02   1   A
2022-10-02   1   B
2022-10-02   1   C
2022-10-02   1   D
2022-10-03   2   A
2022-10-03   2   B
2022-10-03   2   C
2022-10-03   2   D

I found a previous answer that showed me how to repeat:

df.reindex(df.index.repeat(4)).reset_index(drop=True)

What's the best way to take that, but instead of repeating everything, only repeating 'Date' and "DayOfWeek' but populating A through D on 'Unit'?

CodePudding user response:

Suggest using itertools.product for the purpose:

from itertools import product

df = pd.DataFrame(
    data=product(
        pd.Series(pd.date_range('2022-10-01', '2022-10-03', freq='D')),
        "ABCD"
    ),
    columns=("Date", "Unit"),
)
df["DayOfWeek"] = df["Date"].dt.dayofweek.add(1)  # To Have Day of Week Starting with 1
df = df[["Date", "DayOfWeek", "Unit"]]
print(df)

Output:

         Date  DayOfWeek Unit
0  2022-10-01          6    A
1  2022-10-01          6    B
2  2022-10-01          6    C
3  2022-10-01          6    D
4  2022-10-02          7    A
5  2022-10-02          7    B
6  2022-10-02          7    C
7  2022-10-02          7    D
8  2022-10-03          1    A
9  2022-10-03          1    B
10 2022-10-03          1    C
11 2022-10-03          1    D

CodePudding user response:

here is one way to do it using merge

df: the dataframe from question
df2: created here

df2=pd.DataFrame({
    'Unit': ['A','B','C','D']} )

df[['Date','DayOfWeek']].merge(df2, how='cross')
    Date        DayOfWeek   Unit
0   2022-10-01          7   A
1   2022-10-01          7   B
2   2022-10-01          7   C
3   2022-10-01          7   D
4   2022-10-02          1   A
5   2022-10-02          1   B
6   2022-10-02          1   C
7   2022-10-02          1   D
8   2022-10-03          2   A
9   2022-10-03          2   B
10  2022-10-03          2   C
11  2022-10-03          2   D

CodePudding user response:

One option is with complete from pyjanitor:

# pip install pyjanitor
import janitor
import pandas as pd

df.complete(('Date', 'DayOfWeek'), unit)
          Date  DayOfWeek Unit
0   2022-10-01          7    A
1   2022-10-01          7    B
2   2022-10-01          7    C
3   2022-10-01          7    D
4   2022-10-02          1    A
5   2022-10-02          1    B
6   2022-10-02          1    C
7   2022-10-02          1    D
8   2022-10-03          2    A
9   2022-10-03          2    B
10  2022-10-03          2    C
11  2022-10-03          2    D

Another option is with merge :

ser = pd.Series(['A','B','C','D'], name='Unit')
df.drop(columns='Unit').merge(ser, how='cross')
          Date  DayOfWeek Unit
0   2022-10-01          7    A
1   2022-10-01          7    B
2   2022-10-01          7    C
3   2022-10-01          7    D
4   2022-10-02          1    A
5   2022-10-02          1    B
6   2022-10-02          1    C
7   2022-10-02          1    D
8   2022-10-03          2    A
9   2022-10-03          2    B
10  2022-10-03          2    C
11  2022-10-03          2    D
  • Related