Home > Software design >  How to decompose cohort data?
How to decompose cohort data?

Time:09-23

I'm trying to decompose cohort data into time series for further analysis. I'm imagining the algorithm pretty well, but my code doesn't work at all.

The input data in df is like:

Cohort Day 0 1 2 3 4 5
2020-12-27 5.87 4.9 2.89 1.47 1.38 0.95
2020-12-28 13.2 3.1 0.79 1.47 1.38 0.95

I'm trying to decompose it in this format:

day sum
2020-12-27 5.87
2020-12-28 4.9
2020-12-29 2.89
2020-12-30 1.47
2020-12-31 1.38
2020-01-01 0.95
2020-12-28 13.2
2020-12-29 3.1
2020-12-30 0.79
2020-12-31 1.47
2020-01-01 1.38
2020-01-02 0.95

To achieve that I created an empty dataframe test and then I'm using for loop to create a column with dates at first:

    for row in test.itertuples(): 
        test[0:5, 0] = df['Cohort Day']   df.apply(lambda x: int(str(df.iloc[0, 4:].columns)) for x in df.iteritems())
        test[0:5, 1] = df[0, 1:].transpose()

But all I receive is an empty test dataframe.

Any suggestions will be appreciated!

CodePudding user response:

Avoid using looping codes which are slow. Use fast vectorized Pandas built-in functions whenever possible.

You can transform the dataframe from wide to long by .stack(). Set day as Cohort Day plus the day offsets 0, 1, ..., 5, as follows:

# convert `Cohort Day` to datetime format
df['Cohort Day'] = pd.to_datetime(df['Cohort Day'])

# transform from wide to long
df2 = (df.set_index('Cohort Day')
         .rename_axis(columns='day_offset')
         .stack()
         .reset_index(name='sum')
      )

# convert day offsets 0, 1, 2, ..., 5 to timedelta format
df2['day_offset'] = pd.to_timedelta(df2['day_offset'].astype(int), unit='d')

# set up column `day` as the `Cohort Day`   day offsets
df2['day'] = df2['Cohort Day']   df2['day_offset']

# Get the desired columns
df_out = df2[['day', 'sum']]

Result:

print(df_out)

          day    sum
0  2020-12-27   5.87
1  2020-12-28   4.90
2  2020-12-29   2.89
3  2020-12-30   1.47
4  2020-12-31   1.38
5  2021-01-01   0.95
6  2020-12-28  13.20
7  2020-12-29   3.10
8  2020-12-30   0.79
9  2020-12-31   1.47
10 2021-01-01   1.38
11 2021-01-02   0.95
  • Related