Home > Software design >  Extract weekly data from daily and reshape it from long to wide format using Pandas
Extract weekly data from daily and reshape it from long to wide format using Pandas

Time:11-30

Given a sample data as follows, I hope to extract one data entry for each week, if for the week having multiple entries, then I will use the largest weekday's data as for that week:

          date  variable      value
0    2020-11-4  quantity   564.0000
1   2020-11-11  quantity   565.0000
2   2020-11-18  quantity   566.0000
3   2020-11-25  quantity   566.0000
4    2020-11-2     price  1829.1039
5    2020-11-3     price  1789.5883
6    2020-11-4     price  1755.4307
7    2020-11-5     price  1750.0727
8    2020-11-6     price  1746.7239
9    2020-11-9     price  1756.1005
10  2020-11-10     price  1752.0820
11  2020-11-11     price  1814.3693
12  2020-11-12     price  1833.7922
13  2020-11-13     price  1833.7922
14  2020-11-16     price  1784.2302
15  2020-11-17     price  1764.1376
16  2020-11-18     price  1770.1654
17  2020-11-19     price  1757.4400
18  2020-11-20     price  1770.1654

To get week number of each date, I use df['week_number'] = pd.to_datetime(df['date']).dt.week.

          date  variable      value  week_number
0    2020-11-4  quantity   564.0000           45  --> to keep
1   2020-11-11  quantity   565.0000           46  --> to keep
2   2020-11-18  quantity   566.0000           47  --> to keep
3   2020-11-25  quantity   566.0000           48  --> to keep
4    2020-11-2     price  1829.1039           45
5    2020-11-3     price  1789.5883           45
6    2020-11-4     price  1755.4307           45
7    2020-11-5     price  1750.0727           45
8    2020-11-6     price  1746.7239           45  --> to keep, since it's the largest weekday for this week
9    2020-11-9     price  1756.1005           46
10  2020-11-10     price  1752.0820           46
11  2020-11-11     price  1814.3693           46
12  2020-11-12     price  1833.7922           46
13  2020-11-13     price  1833.7922           46  --> to keep, since it's the largest weekday for this week
14  2020-11-16     price  1784.2302           47
15  2020-11-17     price  1764.1376           47
16  2020-11-18     price  1770.1654           47
17  2020-11-19     price  1757.4400           47
18  2020-11-20     price  1770.1654           47  --> to keep, since it's the largest weekday for this week

Finally, I will reshape rows indicating to_keep to the expected result as follow:

   variable  the_45th_week  the_46th_week  the_47th_week  the_48th_week
0  quantity       564.0000       565.0000       566.0000          566.0
1     price      1756.1005      1833.7922      1770.1654            NaN

How could I manipulate data to get the expected result? Sincere thanks.

EDIT:

df = df.sort_values(by=['variable','date'], ascending=False)
df.drop_duplicates(['variable', 'week_number'], keep='last')

Out:

          date  variable      value  week_number
0    2020-11-4  quantity   564.0000           45
3   2020-11-25  quantity   566.0000           48
2   2020-11-18  quantity   566.0000           47
1   2020-11-11  quantity   565.0000           46
4    2020-11-2     price  1829.1039           45
14  2020-11-16     price  1784.2302           47
10  2020-11-10     price  1752.0820           46

CodePudding user response:

In your solution is possible add pivot with rename:

df['week_number'] = pd.to_datetime(df['date']).dt.week
df = df.sort_values(by=['variable','date'], ascending=False)
df = df.drop_duplicates(['variable', 'week_number'], keep='last')

f = lambda x: f'the_{x}th_week'
out = df.pivot('variable','week_number','value').rename(columns=f)
print(out)
week_number  the_45th_week  the_46th_week  the_47th_week  the_48th_week
variable                                                               
price            1829.1039       1752.082      1784.2302            NaN
quantity          564.0000        565.000       566.0000          566.0

Or remove DataFrame.drop_duplicates, so is possible use DataFrame.pivot_table with aggregate function last:

df['week_number'] = pd.to_datetime(df['date']).dt.week
df = df.sort_values(by=['variable','date'], ascending=False)

f = lambda x: f'the_{x}th_week'
out = df.pivot_table(index='variable',columns='week_number',values='value', aggfunc='last').rename(columns=f)

EDIT: to get an exact same result as the expected one:

out.reset_index().rename_axis(None, axis=1)

Out:

   variable  the_45th_week  the_46th_week  the_47th_week  the_48th_week
0     price      1829.1039       1752.082      1784.2302            NaN
1  quantity       564.0000        565.000       566.0000          566.0
  • Related