Given a sample data as follows, I hope to extract one data entry for each week, if for the week having multiple entries, then I will use the largest weekday's data as for that week:
date variable value
0 2020-11-4 quantity 564.0000
1 2020-11-11 quantity 565.0000
2 2020-11-18 quantity 566.0000
3 2020-11-25 quantity 566.0000
4 2020-11-2 price 1829.1039
5 2020-11-3 price 1789.5883
6 2020-11-4 price 1755.4307
7 2020-11-5 price 1750.0727
8 2020-11-6 price 1746.7239
9 2020-11-9 price 1756.1005
10 2020-11-10 price 1752.0820
11 2020-11-11 price 1814.3693
12 2020-11-12 price 1833.7922
13 2020-11-13 price 1833.7922
14 2020-11-16 price 1784.2302
15 2020-11-17 price 1764.1376
16 2020-11-18 price 1770.1654
17 2020-11-19 price 1757.4400
18 2020-11-20 price 1770.1654
To get week number of each date, I use df['week_number'] = pd.to_datetime(df['date']).dt.week
.
date variable value week_number
0 2020-11-4 quantity 564.0000 45 --> to keep
1 2020-11-11 quantity 565.0000 46 --> to keep
2 2020-11-18 quantity 566.0000 47 --> to keep
3 2020-11-25 quantity 566.0000 48 --> to keep
4 2020-11-2 price 1829.1039 45
5 2020-11-3 price 1789.5883 45
6 2020-11-4 price 1755.4307 45
7 2020-11-5 price 1750.0727 45
8 2020-11-6 price 1746.7239 45 --> to keep, since it's the largest weekday for this week
9 2020-11-9 price 1756.1005 46
10 2020-11-10 price 1752.0820 46
11 2020-11-11 price 1814.3693 46
12 2020-11-12 price 1833.7922 46
13 2020-11-13 price 1833.7922 46 --> to keep, since it's the largest weekday for this week
14 2020-11-16 price 1784.2302 47
15 2020-11-17 price 1764.1376 47
16 2020-11-18 price 1770.1654 47
17 2020-11-19 price 1757.4400 47
18 2020-11-20 price 1770.1654 47 --> to keep, since it's the largest weekday for this week
Finally, I will reshape rows indicating to_keep
to the expected result as follow:
variable the_45th_week the_46th_week the_47th_week the_48th_week
0 quantity 564.0000 565.0000 566.0000 566.0
1 price 1756.1005 1833.7922 1770.1654 NaN
How could I manipulate data to get the expected result? Sincere thanks.
EDIT:
df = df.sort_values(by=['variable','date'], ascending=False)
df.drop_duplicates(['variable', 'week_number'], keep='last')
Out:
date variable value week_number
0 2020-11-4 quantity 564.0000 45
3 2020-11-25 quantity 566.0000 48
2 2020-11-18 quantity 566.0000 47
1 2020-11-11 quantity 565.0000 46
4 2020-11-2 price 1829.1039 45
14 2020-11-16 price 1784.2302 47
10 2020-11-10 price 1752.0820 46
CodePudding user response:
In your solution is possible add pivot
with rename
:
df['week_number'] = pd.to_datetime(df['date']).dt.week
df = df.sort_values(by=['variable','date'], ascending=False)
df = df.drop_duplicates(['variable', 'week_number'], keep='last')
f = lambda x: f'the_{x}th_week'
out = df.pivot('variable','week_number','value').rename(columns=f)
print(out)
week_number the_45th_week the_46th_week the_47th_week the_48th_week
variable
price 1829.1039 1752.082 1784.2302 NaN
quantity 564.0000 565.000 566.0000 566.0
Or remove DataFrame.drop_duplicates
, so is possible use DataFrame.pivot_table
with aggregate function last
:
df['week_number'] = pd.to_datetime(df['date']).dt.week
df = df.sort_values(by=['variable','date'], ascending=False)
f = lambda x: f'the_{x}th_week'
out = df.pivot_table(index='variable',columns='week_number',values='value', aggfunc='last').rename(columns=f)
EDIT: to get an exact same result as the expected one:
out.reset_index().rename_axis(None, axis=1)
Out:
variable the_45th_week the_46th_week the_47th_week the_48th_week
0 price 1829.1039 1752.082 1784.2302 NaN
1 quantity 564.0000 565.000 566.0000 566.0