I have a dataframe with stock data for multiple companies that looks like this:
dict = {'company': {201156: 18673,
201157: 18673,
201158: 18673,
201159: 18673,
201160: 18673,
201161: 18673,
201162: 18673,
201162: 18673},
'priceClose': {201156: 18.1,
201157: 17.85,
201158: 17.97,
201159: 18.29,
201160: 18.74,
201161: 18.5,
201162: 18.35,
201162: 18.54},
'pricingDate': {201156: Timestamp('2020-09-21 00:00:00'),
201157: Timestamp('2020-09-22 00:00:00'),
201158: Timestamp('2020-09-23 00:00:00'),
201159: Timestamp('2020-09-24 00:00:00'),
201160: Timestamp('2020-09-25 00:00:00'),
201161: Timestamp('2020-09-28 00:00:00'),
201162: Timestamp('2020-09-29 00:00:00'),
201162: Timestamp('2020-09-30 00:00:00')},
'industryName': {201156: 'Energy',
201157: 'Energy',
201158: 'Energy',
201159: 'Energy',
201160: 'Energy',
201161: 'Energy',
201162: 'Energy',
201162: 'Energy'},
'indexCode': {201156: 'sp500',
201157: 'sp500',
201158: 'sp500',
201159: 'sp500',
201160: 'sp500',
201161: 'sp500',
201162: 'sp500',
201162: 'sp500'}}
df = pd.DataFrame(dict)
I want to basically generate a df with all the columns in the current df, but the priceClose
should be the month end value value. So for the example shown, the priceClose
for company 18673
should be 18.54
as it's the last value for that month.
I tried the following code using pd.resample
but it doesn't seem to work with a multiindex.
df.set_index(['pricingDate','company','industryName','indexCode'])['priceClose'].resample('M').ffill()
Any suggestions on how I can make this work?
CodePudding user response:
If the dates are sorted, use groupby.transform
:
g = pd.to_datetime(df['pricingDate']).dt.to_period('M')
df['priceClose'] = df.groupby(g)['priceClose'].transform('last')
Output:
company priceClose pricingDate industryName indexCode
201156 18673 18.54 2020-09-21 Energy sp500
201157 18673 18.54 2020-09-22 Energy sp500
201158 18673 18.54 2020-09-23 Energy sp500
201159 18673 18.54 2020-09-24 Energy sp500
201160 18673 18.54 2020-09-25 Energy sp500
201161 18673 18.54 2020-09-28 Energy sp500
201162 18673 18.54 2020-09-30 Energy sp500
If the dates are not in order, you can sort_values
:
g = pd.to_datetime(df['pricingDate']).dt.to_period('M')
df['priceClose'] = df.sort_values(by='priceClose').groupby(g)['priceClose'].transform('last')
Or without sorting, with idxmax
and a double map
:
g = pd.to_datetime(df['pricingDate']).dt.to_period('M')
s = df['priceClose'].groupby(g).idxmax()
df['priceClose'] = g.map(s.map(df['priceClose']))
Example on the shuffled input:
company priceClose pricingDate industryName indexCode
201158 18673 18.74 2020-09-23 Energy sp500
201161 18673 18.74 2020-09-28 Energy sp500
201156 18673 18.74 2020-09-21 Energy sp500
201162 18673 18.74 2020-09-30 Energy sp500
201160 18673 18.74 2020-09-25 Energy sp500
201157 18673 18.74 2020-09-22 Energy sp500
201159 18673 18.74 2020-09-24 Energy sp500
CodePudding user response:
I modified your example set slightly to include two months. Grouping by month and year, finding the last value in a separate column (lastClose), and filling upwards solves the problem.
df.assign(lastPrice = df.groupby([df.pricingDate.dt.month, df.pricingDate.dt.year])
.tail(1).priceClose).bfill()
Output:
company priceClose pricingDate industryName indexCode month year lastClose
0 18673 18.1 2020-09-21 Energy sp500 9 2020 18.29
1 18673 17.85 2020-09-22 Energy sp500 9 2020 18.29
2 18673 17.97 2020-09-23 Energy sp500 9 2020 18.29
3 18673 18.29 2020-09-24 Energy sp500 9 2020 18.29
4 18673 18.74 2020-10-25 Energy sp500 10 2020 18.54
5 18673 18.5 2020-10-28 Energy sp500 10 2020 18.54
6 18673 18.54 2020-10-30 Energy sp500 10 2020 18.54