I am looking to calculate the percentage increase or decrease between the first and last non-na value for the following dataset:
Year | Company | Data |
---|---|---|
2019 | X | 341976.00 |
2020 | X | 1.000 |
2021 | X | 282872.00 |
2019 | Y | NaN |
2020 | Y | NaN |
2021 | Y | NaN |
2019 | Z | 4394.00 |
2020 | Z | 173.70 |
2021 | Z | 518478.00 |
As I want the relative change I would expect the formula to do something like:
(last non-na value)/(first non-na value)-1
This should return something like:
Year | Company | Data | Data |
---|---|---|---|
2019 | X | 341976.00 | NaN |
2020 | X | 1.000 | NaN |
2021 | X | 282872.00 | -0.17 |
2019 | Y | NaN | NaN |
2020 | Y | NaN | NaN |
2021 | Y | NaN | NaN |
2019 | Z | 4394.00 | NaN |
2020 | Z | 173.70 | NaN |
2021 | Z | 518478.00 | 11.700 |
I have tried to combine groupby based on the company field with the first_valid_index but havent had any luck finding a solution. What is the most efficient way of calculating the relative change as above?
CodePudding user response:
If aggregate GroupBy.first
and
GroupBy.last
it omit missing values, so is possible divide values and subtract 1
:
s = df.groupby('Company')['Data'].agg(['last','first']).eval('last / first').sub(1)
Then found index values for last non missing values per Company
:
idx = df.dropna(subset=['Data']).drop_duplicates(['Company'], keep='last').index
And mapping only matchded rows by Series.map
:
df.loc[idx, 'Date'] = df.loc[idx, 'Company'].map(s)
print (df)
Year Company Data Date
0 2019 X 341976.0 NaN
1 2020 X 1.0 NaN
2 2021 X 282872.0 -0.172831
3 2019 Y NaN NaN
4 2020 Y NaN NaN
5 2021 Y NaN NaN
6 2019 Z 4394.0 NaN
7 2020 Z 173.7 NaN
8 2021 Z 518478.0 116.996814
CodePudding user response:
To find first non-na value u can:
- iterate from first to last element of column and break if u value is not
np.nan
, - use
.dropna
method on dataframe and gets 1st element from the result df.
To find last:
- iterate from the last to first and (just like above),
- use
dropna
and gets value from last row