I am trying to calculate the difference between the last and third value for specific month and year for each column (ex1, ex2,...) in my dataset and then save into new dataframe.
My dataset looks like this:
month | year | ex1 | ex2 |
---|---|---|---|
12 | 1995 | 55 | 55 |
12 | 1995 | 46 | 33 |
12 | 1995 | 33 | 12 |
12 | 1995 | 15 | 17 |
12 | 1995 | 6 | 16 |
12 | 1995 | 35 | 32 |
12 | 1995 | 67 | 22 |
12 | 1995 | 43 | 25 |
12 | 1995 | 31 | 26 |
12 | 1995 | 34 | 11 |
12 | 1995 | 53 | 14 |
12 | 1995 | 72 | 60 |
1 | 1996 | 34 | 90 |
1 | 1996 | 55 | 14 |
1 | 1996 | 58 | 24 |
1 | 1996 | 54 | 23 |
1 | 1996 | 33 | 20 |
1 | 1996 | 24 | 45 |
1 | 1996 | 23 | 33 |
1 | 1996 | 15 | 38 |
1 | 1996 | 11 | 50 |
1 | 1996 | 79 | 55 |
1 | 1996 | 80 | 71 |
1 | 1996 | 88 | 74 |
So for example, I want to calculate for ex1:
Last value of month 12 and year 1995 -> 72
Third value of month 12 and year 1995 -> 33
Difference: 72 - 33 = 39
Last value of month 1 and year 1996 -> 88
Third value of month 1 and year 1996 -> 58
Difference: 88 - 58 = 30
For ex2:
Last value of month 12 and year 1995 -> 88
Third value of month 12 and year 1995 -> 58
Difference: 88 - 58 = 30
Last value of month 1 and year 1996 -> 74
Third value of month 1 and year 1996 -> 24
Difference: 74 - 24 = 50
After that, I want to write these values a new dataframe as follow:
Name | Month | Year | Difference |
---|---|---|---|
ex1 | 12 | 1995 | 39 |
ex1 | 1 | 1996 | 30 |
ex2 | 12 | 1995 | 30 |
ex2 | 1 | 1996 | 50 |
I tried to do the following:
df = pd.read_excel ('dataset') #See above
df_new = pd.DataFrame(columns=["Name","Month","Year","Difference"])
for colname, colitems in df.iloc.[:,2:].iteritems():
for rownum, rowitem in colitems.iteritems():
if (df["month"][rownum] != df["month"][rownum 1]) & (df["year"][rownum] != df["year"][rownum 1]):
last= df[colname][rownum]
third= df[colname][2]
diff = last - third
df_new.assign(name = df[colname][rownum], month = df["month"][rownum],
year= df["year"][rownum], difference= diff)
When I try to run this, I had the KeyError: 79 in the if clause.
Also, my code calculates the difference only for month 12 and year 1995 for ex1. How can I move on to next month and year and the next value: ex2. Am I in the right direction? Do you have any suggestions that would help my problem?
If there's anything I've left out that would be helpful to solving this, please let me know so I can rectify it immediately.
CodePudding user response:
The basic operation you are looking for is a one-liner:
res = df.groupby(['month','year']).last() - df.groupby(['month','year']).nth(2)
print(res)
ex1 ex2
month year
1 1996 30 50
12 1995 39 48
To get this into the shape of your desired output, try as follows:
res.reset_index(drop=False,inplace=True)
res.sort_values(['year','month'], inplace=True)
res = pd.melt(res,
id_vars=['month','year'],
value_vars=['ex1','ex2'],
var_name = 'name',
value_name='difference')
res = res.loc[:, ['name','month','year','difference']]
print(res)
name month year difference
0 ex1 12 1995 39
1 ex1 1 1996 30
2 ex2 12 1995 48
3 ex2 1 1996 50
CodePudding user response:
grp = df.groupby(['Month', 'Year'])
Then the following will return the differences you desire,
grp.last() - grp.nth(2)
No looping required :)
EDIT: here is the synonymous example I tested -- the answer is not to the requested shape, but the data is good
df = pd.DataFrame(
{
"month": [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2],
"year": [1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000],
"ex1": [11, 22, 33, 44, 55, 66, 77, 88, 22, 44, 66, 88, 1010, 1212, 1414, 1616],
"ex2": [22, 44, 66, 88, 1010, 1212, 1414, 1616, 11, 22, 33, 44, 55, 66, 77, 88],
}
)
Returns the following:
>>>
ex1 ex2
month year
1 1999 11 22
2000 22 11
2 1999 11 202
2000 202 11