I have a dataset that looks like this:
A B year CompanyName
1 nan 1999 tesla
4 3 2000 tesla
nan nan 2001 tesla
2 nan 2002 tesla
3 nan 1999 BMW
2 -1 2000 BMW
2 nan 2002 BMW
I have data for several companies, and I want to calculate a new column B equal to the difference between the current row and the previous row of column A, under these conditions:
- the difference is computed only if the two rows are consecutive years; if a year is missing in between, the difference should not be calculated
- if the value of A is missing for either of the two years, the difference should be nan
- the difference must be computed within the same company; I don't want differences taken across different companies
I wrote this code:
finalData[~(finalData.year.diff(1).isnull()) & (finalData.year.diff(1) ==1) &
(finalData.companyName)]
but I am stuck on the companyName condition. I appreciate your time and help.
P.S. I sorted my data by CompanyName and then by year before computing this difference.
CodePudding user response:
Assuming that your data is sorted by year (if not, use df = df.sort_values(by='year')), you can do:
g = df.groupby('CompanyName')
df['B'] = g['A'].diff().where(df['year'].eq(g['year'].shift() + 1))
output:
A B year CompanyName
0 1.0 NaN 1999 tesla
1 4.0 3.0 2000 tesla
2 NaN NaN 2001 tesla
3 2.0 NaN 2002 tesla
4 3.0 NaN 1999 BMW
5 2.0 -1.0 2000 BMW
6 2.0 NaN 2002 BMW
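
For anyone who wants to try this end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question by hand (the construction of df and the helper name prev_year_is_consecutive are my own assumptions, not part of the original post) and assumes the rows are already in year order within each company:

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question (column B is computed below).
df = pd.DataFrame({
    "A": [1, 4, np.nan, 2, 3, 2, 2],
    "year": [1999, 2000, 2001, 2002, 1999, 2000, 2002],
    "CompanyName": ["tesla"] * 4 + ["BMW"] * 3,
})

g = df.groupby("CompanyName")

# True only where the previous row of the same company is exactly one year earlier.
# The first row of each company has no previous row, so the comparison is False there.
prev_year_is_consecutive = df["year"].eq(g["year"].shift() + 1)

# Per-company difference of A, kept only where the years are consecutive.
# A NaN in either of the two rows already makes the difference NaN.
df["B"] = g["A"].diff().where(prev_year_is_consecutive)

print(df)
```

On this data, B is 3.0 for tesla 2000 and -1.0 for BMW 2000, and NaN everywhere else. The BMW 2002 row is the one the year check masks: A would give a difference of 0 there, but 2001 is missing, so it becomes NaN.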