In the following example, what would be the best to group so that there could be a new column that is formed by taking first year in each group and subtracting by current year. For example in in row with index 0 it would be NaN, row with index 1 , it would = 1, row with index 2 it would = 3, row with index 4 = 1 and so forth.
>>> import pandas as pd
>>> df = pd.DataFrame({'id': ['1', '1', '1', '2', '2', '3', '4', '4'],
... 'Year': [2000, 2001, 2003, 2004, 2005, 2002, 2001, 2003]})
>>> print(df)
id Year
0 1 2000
1 1 2001
2 1 2003
3 2 2004
4 2 2005
5 3 2002
6 4 2001
7 4 2003
CodePudding user response:
Transform Year
with first
to get the first year per id
, then subtract this from Year
column to get difference, finally mask
the values where difference is 0
:
s = df['Year'] - df.groupby('id')['Year'].transform('first')
df['col'] = s.mask(s == 0)
id Year col
0 1 2000 NaN
1 1 2001 1.0
2 1 2003 3.0
3 2 2004 NaN
4 2 2005 1.0
5 3 2002 NaN
6 4 2001 NaN
7 4 2003 2.0
CodePudding user response:
Although the answer from @Shubham is shorter than my answer, I want to put here another approach:
1. Looking for the firsts of each group
You need to know what are each first value of each group, for that reason, you are going to use Groupby.first method:
firsts = df.groupby('id').first()
print(first)
The first
variable is a dataframe with this structure:
Year
id
1 2000
2 2004
3 2002
4 2001
2. Looking how many values have each group
howmany = df.groupby('id').count().rename(columns={'Year': 'Howmany'})
print(howmany)
The howmany
variable is a dataframe with this structure:
Howmany
id
1 3
2 2
3 1
4 2
3. Create an information dataframe
information = firsts.join(howmany).reset_index()
print(information)
The information
variable is a dataframe with this structure:
id Year Howmany
0 1 2000 3
1 2 2004 2
2 3 2002 1
3 4 2001 2
4. Create a column with each first value of each group
lists_df = []
for value in information.itertuples():
lists_df.append(pd.DataFrame(data=[value.Year]*value.Howmany))
df['col'] = pd.concat(lists_df).reset_index(drop=True)
So far, our df
looks like this:
id Year col
0 1 2000 2000
1 1 2001 2000
2 1 2003 2000
3 2 2004 2004
4 2 2005 2004
5 3 2002 2002
6 4 2001 2001
7 4 2003 2001
You can realize that col
column has exactly the first value of each group
5. Use Groupby.shift
to shift each group once
df['col'] = df['Year'] - df.groupby('id')['col'].shift()
print(df)
All code together!
firsts = df.groupby('id').first()
howmany = df.groupby('id').count().rename(columns={'Year': 'Howmany'})
information = firsts.join(howmany).reset_index()
lists_df = []
for value in information.itertuples():
lists_df.append(pd.DataFrame(data=[value.Year]*value.Howmany))
df['col'] = pd.concat(lists_df).reset_index(drop=True)
df['col'] = df['Year'] - df.groupby('id')['col'].shift()
print(df)