In the following example, what would be the best to group so that there could be a new column that is formed by taking first year in each group and subtracting by current year. For example in in row with index 0 it would be NaN, row with index 1 , it would = 1, row with index 2 it would = 3, row with index 4 = 1 and so forth.
>>> import pandas as pd
>>> df = pd.DataFrame({'id': ['1', '1', '1', '2', '2', '3', '4', '4'],
... 'Year': [2000, 2001, 2003, 2004, 2005, 2002, 2001, 2003]})
>>> print(df)
id Year
0 1 2000
1 1 2001
2 1 2003
3 2 2004
4 2 2005
5 3 2002
6 4 2001
7 4 2003
CodePudding user response:
Transform Year
with first
to get the first year per id
, then subtract this from Year
column to get difference, finally mask
the values where difference is 0
:
s = df['Year'] - df.groupby('id')['Year'].transform('first')
df['col'] = s.mask(s == 0)
id Year col
0 1 2000 NaN
1 1 2001 1.0
2 1 2003 3.0
3 2 2004 NaN
4 2 2005 1.0
5 3 2002 NaN
6 4 2001 NaN
7 4 2003 2.0
CodePudding user response:
Although the answer from @Shubham is shorter than my answer, I want to put here another approach:
1. Looking for the firsts of each group
You need to know what are each first value of each group, for that reason, you are going to use Groupby.first
method:
firsts = df.groupby('id').first()
print(first)
The first
variable is a dataframe with this structure:
Year
id
1 2000
2 2004
3 2002
4 2001
2. Looking at how many values have each group
For this, you will use Groupby.count
method
howmany = df.groupby('id').count().rename(columns={'Year': 'Howmany'})
print(howmany)
The howmany
variable is a dataframe with this structure:
Howmany
id
1 3
2 2
3 1
4 2
3. Create an information dataframe
Information will be an auxiliary dataframe with information about what is the first Year
of each group and how many values has that group. You need to use a simple Dataframe.join
method
information = firsts.join(howmany).reset_index()
print(information)
The information
variable is a dataframe with this structure:
id Year Howmany
0 1 2000 3
1 2 2004 2
2 3 2002 1
3 4 2001 2
4. Create a column with each first value of each group
You are going to create a new column called col
which has the first Year
of each group repeated according to the number of values.
lists_df = []
for value in information.itertuples():
lists_df.append(pd.DataFrame(data=[value.Year]*value.Howmany))
df['col'] = pd.concat(lists_df).reset_index(drop=True)
So far, our df
looks like this:
id Year col
0 1 2000 2000 <--- First value of group 1, it is repeated 3 times because group 1 has 3 values
1 1 2001 2000
2 1 2003 2000
3 2 2004 2004 <--- First value of group 2, it is repeated 2 times because group 2 has 2 values
4 2 2005 2004
5 3 2002 2002
6 4 2001 2001
7 4 2003 2001
You can realize that col
column has exactly the first value of each group
5. Use Groupby.shift
to shift col
column by each group
So, if you do
df.groupby('id')['col'].shift()
Then, you will get:
col
0 NaN
1 2000.0
2 2000.0
3 NaN
4 2004.0
5 NaN
6 NaN
7 2001.0
Finally, a simple vectorized substraction is enough, between Year
and col
column:
df['col'] = df['Year'] - df.groupby('id')['col'].shift()
All code together!
firsts = df.groupby('id').first()
howmany = df.groupby('id').count().rename(columns={'Year': 'Howmany'})
information = firsts.join(howmany).reset_index()
lists_df = []
for value in information.itertuples():
lists_df.append(pd.DataFrame(data=[value.Year]*value.Howmany))
df['col'] = pd.concat(lists_df).reset_index(drop=True)
df['col'] = df['Year'] - df.groupby('id')['col'].shift()
print(df)
Output:
id Year col
0 1 2000 NaN
1 1 2001 1.0
2 1 2003 3.0
3 2 2004 NaN
4 2 2005 1.0
5 3 2002 NaN
6 4 2001 NaN
7 4 2003 2.0