How to create variables by substracting elements from group? e.g., value of each row of each group --CodePudding

In the following example, what would be the best to group so that there could be a new column that is formed by taking first year in each group and subtracting by current year. For example in in row with index 0 it would be NaN, row with index 1 , it would = 1, row with index 2 it would = 3, row with index 4 = 1 and so forth.

>>> import pandas as pd
>>> df = pd.DataFrame({'id': ['1', '1', '1', '2', '2', '3', '4', '4'],
...                    'Year': [2000, 2001, 2003, 2004, 2005, 2002, 2001, 2003]})
>>> print(df)
  id  Year
0  1  2000
1  1  2001
2  1  2003
3  2  2004
4  2  2005
5  3  2002
6  4  2001
7  4  2003

CodePudding user response：

Transform Year with first to get the first year per id, then subtract this from Year column to get difference, finally mask the values where difference is 0:

s = df['Year'] - df.groupby('id')['Year'].transform('first')
df['col'] = s.mask(s == 0)

  id  Year  col
0  1  2000  NaN
1  1  2001  1.0
2  1  2003  3.0
3  2  2004  NaN
4  2  2005  1.0
5  3  2002  NaN
6  4  2001  NaN
7  4  2003  2.0

CodePudding user response：

Although the answer from @Shubham is shorter than my answer, I want to put here another approach:

1. Looking for the firsts of each group

You need to know what are each first value of each group, for that reason, you are going to use Groupby.first method:

firsts = df.groupby('id').first()
print(first)

The first variable is a dataframe with this structure:

2. Looking at how many values have each group

For this, you will use Groupby.count method

howmany = df.groupby('id').count().rename(columns={'Year': 'Howmany'})
print(howmany)

The howmany variable is a dataframe with this structure:

3. Create an information dataframe

Information will be an auxiliary dataframe with information about what is the first Year of each group and how many values has that group. You need to use a simple Dataframe.join method

information = firsts.join(howmany).reset_index()
print(information)

The information variable is a dataframe with this structure:

   id   Year    Howmany
0   1   2000      3
1   2   2004      2
2   3   2002      1
3   4   2001      2

4. Create a column with each first value of each group

You are going to create a new column called col which has the first Year of each group repeated according to the number of values.

lists_df = []
for value in information.itertuples():
    lists_df.append(pd.DataFrame(data=[value.Year]*value.Howmany))

df['col'] = pd.concat(lists_df).reset_index(drop=True)

So far, our df looks like this:

    id  Year     col
0   1   2000    2000 <--- First value of group 1, it is repeated 3 times because group 1 has 3 values
1   1   2001    2000
2   1   2003    2000
3   2   2004    2004 <--- First value of group 2, it is repeated 2 times because group 2 has 2 values
4   2   2005    2004
5   3   2002    2002
6   4   2001    2001
7   4   2003    2001

You can realize that col column has exactly the first value of each group

5. Use `Groupby.shift` to shift `col` column by each group

So, if you do

df.groupby('id')['col'].shift()

Then, you will get:

        col
0       NaN
1    2000.0
2    2000.0
3       NaN
4    2004.0
5       NaN
6       NaN
7    2001.0

Finally, a simple vectorized substraction is enough, between Year and col column:

df['col'] = df['Year'] - df.groupby('id')['col'].shift()

All code together!

firsts = df.groupby('id').first()
howmany = df.groupby('id').count().rename(columns={'Year': 'Howmany'})
information = firsts.join(howmany).reset_index()
lists_df = []
for value in information.itertuples():
    lists_df.append(pd.DataFrame(data=[value.Year]*value.Howmany))
df['col'] = pd.concat(lists_df).reset_index(drop=True)
df['col'] = df['Year'] - df.groupby('id')['col'].shift()
print(df)

Output:

   id   Year    col
0   1   2000    NaN
1   1   2001    1.0
2   1   2003    3.0
3   2   2004    NaN
4   2   2005    1.0
5   3   2002    NaN
6   4   2001    NaN
7   4   2003    2.0