Home > Software engineering >  Easiest way to create variables by substracting elements from group? e.g., value of each row of each
Easiest way to create variables by substracting elements from group? e.g., value of each row of each

Time:05-20

In the following example, what would be the best to group so that there could be a new column that is formed by taking first year in each group and subtracting by current year. For example in in row with index 0 it would be NaN, row with index 1 , it would = 1, row with index 2 it would = 3, row with index 4 = 1 and so forth.

>>> import pandas as pd
>>> df = pd.DataFrame({'id': ['1', '1', '1', '2', '2', '3', '4', '4'],
...                    'Year': [2000, 2001, 2003, 2004, 2005, 2002, 2001, 2003]})
>>> print(df)
  id  Year
0  1  2000
1  1  2001
2  1  2003
3  2  2004
4  2  2005
5  3  2002
6  4  2001
7  4  2003

CodePudding user response:

Transform Year with first to get the first year per id, then subtract this from Year column to get difference, finally mask the values where difference is 0:

s = df['Year'] - df.groupby('id')['Year'].transform('first')
df['col'] = s.mask(s == 0)

  id  Year  col
0  1  2000  NaN
1  1  2001  1.0
2  1  2003  3.0
3  2  2004  NaN
4  2  2005  1.0
5  3  2002  NaN
6  4  2001  NaN
7  4  2003  2.0

CodePudding user response:

Although the answer from @Shubham is shorter than my answer, I want to put here another approach:

1. Looking for the firsts of each group

You need to know what are each first value of each group, for that reason, you are going to use Groupby.first method:

firsts = df.groupby('id').first()
print(first)

The first variable is a dataframe with this structure:

    Year
id  
1   2000
2   2004
3   2002
4   2001

2. Looking how many values have each group

howmany = df.groupby('id').count().rename(columns={'Year': 'Howmany'})
print(howmany)

The howmany variable is a dataframe with this structure:

    Howmany
id  
1     3
2     2
3     1
4     2

3. Create an information dataframe

information = firsts.join(howmany).reset_index()
print(information)

The information variable is a dataframe with this structure:

   id   Year    Howmany
0   1   2000      3
1   2   2004      2
2   3   2002      1
3   4   2001      2

4. Create a column with each first value of each group

lists_df = []
for value in information.itertuples():
    lists_df.append(pd.DataFrame(data=[value.Year]*value.Howmany))

df['col'] = pd.concat(lists_df).reset_index(drop=True)

So far, our df looks like this:

    id  Year     col
0   1   2000    2000
1   1   2001    2000
2   1   2003    2000
3   2   2004    2004
4   2   2005    2004
5   3   2002    2002
6   4   2001    2001
7   4   2003    2001

You can realize that col column has exactly the first value of each group

5. Use Groupby.shift to shift each group once

df['col'] = df['Year'] - df.groupby('id')['col'].shift()

print(df)

All code together!

firsts = df.groupby('id').first()
howmany = df.groupby('id').count().rename(columns={'Year': 'Howmany'})
information = firsts.join(howmany).reset_index()
lists_df = []
for value in information.itertuples():
    lists_df.append(pd.DataFrame(data=[value.Year]*value.Howmany))
df['col'] = pd.concat(lists_df).reset_index(drop=True)
df['col'] = df['Year'] - df.groupby('id')['col'].shift()
print(df)
  • Related