Create and calculate new rows based on other rows condition-CodePudding

I have the dataframe as follow:

dataframe generator:

df = pd.DataFrame({
    'year':[2000,2001,2002]*3,
    'id':['a']*3 ['b']*3 ['c']*3,
    'othernulcol': ['xyz']*3 [np.nan]*4 ['tyu']*2,
    'val':[np.nan,2,3,4,5,6,7,8,9]
})

data looks like:

   year id othernulcol  val
0  2000  a         xyz  NaN
1  2001  a         xyz  2.0
2  2002  a         xyz  3.0
3  2000  b         NaN  4.0
4  2001  b         NaN  5.0
5  2002  b         NaN  6.0
6  2000  c         NaN  7.0
7  2001  c         tyu  8.0
8  2002  c         tyu  9.0

I want to create new 3 rows from 2000 to 2002 that is the sum of row with id = a and b in the same year. othernulcol is just other column in dataframe. When creating new rows, just set those cols as np.NaN

Expected output:

    year  id othernulcol   val
0   2000   a         xyz   NaN
1   2001   a         xyz   2.0
2   2002   a         xyz   3.0
3   2000   b         NaN   4.0
4   2001   b         NaN   5.0
5   2002   b         NaN   6.0
6   2000   c         NaN   7.0
7   2001   c         tyu   8.0
8   2002   c         tyu   9.0
9   2000  ab         NaN   NaN
10  2001  ab         NaN  10.0
11  2002  ab         NaN  12.0

Thank you for reading

CodePudding user response：

Filter values by categories and convert year to index for align same years from another DataFrame, sum values by DataFrame.add and append to original DataFrame by concat:

cols = ['id','val']

df1 = df[df['id'].eq('a')].set_index('year')[cols]
df2 = df[df['id'].eq('b')].set_index('year')[cols]

df = pd.concat([df, df1.add(df2).reset_index()], ignore_index=True)
print (df)
    year  id othernulcol  val
0   2000   a         xyz  NaN
1   2001   a         xyz  2.0
2   2002   a         xyz  3.0
3   2000   b         NaN  4.0
4   2001   b         NaN  5.0
5   2002   b         NaN  6.0
6   2000   c         NaN  7.0
7   2001   c         tyu  8.0
8   2002   c         tyu  9.0
9   2000  ab         NaN  NaN
10  2001  ab         NaN  7.0
11  2002  ab         NaN  9.0

CodePudding user response：

Another solution could be as follows:

Select rows from df with df.id.isin(['a','b'] (see Series.isin) and apply df.groupby to year.
For the aggregration, use sum for column id. For column val use a lambda function to apply Series.sum, which allows skipna=False.
Finally, use pd.concat to add the result to the original df with ignoring the index.

out = pd.concat([df,df[df.id.isin(['a','b'])]\
                 .groupby('year', as_index=False)\
                     .agg({'id':'sum', 
                           'val':lambda x: x.sum(skipna=False)})], 
                ignore_index=True)
    
print(out)

    year  id othernulcol  val
0   2000   a         xyz  NaN
1   2001   a         xyz  2.0
2   2002   a         xyz  3.0
3   2000   b         NaN  4.0
4   2001   b         NaN  5.0
5   2002   b         NaN  6.0
6   2000   c         NaN  7.0
7   2001   c         tyu  8.0
8   2002   c         tyu  9.0
9   2000  ab         NaN  NaN
10  2001  ab         NaN  7.0
11  2002  ab         NaN  9.0