Create and calculate new rows based on other rows condition

Time:09-20

I have the following DataFrame.

Code to generate it:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2001, 2002] * 3,
    'id': ['a'] * 3 + ['b'] * 3 + ['c'] * 3,
    'othernulcol': ['xyz'] * 3 + [np.nan] * 4 + ['tyu'] * 2,
    'val': [np.nan, 2, 3, 4, 5, 6, 7, 8, 9]
})

The data looks like this:

   year id othernulcol  val
0  2000  a         xyz  NaN
1  2001  a         xyz  2.0
2  2002  a         xyz  3.0
3  2000  b         NaN  4.0
4  2001  b         NaN  5.0
5  2002  b         NaN  6.0
6  2000  c         NaN  7.0
7  2001  c         tyu  8.0
8  2002  c         tyu  9.0

I want to add three new rows, one per year from 2000 to 2002, each holding the sum of val for the rows with id = a and id = b in that year. othernulcol stands in for the other columns of the DataFrame; in the new rows, those columns should just be set to np.nan.

Expected output:

    year  id othernulcol   val
0   2000   a         xyz   NaN
1   2001   a         xyz   2.0
2   2002   a         xyz   3.0
3   2000   b         NaN   4.0
4   2001   b         NaN   5.0
5   2002   b         NaN   6.0
6   2000   c         NaN   7.0
7   2001   c         tyu   8.0
8   2002   c         tyu   9.0
9   2000  ab         NaN   NaN
10  2001  ab         NaN   7.0
11  2002  ab         NaN   9.0

Thank you for reading

CodePudding user response:

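Filter the rows for each id and set year as the index, so that the two DataFrames align on the same years. Then sum them with DataFrame.add (for the string column id this concatenates 'a' and 'b' into 'ab', and a NaN in val propagates to the result) and append the result to the original DataFrame with concat: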

cols = ['id','val']

df1 = df[df['id'].eq('a')].set_index('year')[cols]
df2 = df[df['id'].eq('b')].set_index('year')[cols]

df = pd.concat([df, df1.add(df2).reset_index()], ignore_index=True)
print(df)

    year  id othernulcol  val
0   2000   a         xyz  NaN
1   2001   a         xyz  2.0
2   2002   a         xyz  3.0
3   2000   b         NaN  4.0
4   2001   b         NaN  5.0
5   2002   b         NaN  6.0
6   2000   c         NaN  7.0
7   2001   c         tyu  8.0
8   2002   c         tyu  9.0
9   2000  ab         NaN  NaN
10  2001  ab         NaN  7.0
11  2002  ab         NaN  9.0
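
To see why this answer sets year as the index before adding, here is a minimal sketch that rebuilds the question's data and compares the sum with and without that step. The names a, b, unaligned and aligned are illustrative only and do not appear in the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2001, 2002] * 3,
    'id': ['a'] * 3 + ['b'] * 3 + ['c'] * 3,
    'othernulcol': ['xyz'] * 3 + [np.nan] * 4 + ['tyu'] * 2,
    'val': [np.nan, 2, 3, 4, 5, 6, 7, 8, 9],
})

a = df[df['id'].eq('a')]
b = df[df['id'].eq('b')]

# Default index: the 'a' rows carry labels 0-2 and the 'b' rows 3-5,
# so no label matches and every sum comes out as NaN.
unaligned = a['val'].add(b['val'])

# Indexed by year: both sides share the labels 2000-2002, so rows pair up.
aligned = a.set_index('year')['val'].add(b.set_index('year')['val'])

print(unaligned.isna().all())  # True
print(aligned.tolist())        # [nan, 7.0, 9.0]
```

pandas arithmetic matches rows by index label, not by position, which is why the year index is needed here.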

CodePudding user response:

Another solution could be as follows:

  • Select the rows of df whose id is 'a' or 'b' with df.id.isin(['a','b']) (see Series.isin) and group them by year with df.groupby.
  • For the aggregation, use sum for column id, which concatenates the strings. For column val, use a lambda function that calls Series.sum, since that accepts skipna=False and therefore keeps NaN in the result.
  • Finally, use pd.concat with ignore_index=True to append the result to the original df.

# Sum the 'a' and 'b' rows per year; skipna=False keeps NaN in val
summed = (
    df[df.id.isin(['a', 'b'])]
    .groupby('year', as_index=False)
    .agg({'id': 'sum', 'val': lambda x: x.sum(skipna=False)})
)
out = pd.concat([df, summed], ignore_index=True)

print(out)

    year  id othernulcol  val
0   2000   a         xyz  NaN
1   2001   a         xyz  2.0
2   2002   a         xyz  3.0
3   2000   b         NaN  4.0
4   2001   b         NaN  5.0
5   2002   b         NaN  6.0
6   2000   c         NaN  7.0
7   2001   c         tyu  8.0
8   2002   c         tyu  9.0
9   2000  ab         NaN  NaN
10  2001  ab         NaN  7.0
11  2002  ab         NaN  9.0
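
The reason for the lambda with skipna=False in this answer can be checked with a small comparison. This is only a sketch on the question's data; the names ab, default_sum and strict_sum are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2001, 2002] * 3,
    'id': ['a'] * 3 + ['b'] * 3 + ['c'] * 3,
    'othernulcol': ['xyz'] * 3 + [np.nan] * 4 + ['tyu'] * 2,
    'val': [np.nan, 2, 3, 4, 5, 6, 7, 8, 9],
})

ab = df[df['id'].isin(['a', 'b'])]

# The built-in groupby sum skips NaN, so year 2000 yields 4.0.
default_sum = ab.groupby('year')['val'].sum()

# Series.sum(skipna=False) propagates NaN, matching the expected output.
strict_sum = ab.groupby('year')['val'].agg(lambda s: s.sum(skipna=False))

print(default_sum.tolist())  # [4.0, 7.0, 9.0]
print(strict_sum.tolist())   # [nan, 7.0, 9.0]
```

With the default sum, the 2000 row for 'ab' would show 4.0 instead of the NaN the question asks for.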