I am very new to the Python pandas module.
Suppose I have a DataFrame as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [12, 12, 12, 15, 16, 141, 141, 141, 141],
    'Column B': ['Apple', 'Apple', 'Apple', 'Red', 'Blue', 'Yellow', 'Yellow', 'Yellow', 'Yellow'],
    'Column C': [100, 50, np.nan, 23, np.nan, 199, np.nan, 1, np.nan]
})
or I have a data table as follows:
|   | Column A | Column B | Column C |
|---|----------|----------|----------|
| 0 | 12       | Apple    | 100      |
| 1 | 12       | Apple    | 50       |
| 2 | 12       | Apple    | NaN      |
| 3 | 15       | Red      | 23       |
| 4 | 16       | Blue     | NaN      |
| 5 | 141      | Yellow   | 199      |
| 6 | 141      | Yellow   | NaN      |
| 7 | 141      | Yellow   | 1        |
| 8 | 141      | Yellow   | NaN      |
If a value in Column A is repeated, sum the corresponding values in Column C and store the sum in a new Column D (for example, there are three rows with 12, so we add the corresponding values 100, 50, and NaN, and store the result 150 in Column D).
If a value in Column A is not repeated, copy the Column C value directly into Column D (as in row 3), except that NaN should become 0 (as in row 4).
Could you please help me to get an output like this in a Python Jupyter notebook:
|   | Column A | Column B | Column C | Column D |
|---|----------|----------|----------|----------|
| 0 | 12       | Apple    | 100      | 150      |
| 1 | 15       | Red      | 23       | 23       |
| 2 | 16       | Blue     | NaN      | 0        |
| 3 | 141      | Yellow   | 199      | 200      |
CodePudding user response:
df.groupby("Column A", as_index=False).agg(B=("Column B", "first"),
                                           C=("Column C", "first"),
                                           D=("Column C", "sum"))
# Column A B C D
# 0 12 Apple 100.0 150.0
# 1 15 Red 23.0 23.0
# 2 16 Blue NaN 0.0
# 3 141 Yellow 199.0 200.0
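A side note on why row 16 comes out as 0.0 rather than NaN: pandas `sum` skips NaN by default, so an all-NaN group sums to 0, which happens to match the requested output. A minimal illustration (if you ever wanted NaN instead, `min_count=1` does that):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])
print(s.sum())             # 0.0 — NaN values are skipped by default
print(s.sum(min_count=1))  # nan — requires at least one non-NaN value
```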
CodePudding user response:
Here is one approach
df['Column D'] = df.groupby('Column A')['Column C'].transform('sum')
df = df.drop_duplicates('Column A')
The `groupby('Column A')` creates a group of rows for each unique value in Column A. Then `['Column C'].transform('sum')` adds up the Column C values for all the rows in that group. This sum gets saved to Column D, and then we drop the duplicates to keep just the first row of each group.
There are some assumptions in this solution: it will group together all the 12 values in Column A even if they aren't directly after each other, which may or may not be what you want.
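If you only want to combine *consecutive* repeats, one common pandas trick (a sketch, not part of the answer above) is to group on a run id built with `shift`/`cumsum`:

```python
import pandas as pd

# Sample data where 12 appears in two separate runs
df = pd.DataFrame({
    'Column A': [12, 12, 15, 12],
    'Column C': [100, 50, 23, 7],
})

# A new run starts whenever the value differs from the previous row
run_id = (df['Column A'] != df['Column A'].shift()).cumsum()

# Sum Column C within each consecutive run, not across the whole column
df['Column D'] = df.groupby(run_id)['Column C'].transform('sum')
# The two leading 12s sum to 150, while the trailing 12 stays 7
```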
CodePudding user response:
df = df.set_index('Column A')
# the per-group sums align back onto the rows via the index
df['Column D'] = df.groupby('Column A')['Column C'].sum()
# de-duplicate on the index itself; dropping duplicates on Column B only
# works here because each group happens to have a distinct Column B value
df = df[~df.index.duplicated(keep='first')]