I am very new to the Python pandas module.
Suppose I have a DataFrame as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [12, 12, 12, 15, 16, 141, 141, 141, 141],
    'Column B': ['Apple', 'Apple', 'Apple', 'Red', 'Blue', 'Yellow', 'Yellow', 'Yellow', 'Yellow'],
    'Column C': [100, 50, np.nan, 23, np.nan, 199, np.nan, 1, np.nan]
})
or I have a data table as follows:
|   | Column A | Column B | Column C |
|---|----------|----------|----------|
| 0 | 12       | Apple    | 100      |
| 1 | 12       | Apple    | 50       |
| 2 | 12       | Apple    | NaN      |
| 3 | 15       | Red      | 23       |
| 4 | 16       | Blue     | NaN      |
| 5 | 141      | Yellow   | 199      |
| 6 | 141      | Yellow   | NaN      |
| 7 | 141      | Yellow   | 1        |
| 8 | 141      | Yellow   | NaN      |
If a value in Column A is repeated, sum the corresponding values in Column C and store the sum in a new Column D (for example, there are three rows with 12, so we add the corresponding values 100, 50, and NaN, and store the result 150 in Column D).
If a value in Column A is not repeated, copy the Column C value directly into Column D (as in row 3), except that NaN should become 0 (as in row 4).
Could you please help me to get an output like this in a Python Jupyter notebook:
|   | Column A | Column B | Column C | Column D |
|---|----------|----------|----------|----------|
| 0 | 12       | Apple    | 100      | 150      |
| 1 | 15       | Red      | 23       | 23       |
| 2 | 16       | Blue     | NaN      | 0        |
| 3 | 141      | Yellow   | 199      | 200      |
CodePudding user response:
df.groupby("Column A", as_index=False).agg(B=("Column B", "first"),
                                           C=("Column C", "first"),
                                           D=("Column C", "sum"))
# Column A B C D
# 0 12 Apple 100.0 150.0
# 1 15 Red 23.0 23.0
# 2 16 Blue NaN 0.0
# 3 141 Yellow 199.0 200.0
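A side note on why row 16 comes out as 0.0 rather than NaN: pandas `sum` skips NaN by default, so an all-NaN group sums to 0, which happens to match the requested output. A minimal illustration (if you ever wanted NaN instead, `min_count=1` does that):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])
print(s.sum())             # 0.0 — NaN values are skipped by default
print(s.sum(min_count=1))  # nan — requires at least one non-NaN value
```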
CodePudding user response:
Here is one approach
df['Column D'] = df.groupby('Column A')['Column C'].transform('sum')
df = df.drop_duplicates('Column A')
The `groupby('Column A')` creates a group of rows for each unique value in Column A. Then `['Column C'].transform('sum')` adds up the Column C values for all the rows in that group. This sum gets saved to Column D, and then we drop the duplicates to keep just the first row of each group.
There are some assumptions in this solution: it will group together all the 12 values in Column A even if they aren't directly after each other, which may or may not be what you want.
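If you only want to combine *consecutive* repeats, one common pandas trick (a sketch, not part of the answer above) is to group on a run id built with `shift`/`cumsum`:

```python
import pandas as pd

# Sample data where 12 appears in two separate runs
df = pd.DataFrame({
    'Column A': [12, 12, 15, 12],
    'Column C': [100, 50, 23, 7],
})

# A new run starts whenever the value differs from the previous row
run_id = (df['Column A'] != df['Column A'].shift()).cumsum()

# Sum Column C within each consecutive run, not across the whole column
df['Column D'] = df.groupby(run_id)['Column C'].transform('sum')
# The two leading 12s sum to 150, while the trailing 12 stays 7
```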
CodePudding user response:
df = df.set_index('Column A')
# the per-group sums align back onto the rows via the index
df['Column D'] = df.groupby('Column A')['Column C'].sum()
# de-duplicate on the index itself; dropping duplicates on Column B only
# works here because each group happens to have a distinct Column B value
df = df[~df.index.duplicated(keep='first')]