How to sum duplicate rows with one specific column while keeping all other columns-CodePudding

I have over 14,000 rows of data and around 56 columns. One column is labeled as "ID." Some IDS (such as 20103) have duplicates since a person may have multiple entries. I wanted to combine these duplicate IDs by summing the cumulative column, but leaving the other columns intact. This is my data below. I modified it to remove the 14,000 rows and 56 columns for simplicity sake.

My data

This is the code I used with the aggregate function...

Data_Aggregate <- aggregate(cumulative ~ ID, Data, sum)

This works just fine and produces what I need. There is only one of each ID number and the cumulative rows have been summed. However, I am unsure of how to do this and leave the other columns intact. I tried using:

Fernald_Aggregate <- aggregate(cumulative ~ UC_ID   Oct, Fernald, sum)

to see what would happen and it automatically sums the cumulative by the Oct column, so there are duplicate IDs again, which is not what I need.

CodePudding user response：

You can cbind 2 aggregates. The first does the summation of one column, the second makes sure you don't lose data in the others. [,-3] deletes the duplicated group column (id) from the second aggregate.

cbind(aggregate( cumulative ~ id, df, sum ), 
  aggregate( df[,c("meter","dist")], by=list(df$id), c ))[,-3]
     id cumulative meter   dist
1 20010        345     1     10
2 20020       1323  2, 3 20, 30
3 20030        879     4     40
4 20035         23     5     50
5 20036         34     6     60
6 20040        157  7, 8 70, 80

toy data

df <- structure(list(id = c(20010, 20020, 20020, 20030, 20035, 20036, 
20040, 20040), meter = c(1, 2, 3, 4, 5, 6, 7, 8), dist = c(10, 
20, 30, 40, 50, 60, 70, 80), cumulative = c(345, 456, 867, 879, 
23, 34, 23, 134)), class = "data.frame", row.names = c(NA, -8L
))

df
     id meter dist cumulative
1 20010     1   10        345
2 20020     2   20        456
3 20020     3   30        867
4 20030     4   40        879
5 20035     5   50         23
6 20036     6   60         34
7 20040     7   70         23
8 20040     8   80        134

CodePudding user response：

You may use ave with FUN=sum to add the values of a given column by group, and remove duplicatets after. Consider following example, where "X3" is the cumulative column.

df
#   ID   X1   X2   X3   X4
# 1  1 0.91 0.74 0.93 0.47
# 2  2 0.94 0.13 0.26 0.56
# 3  2 0.94 0.13 0.99 0.56
# 4  3 0.29 0.66 0.46 0.90
# 5  4 0.83 0.71 0.94 0.14
# 6  4 0.83 0.71 0.99 0.14
# 7  5 0.64 0.46 0.98 0.99

(df <- transform(df, X3=ave(X3, ID, FUN=sum)))
#   ID   X1   X2   X3   X4
# 1  1 0.91 0.74 0.93 0.47
# 2  2 0.94 0.13 5.00 0.56
# 3  2 0.94 0.13 5.00 0.56
# 4  3 0.29 0.66 0.46 0.90
# 5  4 0.83 0.71 7.72 0.14
# 6  4 0.83 0.71 7.72 0.14
# 7  5 0.64 0.46 0.98 0.99

(df <- df[!duplicated(df[-c(1, 4)]), ])
#   ID   X1   X2   X3   X4
# 1  1 0.91 0.74 0.93 0.47
# 2  2 0.94 0.13 1.25 0.56
# 4  3 0.29 0.66 0.46 0.90
# 5  4 0.83 0.71 1.93 0.14
# 7  5 0.64 0.46 0.98 0.99

Maybe you need FUN=function(x) sum(s, na.rm=TRUE) with your real data.

Data:

df <- structure(list(ID = c(1L, 2L, 2L, 3L, 4L, 4L, 5L), X1 = c(0.91, 
0.94, 0.94, 0.29, 0.83, 0.83, 0.64), X2 = c(0.74, 0.13, 0.13, 
0.66, 0.71, 0.71, 0.46), X3 = c(0.93, 0.26, 0.99, 0.46, 0.94, 
0.99, 0.98), X4 = c(0.47, 0.56, 0.56, 0.9, 0.14, 0.14, 0.99)), row.names = c(NA, 
-7L), class = "data.frame")