I have over 14,000 rows of data and around 56 columns. One column is labeled as "ID." Some IDS (such as 20103) have duplicates since a person may have multiple entries. I wanted to combine these duplicate IDs by summing the cumulative column, but leaving the other columns intact. This is my data below. I modified it to remove the 14,000 rows and 56 columns for simplicity sake.
This is the code I used with the aggregate function...
Data_Aggregate <- aggregate(cumulative ~ ID, Data, sum)
This works just fine and produces what I need. There is only one of each ID number and the cumulative rows have been summed. However, I am unsure of how to do this and leave the other columns intact. I tried using:
Fernald_Aggregate <- aggregate(cumulative ~ UC_ID Oct, Fernald, sum)
to see what would happen and it automatically sums the cumulative by the Oct column, so there are duplicate IDs again, which is not what I need.
CodePudding user response:
You can cbind
2 aggregate
s. The first does the summation of one column, the second makes sure you don't lose data in the others. [,-3]
deletes the duplicated group column (id) from the second aggregate.
cbind(aggregate( cumulative ~ id, df, sum ),
aggregate( df[,c("meter","dist")], by=list(df$id), c ))[,-3]
id cumulative meter dist
1 20010 345 1 10
2 20020 1323 2, 3 20, 30
3 20030 879 4 40
4 20035 23 5 50
5 20036 34 6 60
6 20040 157 7, 8 70, 80
toy data
df <- structure(list(id = c(20010, 20020, 20020, 20030, 20035, 20036,
20040, 20040), meter = c(1, 2, 3, 4, 5, 6, 7, 8), dist = c(10,
20, 30, 40, 50, 60, 70, 80), cumulative = c(345, 456, 867, 879,
23, 34, 23, 134)), class = "data.frame", row.names = c(NA, -8L
))
df
id meter dist cumulative
1 20010 1 10 345
2 20020 2 20 456
3 20020 3 30 867
4 20030 4 40 879
5 20035 5 50 23
6 20036 6 60 34
7 20040 7 70 23
8 20040 8 80 134
CodePudding user response:
You may use ave
with FUN=sum
to add the values of a given column by group, and remove duplicatet
s after. Consider following example, where "X3"
is the cumulative column.
df
# ID X1 X2 X3 X4
# 1 1 0.91 0.74 0.93 0.47
# 2 2 0.94 0.13 0.26 0.56
# 3 2 0.94 0.13 0.99 0.56
# 4 3 0.29 0.66 0.46 0.90
# 5 4 0.83 0.71 0.94 0.14
# 6 4 0.83 0.71 0.99 0.14
# 7 5 0.64 0.46 0.98 0.99
(df <- transform(df, X3=ave(X3, ID, FUN=sum)))
# ID X1 X2 X3 X4
# 1 1 0.91 0.74 0.93 0.47
# 2 2 0.94 0.13 5.00 0.56
# 3 2 0.94 0.13 5.00 0.56
# 4 3 0.29 0.66 0.46 0.90
# 5 4 0.83 0.71 7.72 0.14
# 6 4 0.83 0.71 7.72 0.14
# 7 5 0.64 0.46 0.98 0.99
(df <- df[!duplicated(df[-c(1, 4)]), ])
# ID X1 X2 X3 X4
# 1 1 0.91 0.74 0.93 0.47
# 2 2 0.94 0.13 1.25 0.56
# 4 3 0.29 0.66 0.46 0.90
# 5 4 0.83 0.71 1.93 0.14
# 7 5 0.64 0.46 0.98 0.99
Maybe you need FUN=function(x) sum(s, na.rm=TRUE)
with your real data.
Data:
df <- structure(list(ID = c(1L, 2L, 2L, 3L, 4L, 4L, 5L), X1 = c(0.91,
0.94, 0.94, 0.29, 0.83, 0.83, 0.64), X2 = c(0.74, 0.13, 0.13,
0.66, 0.71, 0.71, 0.46), X3 = c(0.93, 0.26, 0.99, 0.46, 0.94,
0.99, 0.98), X4 = c(0.47, 0.56, 0.56, 0.9, 0.14, 0.14, 0.99)), row.names = c(NA,
-7L), class = "data.frame")