Home > Blockchain >  How to remove extra values from the result of aggregate in R
How to remove extra values from the result of aggregate in R

Time:03-20

I have this data set.

##   X    ID sector meters Dist Oct ResFromMo ResFromYr ResToMo ResToYr YRLIVED
## 1 1 20100    H38   6400 6400   W         8      1979       5    1991      12
## 2 2 20101    B01   1600 1600  NW         5      1980       4    1991      NA
## 3 3 20102    H04   1600 1600  SW         6      1972       8    1979      NA
## 4 4 20103    B09   3200 3200  NE         1      1982       1    1984      NA
## 5 5 20103    B37   8000 8000   N         1      1984       4    1986       2
##      ifDate ExpFromYr ExpToYr ExpYrs Yr1951 Yr1952 Yr1953 Yr1954 Yr1955 Yr1956
## 1 fr51 > 88      1979    1988      9     NA     NA     NA     NA     NA     NA
## 2 fr51 > 88      1980    1988      8     NA     NA     NA     NA     NA     NA
## 3   between      1972    1979      7     NA     NA     NA     NA     NA     NA
## 4   between      1982    1984      2     NA     NA     NA     NA     NA     NA
## 5   between      1984    1986      2     NA     NA     NA     NA     NA     NA
##   Yr1957 Yr1958 Yr1959 Yr1960 Yr1961 Yr1962 Yr1963 Yr1964 Yr1965 Yr1966 Yr1967
## 1     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 2     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 3     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 4     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 5     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
##   Yr1968 Yr1969 Yr1970 Yr1971   Yr1972  Yr1973  Yr1974  Yr1975  Yr1976  Yr1977
## 1     NA     NA     NA     NA       NA      NA      NA      NA      NA      NA
## 2     NA     NA     NA     NA       NA      NA      NA      NA      NA      NA
## 3     NA     NA     NA     NA 34.06072 54.2693 54.2693 54.2693 54.2693 54.2693
## 4     NA     NA     NA     NA       NA      NA      NA      NA      NA      NA
## 5     NA     NA     NA     NA       NA      NA      NA      NA      NA      NA
##    Yr1978    Yr1979   Yr1980   Yr1981    Yr1982    Yr1983   Yr1984   Yr1985
## 1      NA  6.835951 4.340588 4.340588  4.340588  4.340588 4.340588 4.340588
## 2      NA        NA 3.160397 4.447229  4.447229  4.447229 4.447229 4.447229
## 3 54.2693 22.727323       NA       NA        NA        NA       NA       NA
## 4      NA        NA       NA       NA 16.018407 16.018407 1.305989       NA
## 5      NA        NA       NA       NA        NA        NA 3.052618 3.052618
##      Yr1986   Yr1987   Yr1988 arth_mean    median cumulative  Birthdate
## 1 4.3405881 4.340588 1.083782  4.264444  4.340588  42.644438  8/24/1979
## 2 4.4472289 4.447229 1.110409  3.933490  4.447229  35.401408   5/5/1980
## 3        NA       NA       NA 47.800478 54.269298 382.403825  3/17/1979
## 4        NA       NA       NA 11.114267 16.018407  33.342802 11/30/1981
## 5 0.8170605       NA       NA  2.307432  3.052618   6.922297 11/30/1981

I am working to aggregate the data down to remove to duplicate ID values and have gotten this far.

Fernald_New1 <- cbind(aggregate(cumulative ~ ID, Fernald, sum), aggregate(Fernald[,c("sector", "meters","Dist", "Oct", "ResFromMo", "ResFromYr", "ResToMo", "ResToYr", "YRLIVED", "ifDate", "ExpFromYr", "ExpToYr", "ExpYrs", "Yr1951", "Yr1952", "Yr1953", "Yr1954", "Yr1955", "Yr1956", "Yr1957", "Yr1958", "Yr1959", "Yr1960", "Yr1961", "Yr1962", "Yr1963", "Yr1964", "Yr1965", "Yr1966", "Yr1967", "Yr1968", "Yr1969", "Yr1970", "Yr1971", "Yr1972", "Yr1973", "Yr1974", "Yr1975", "Yr1976", "Yr1977", "Yr1978", "Yr1979", "Yr1980", "Yr1981", "Yr1982", "Yr1983", "Yr1984", "Yr1985", "Yr1986", "Yr1987", "Yr1988", "Birthdate")], by=list(Fernald$ID), c))[,-3] 

This results in the following:

head(Fernald_New1, n = 6)
##      ID cumulative   sector     meters       Dist    Oct ResFromMo  ResFromYr
## 1 20100   42.64444      H38       6400       6400      W         8       1979
## 2 20101   35.40141      B01       1600       1600     NW         5       1980
## 3 20102  382.40382      H04       1600       1600     SW         6       1972
## 4 20103   40.26510 B09, B37 3200, 8000 3200, 8000  NE, N      1, 1 1982, 1984
## 5 20104   40.26510 B09, B37 3200, 8000 3200, 8000  NE, N      1, 1 1982, 1984
## 6 20105  119.16543 B09, B09 3200, 3200 3200, 3200 NE, NE     12, 8 1980, 1983
##   ResToMo    ResToYr YRLIVED             ifDate  ExpFromYr    ExpToYr ExpYrs
## 1       5       1991      12          fr51 > 88       1979       1988      9
## 2       4       1991      NA          fr51 > 88       1980       1988      8
## 3       8       1979      NA            between       1972       1979      7
## 4    1, 4 1984, 1986   NA, 2   between, between 1982, 1984 1984, 1986   2, 2
## 5    1, 4 1984, 1986   NA, 2   between, between 1982, 1984 1984, 1986   2, 2
## 6    8, 3 1983, 1991  NA, NA between, fr51 > 88 1980, 1983 1983, 1988   3, 5
##   Yr1951 Yr1952 Yr1953 Yr1954 Yr1955 Yr1956 Yr1957 Yr1958 Yr1959 Yr1960 Yr1961
## 1     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 2     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 3     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 4 NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA
## 5 NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA
## 6 NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA
##   Yr1962 Yr1963 Yr1964 Yr1965 Yr1966 Yr1967 Yr1968 Yr1969 Yr1970 Yr1971
## 1     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 2     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 3     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 4 NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA
## 5 NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA
## 6 NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA NA, NA
##     Yr1972  Yr1973  Yr1974  Yr1975  Yr1976  Yr1977  Yr1978   Yr1979
## 1       NA      NA      NA      NA      NA      NA      NA 6.835951
## 2       NA      NA      NA      NA      NA      NA      NA       NA
## 3 34.06072 54.2693 54.2693 54.2693 54.2693 54.2693 54.2693 22.72732
## 4   NA, NA  NA, NA  NA, NA  NA, NA  NA, NA  NA, NA  NA, NA   NA, NA
## 5   NA, NA  NA, NA  NA, NA  NA, NA  NA, NA  NA, NA  NA, NA   NA, NA
## 6   NA, NA  NA, NA  NA, NA  NA, NA  NA, NA  NA, NA  NA, NA   NA, NA
##         Yr1980       Yr1981       Yr1982              Yr1983             Yr1984
## 1     4.340588     4.340588     4.340588            4.340588           4.340588
## 2     3.160397     4.447229     4.447229            4.447229           4.447229
## 3           NA           NA           NA                  NA                 NA
## 4       NA, NA       NA, NA 16.01841, NA        16.01841, NA 1.305989, 3.052618
## 5       NA, NA       NA, NA 16.01841, NA        16.01841, NA 1.305989, 3.052618
## 6 1.516534, NA 16.01841, NA 16.01841, NA 10.712448, 6.826448       NA, 16.01841
##         Yr1985        Yr1986       Yr1987       Yr1988              Birthdate
## 1     4.340588      4.340588     4.340588     1.083782              8/24/1979
## 2     4.447229      4.447229     4.447229     1.110409               5/5/1980
## 3           NA            NA           NA           NA              3/17/1979
## 4 NA, 3.052618 NA, 0.8170605       NA, NA       NA, NA 11/30/1981, 11/30/1981
## 5 NA, 3.052618 NA, 0.8170605       NA, NA       NA, NA     9/1/1978, 9/1/1978
## 6 NA, 16.01841  NA, 16.01841 NA, 16.01841 NA, 3.999565   12/3/1980, 12/3/1980

This works just fine, but I now need to remove the older year in the "ResToYr" column. For example, in row 4 of the output, the result is 1984, 1986. I only need to keep the second value for each column (in this case 1986). I have over 9,000 rows in my full dataset, so how would I do this?

CodePudding user response:

It is a list column as we are concatenating the values in the second aggregate. So we can loop over the list with sapply and get the last value with tail

Fernald_New1$ResToYr <- sapply(Fernald_New1$ResToYr, tail, 1)

With tidyverse, we have more flexibility to do multiple aggregations on separate blocks of columns by group

library(dplyr)
Fernald %>%
   group_by(ID) %>%
   summarise(cumulative = sum(cumulative, na.rm = TRUE),
             across(c("sector", "meters","Dist", "Oct", "ResFromMo",
            "ResFromYr", "ResToMo", 
      "YRLIVED", "ifDate", "ExpFromYr", "ExpToYr", "ExpYrs", 
          matches("^Yr\\d{4}$"), "Birthdate"), ~ list(.x)),
               ResToYr = last(ResToYr))
  • Related