Cumulative sum of unique values based on multiple criteria-CodePudding

I've got a df with multiple columns containing information of species sightings over the years in different sites, therefore each year might show multiple records. I would like to filter my df and calculate some operations based on certain columns, but I'd like to keep all columns for further analyses. I had some previous code using summarise but as I would like to keep all columns I was trying to avoid using it.

Let's say the columns I'm interested to work with at the moment are as follows:

df <- data.frame("Country" = LETTERS[1:5], "Site"=LETTERS[6:10], "species"=1:5, "Year"=1981:2010)

I would like to calculate:

1- The cumulative sum of the records in which a species has been documented within each site creating a new column "Spsum". 2- The number of different years that each species has been seen on a particular site, this could be done as cumulative sum as well, on a new column "nYear".

For example, if species 1 has been recorded 5 times in 1981, and 2 times in 1982 in Site G, Spsum would show 7 (cumulative sum of records) whereas nYear would show 2 as it was spotted over two different years. So far I've got this, but nYear is displaying 0s as a result.

 Df1 <- df %>%
    filter(Year>1980)%>%
    group_by(Country, Site, Species, Year) %>%
    mutate(nYear = n_distinct(Year[Species %in% Site]))%>%
    ungroup()

Thanks!

CodePudding user response：

This cound help, without the need for a join.

df %>% arrange(Country, Site, species, Year) %>% 
  filter(Year>1980) %>%
  group_by(Site, species) %>%
  mutate(nYear = length(unique(Year))) %>% 
  mutate(spsum = rowid(species))

# A tibble: 30 x 6
# Groups:   Site, species [5]
   Country Site  species  Year nYear spsum
   <chr>   <chr>   <int> <int> <int> <int>
 1 A       F           1  1981     6     1
 2 A       F           1  1986     6     2
 3 A       F           1  1991     6     3
 4 A       F           1  1996     6     4
 5 A       F           1  2001     6     5
 6 A       F           1  2006     6     6
 7 B       G           2  1982     6     1
 8 B       G           2  1987     6     2
 9 B       G           2  1992     6     3
10 B       G           2  1997     6     4
# ... with 20 more rows

CodePudding user response：

If the table contains multiple records per Country Site species Year combination, I would first aggregate those and then calculate the cumulative counts from that. The counts can then be joined back to the original table.

Something along these lines:

cumulative_counts <- df %>%
  count(Country, Site, species, Year) %>%
  group_by(Country, Site, species) %>%
  arrange(Year) %>%
  mutate(Spsum = cumsum(n), nYear = row_number())

df %>%
  left_join(cumulative_counts)