Home > database >  Combining multiple observations into one column in R
Combining multiple observations into one column in R

Time:01-24

I'm working with a Covid19 dataset in which each row contains Covid data (tests, positives, negatives, deaths, etc...) for a particular day in a particular country. This means there are multiple rows for each day since every country's data gets its own row. I'm trying to generalize the data to only include one row per day per continent. Is there a simple way to sum all columns where the date is the same?

For example, I'd want to go from a table like this...

Date Continent Country Positives
2020-02-05 Europe United Kingdom 10
2020-02-05 Europe Poland 5
2020-02-05 Europe Sweden 0
2020-02-06 Europe United Kingdom 12
2020-02-05 Europe Poland 7
2020-02-05 Europe Sweden 1

to one like this...

Date Continent Positives
2020-02-05 Europe 15
2020-02-06 Europe 20

The closest I've gotten is

covid19EU <- covid19 %>% filter(Continent_Name == "Europe") %>% group_by(Date) %>% summarise_all(max)

but this returns the highest value instead of summing the value over all observations for the same date, ie

Date Continent Country Positives
2020-02-05 Europe United Kingdom 10
2020-02-06 Europe United Kingdom 12

CodePudding user response:

covid19 %>%
  group_by(Date, Continent) %>%
  summarize(across(where(is.numeric), sum))

Result

  Date       Continent Positives Something_else
  <chr>      <chr>         <int>          <int>
1 2020-02-05 Europe           15              6
2 2020-02-06 Europe           20             15

If there's only one column of data to sum, you can use the count(wt = ...) shortcut:

covid19 %>% 
  count(Date, Continent, wt = Positives, name = "Positives")

#        Date Continent Positives
#1 2020-02-05    Europe        15
#2 2020-02-06    Europe        20

Example data with another column of numeric data to sum (possibly implied by OP question "Is there a simple way to sum all columns where the date is the same?")

covid19 <- data.frame(
  stringsAsFactors = FALSE,
              Date = c("2020-02-05","2020-02-05", "2020-02-05",
                       "2020-02-06","2020-02-06","2020-02-06"), 
                      # NOTE UPDATED DATES VS OP
         Continent = c("Europe","Europe","Europe",
                       "Europe","Europe","Europe"),
           Country = c("United Kingdom","Poland",
                       "Sweden","United Kingdom","Poland","Sweden"),
         Positives = c(10L, 5L, 0L, 12L, 7L, 1L),
         Something_else = 1:6
)

CodePudding user response:

in base R:

aggregate(Positives~Date Continent,covid19, sum)

        Date Continent Positives
1 2020-02-05    Europe        15
2 2020-02-06    Europe        20
  • Related