Home > Enterprise >  Which function summarises rows if shared value, depending on the column?
Which function summarises rows if shared value, depending on the column?

Time:08-25

I'm an amateur user of R and I'm dealing with a data transformation issue.

I want to sum up rows if same ID. This is, only one row per ID will be given as output. Also, I want the rest of the columns summarised depending on the column.

For the column Year, I want to keep the maximum value found on the same ID rows. For the column Event, I want to keep Economical/Natural if only those values are found. However, in the ID 2 both values (Economical and Natural) are found. In this case, I want the summary of the column Event to have the value "Both_present". For the column Wood, I want to keep Na/0/1 depending on which value is found. However, in some IDs, 0 and 1 are found at the same time. When this happens, I would like to keep the value 1. Finally, for the Nature column, I want to mantain Biotic or Abiotic depending on which is found. Also, like in the case of the column Events, if both values are found at the same ID, I want to express this as "Both_present"

Here you can find an example data, and how I would like to have the output:

ID <- c("1", "1", "2", "2", "2", "3", "3", "3",
        "4", "5", "5", "6", "6", "6")

Year <- c("2001", "2001", "2008", "2009", "2008", "2005", "2005", "2005",
          "2000", "2010", "2010", "2008", "2007", "2006")

Event <- c("Economical", "Economical", "Natural", "Economical", "Natural", "Natural", "Natural", "Natural",
           "Economical", "Economical", "Economical", "Natural", "Natural", "Natural")

Wood <- c("NA", "NA", "0", "1", "1", "1", "0", "0",
          "1", "1", "1", "1", "1", "1")

Nature <- c("Biotic", "Abiotic", "Biotic", "Biotic", "Abiotic", "Biotic", "Biotic", "Biotic",
           "Abiotic", "Abiotic", "Abiotic", "Biotic", "Biotic", "Biotic")

history <- data.frame(ID, Year, Event, Wood, Nature)


ID Year      Event Wood  Nature
1   1 2001 Economical   NA  Biotic
2   1 2001 Economical   NA Abiotic
3   2 2008    Natural    0  Biotic
4   2 2009 Economical    1  Biotic
5   2 2008    Natural    1 Abiotic
6   3 2005    Natural    1  Biotic
7   3 2005    Natural    0  Biotic
8   3 2005    Natural    0  Biotic
9   4 2000 Economical    1 Abiotic
10  5 2010 Economical    1 Abiotic
11  5 2010 Economical    1 Abiotic
12  6 2008    Natural    1  Biotic
13  6 2007    Natural    1  Biotic
14  6 2006    Natural    1  Biotic

Here is how the output should look like:

ID2 Year2       Event2 Wood2      Nature2
1   1  2001   Economical    NA Both_present
2   2  2009 Both_present     1 Both_present
3   3  2005      Natural     1       Biotic
4   4  2000   Economical     1      Abiotic
5   5  2010   Economical     1      Abiotic
6   6  2008      Natural     1       Biotic

CodePudding user response:

data.frame(
  ID2=unique(history$ID),
  Year2=tapply(as.numeric(history$Year),history$ID,max),
  Event2=tapply(history$Event,history$ID,\(x){u=unique(x);if(length(u)==2)u="Both present";u}),
  Wood2=tapply(history$Wood,history$ID,\(x){u=unique(x);if(length(u)==2)u=1;u}),
  Nature2=tapply(history$Nature,history$ID,\(x){u=unique(x);if(length(u)==2)u="Both present";u})
)

  ID2 Year2       Event2 Wood2      Nature2
1   1  2001   Economical    NA Both present
2   2  2009 Both present     1 Both present
3   3  2005      Natural     1       Biotic
4   4  2000   Economical     1      Abiotic
5   5  2010   Economical     1      Abiotic
6   6  2008      Natural     1       Biotic

You can also use aggregate, even though in some cases it's a lot slower than tapply, split.data.frame, or split:

m=matrix(rnorm(1e7),1e3)
v=round(runif(1e3)*1e2)
df=as.data.frame(m)

microbenchmark(times=10,
  split.data.frame=sapply(split.data.frame(m,v),colMeans),
  split.data.frame_for_dataframe=sapply(split.data.frame(df,v),colMeans),
  aggregate={a=aggregate(m,list(v),mean);`rownames<-`(a[,-1],a[,1])},
  split=t(sapply(split(1:nrow(m),v),\(x)colMeans(m[x,,drop=F]))),
  tapply={ncol=ncol(m);o=tapply(asplit(m,1),v,\(x)colMeans(matrix(unlist(x),,ncol)));`rownames<-`(matrix(unlist(o),length(o)),names(o))}
)

                           expr        min         lq       mean     median         uq        max neval
               split.data.frame   151.1025   152.6566   172.6936   154.4840   197.8200   221.5134    10
 split.data.frame_for_dataframe  9247.5892  9430.0372  9535.6191  9570.4078  9638.5558  9796.0167    10
                      aggregate 10573.4796 10859.2785 10955.4854 11006.6462 11082.8503 11163.1835    10
                          split   157.1666   157.7744   164.2297   159.7609   163.4930   200.6617    10
                         tapply   529.2891   564.2083   614.6336   573.8881   678.4606   800.3653    10

CodePudding user response:

Here is another approach in tidyverse. Let me know if this contains the correct logic you need.

library(tidyverse)

history |>
  group_by(ID) |>
  summarise(
    Year = max(Year),
    Event = ifelse(all(c("Economical", "Natural") %in% Event), "Both_present", Event),
    Wood = ifelse(any(Wood == "1"), "1", Wood),
    Nature = ifelse(all(c("Abiotic", "Biotic") %in% Nature), "Both_present", Nature)
  )

Output

  ID    Year  Event        Wood  Nature      
  <chr> <chr> <chr>        <chr> <chr>       
1 1     2001  Economical   NA    Both_present
2 2     2009  Both_present 1     Both_present
3 3     2005  Natural      1     Biotic      
4 4     2000  Economical   1     Abiotic     
5 5     2010  Economical   1     Abiotic     
6 6     2008  Natural      1     Biotic 
  • Related