I'm an amateur user of R and I'm dealing with a data transformation issue.
I want to sum up rows if same ID. This is, only one row per ID will be given as output. Also, I want the rest of the columns summarised depending on the column.
For the column Year, I want to keep the maximum value found on the same ID rows. For the column Event, I want to keep Economical/Natural if only those values are found. However, in the ID 2 both values (Economical and Natural) are found. In this case, I want the summary of the column Event to have the value "Both_present". For the column Wood, I want to keep Na/0/1 depending on which value is found. However, in some IDs, 0 and 1 are found at the same time. When this happens, I would like to keep the value 1. Finally, for the Nature column, I want to mantain Biotic or Abiotic depending on which is found. Also, like in the case of the column Events, if both values are found at the same ID, I want to express this as "Both_present"
Here you can find an example data, and how I would like to have the output:
ID <- c("1", "1", "2", "2", "2", "3", "3", "3",
"4", "5", "5", "6", "6", "6")
Year <- c("2001", "2001", "2008", "2009", "2008", "2005", "2005", "2005",
"2000", "2010", "2010", "2008", "2007", "2006")
Event <- c("Economical", "Economical", "Natural", "Economical", "Natural", "Natural", "Natural", "Natural",
"Economical", "Economical", "Economical", "Natural", "Natural", "Natural")
Wood <- c("NA", "NA", "0", "1", "1", "1", "0", "0",
"1", "1", "1", "1", "1", "1")
Nature <- c("Biotic", "Abiotic", "Biotic", "Biotic", "Abiotic", "Biotic", "Biotic", "Biotic",
"Abiotic", "Abiotic", "Abiotic", "Biotic", "Biotic", "Biotic")
history <- data.frame(ID, Year, Event, Wood, Nature)
ID Year Event Wood Nature
1 1 2001 Economical NA Biotic
2 1 2001 Economical NA Abiotic
3 2 2008 Natural 0 Biotic
4 2 2009 Economical 1 Biotic
5 2 2008 Natural 1 Abiotic
6 3 2005 Natural 1 Biotic
7 3 2005 Natural 0 Biotic
8 3 2005 Natural 0 Biotic
9 4 2000 Economical 1 Abiotic
10 5 2010 Economical 1 Abiotic
11 5 2010 Economical 1 Abiotic
12 6 2008 Natural 1 Biotic
13 6 2007 Natural 1 Biotic
14 6 2006 Natural 1 Biotic
Here is how the output should look like:
ID2 Year2 Event2 Wood2 Nature2
1 1 2001 Economical NA Both_present
2 2 2009 Both_present 1 Both_present
3 3 2005 Natural 1 Biotic
4 4 2000 Economical 1 Abiotic
5 5 2010 Economical 1 Abiotic
6 6 2008 Natural 1 Biotic
CodePudding user response:
data.frame(
ID2=unique(history$ID),
Year2=tapply(as.numeric(history$Year),history$ID,max),
Event2=tapply(history$Event,history$ID,\(x){u=unique(x);if(length(u)==2)u="Both present";u}),
Wood2=tapply(history$Wood,history$ID,\(x){u=unique(x);if(length(u)==2)u=1;u}),
Nature2=tapply(history$Nature,history$ID,\(x){u=unique(x);if(length(u)==2)u="Both present";u})
)
ID2 Year2 Event2 Wood2 Nature2
1 1 2001 Economical NA Both present
2 2 2009 Both present 1 Both present
3 3 2005 Natural 1 Biotic
4 4 2000 Economical 1 Abiotic
5 5 2010 Economical 1 Abiotic
6 6 2008 Natural 1 Biotic
You can also use aggregate
, even though in some cases it's a lot slower than tapply
, split.data.frame
, or split
:
m=matrix(rnorm(1e7),1e3)
v=round(runif(1e3)*1e2)
df=as.data.frame(m)
microbenchmark(times=10,
split.data.frame=sapply(split.data.frame(m,v),colMeans),
split.data.frame_for_dataframe=sapply(split.data.frame(df,v),colMeans),
aggregate={a=aggregate(m,list(v),mean);`rownames<-`(a[,-1],a[,1])},
split=t(sapply(split(1:nrow(m),v),\(x)colMeans(m[x,,drop=F]))),
tapply={ncol=ncol(m);o=tapply(asplit(m,1),v,\(x)colMeans(matrix(unlist(x),,ncol)));`rownames<-`(matrix(unlist(o),length(o)),names(o))}
)
expr min lq mean median uq max neval
split.data.frame 151.1025 152.6566 172.6936 154.4840 197.8200 221.5134 10
split.data.frame_for_dataframe 9247.5892 9430.0372 9535.6191 9570.4078 9638.5558 9796.0167 10
aggregate 10573.4796 10859.2785 10955.4854 11006.6462 11082.8503 11163.1835 10
split 157.1666 157.7744 164.2297 159.7609 163.4930 200.6617 10
tapply 529.2891 564.2083 614.6336 573.8881 678.4606 800.3653 10
CodePudding user response:
Here is another approach in tidyverse
. Let me know if this contains the correct logic you need.
library(tidyverse)
history |>
group_by(ID) |>
summarise(
Year = max(Year),
Event = ifelse(all(c("Economical", "Natural") %in% Event), "Both_present", Event),
Wood = ifelse(any(Wood == "1"), "1", Wood),
Nature = ifelse(all(c("Abiotic", "Biotic") %in% Nature), "Both_present", Nature)
)
Output
ID Year Event Wood Nature
<chr> <chr> <chr> <chr> <chr>
1 1 2001 Economical NA Both_present
2 2 2009 Both_present 1 Both_present
3 3 2005 Natural 1 Biotic
4 4 2000 Economical 1 Abiotic
5 5 2010 Economical 1 Abiotic
6 6 2008 Natural 1 Biotic