For each observation, there are three volume values, volume_1, volume_2, and volume_3. Sometimes volume_1 is the biggest while sometimes volume_3 is the biggest. Although I feel like this code should be easy, how do I calculate the overall percentage of the time when volume_1 is the biggest, volume_2 is the biggest, and then volume_3 is the biggest.
Here is a little bit of my data
structure(list(PVC = c("29A", "2D5", "2HX", "38A", "3CN", "6021051"
), Age = c(6, 5, 6, 10, 5, 7), volume_1 = c(59.44244884, 51.69518257,
63.17950819, 56.4269955, 64.05189184, 61.82983473), volume_2 = c(54.74897726,
56.64778447, 51.86880673, 59.71146472, 58.96633234, 63.39471043
), volume_3 = c(58.97596791, 51.75711362, 61.36142512, 57.09629745,
64.25164825, 63.19407463), Year = c("2003", "2002", "2003", "2008",
"2003", "1994"), averageAB = c(57.09571305, 54.17148352, 57.52415746,
58.06923011, 61.50911209, 62.61227258), diff_AB_C = c(-1.88025486,
2.4143699, -3.83726766, 0.972932659999998, -2.74253616, -0.58180205
), sumAB = c(114.1914261, 108.34296704, 115.04831492, 116.13846022,
123.01822418, 125.22454516), ratioAB_C = c(1.93623657477333,
2.09329615703558, 1.87492899154491, 2.03408041163622, 1.91463141461122,
1.98158681637776)), row.names = c(NA, 6L), class = "data.frame")
Thank you as always for your help!!
CodePudding user response:
This is pretty easily done using the max.col()
function:
df$highest_col <- max.col(df[,3:5])
head(df)
PVC Age volume_1 volume_2 volume_3 Year averageAB
1 29A 6 59.44245 54.74898 58.97597 2003 57.09571
2 2D5 5 51.69518 56.64778 51.75711 2002 54.17148
3 2HX 6 63.17951 51.86881 61.36143 2003 57.52416
4 38A 10 56.42700 59.71146 57.09630 2008 58.06923
5 3CN 5 64.05189 58.96633 64.25165 2003 61.50911
6 6021051 7 61.82983 63.39471 63.19407 1994 62.61227
diff_AB_C sumAB ratioAB_C highest_col
1 -1.8802549 114.1914 1.936237 1
2 2.4143699 108.3430 2.093296 2
3 -3.8372677 115.0483 1.874929 1
4 0.9729327 116.1385 2.034080 2
5 -2.7425362 123.0182 1.914631 3
6 -0.5818020 125.2245 1.981587 2
Then you can just run table()
on that new column to get the counts.
CodePudding user response:
There may be an easier way but using tidyverse I'd go
df %>%
rowwise() %>%
mutate(max = case_when(max(volume_1, volume_2, volume_3) == volume_1 ~ "volume_1",
max(volume_1, volume_2, volume_3) == volume_2 ~ "volume_2",
max(volume_1, volume_2, volume_3) == volume_3 ~ "volume_3") %>%
ungroup() %>%
count(max)
CodePudding user response:
data.table
solution.
It finds the name of the column (it's name starting with "volums") thet contains the maximum value of each row. This is stord in te maxCol column.
After that,m using table
and prop.table
to get the percentages.
library(data.table)
setDT(mydata)
mydata[, maxCol := names(.SD)[max.col(.SD)], .SDcols = patterns("^volume")][]
prop.table(table(mydata$maxCol))
# volume_1 volume_2 volume_3
# 0.3333333 0.5000000 0.1666667
of in a oneliner
prop.table(table(setDT(mydata)[, .(col = names(.SD)[max.col(.SD)]), .SDcols = patterns("^volume")]))
CodePudding user response:
vol <- df %>% select(starts_with("volume"))
vol$max <- colnames(vol)[max.col(vol)]
vol <-
vol%>%group_by(max)%>%
mutate(rep = n(),
prop = 100* rep/nrow(df)) %>%
ungroup()
left_join(df,vol)
Joining, by = c("volume_1", "volume_2", "volume_3")
PVC Age volume_1 volume_2 volume_3 Year averageAB diff_AB_C sumAB ratioAB_C
1 29A 6 59.44245 54.74898 58.97597 2003 57.09571 -1.8802549 114.1914 1.936237
2 2D5 5 51.69518 56.64778 51.75711 2002 54.17148 2.4143699 108.3430 2.093296
3 2HX 6 63.17951 51.86881 61.36143 2003 57.52416 -3.8372677 115.0483 1.874929
4 38A 10 56.42700 59.71146 57.09630 2008 58.06923 0.9729327 116.1385 2.034080
5 3CN 5 64.05189 58.96633 64.25165 2003 61.50911 -2.7425362 123.0182 1.914631
6 6021051 7 61.82983 63.39471 63.19407 1994 62.61227 -0.5818020 125.2245 1.981587
max rep prop
1 volume_1 2 33.33333
2 volume_2 3 50.00000
3 volume_1 2 33.33333
4 volume_2 3 50.00000
5 volume_3 1 16.66667
6 volume_2 3 50.00000
>
CodePudding user response:
Extract the volume columns, rank the columns in each row and then find the fraction of rank 3 for each. Multiply that by 100 if percentages are wanted.
How to handle ties was not defined so we used ties="first"
. See ?rank
for other possibilities.
rowMeans(apply(DF[grep("volume", names(DF))], 1, rank, ties = "first") == 3)
## volume_1 volume_2 volume_3
## 0.3333333 0.5000000 0.1666667
Here is a variation that gives the same result if there are no ties.
vol <- DF[grep("volume", names(DF))]
colMeans(vol == apply(vol, 1, max))
## volume_1 volume_2 volume_3
## 0.3333333 0.5000000 0.1666667