Looping a function/analysis in R by unique column values

Time:05-12

I am trying to sort people in my dataset into three body-size categories (Small, Medium, Large). Consider the toy example below:

library(ggplot2)
library(dplyr)

# Toy dataset
Data<-data.frame(Age=c(40,40,40,41,41,41,42,42,42),
                 Height=c(180,179,178,177,176,175,174,173,172),
                 Weight=c(84,83,82,81,80,79,78,77,76))

# Classify people as Small, Medium, or Large
Data$Size<-Data$Height+Data$Weight
Data$Sizerank<-rank(Data$Size)
Data$Sizegroup<-as.numeric(cut_number(Data$Sizerank,3))
Data$CohortL<-ifelse(Data$Sizegroup==3,"Large",NA)
Data$CohortM<-ifelse(Data$Sizegroup==2,"Medium",NA)
Data$CohortS<-ifelse(Data$Sizegroup==1,"Small",NA)
temp1<-as.vector(Data$CohortL)
temp2<-as.vector(Data$CohortM)
temp3<-as.vector(Data$CohortS)
temp4<-data.frame(temp1,temp2,temp3)
temp5<-temp4%>%mutate(Cohort=coalesce(temp1,temp2,temp3))
Data$Cohort<-temp5$Cohort
Data<-data.frame(Data$Age,
                 Data$Height,
                 Data$Weight,
                 Data$Cohort)
colnames(Data)<-c("Age","Height","Weight","Cohort")

# Remove temporary files from workspace
rm(temp1,
   temp2,
   temp3,
   temp4,
   temp5)

# Print Data
Data

This code quantifies whether people are "Small" (bottom 1/3rd), "Medium" (middle 1/3rd), or "Large" (top 1/3rd), as compared to the whole dataset.

I would like to expand this code to perform the size ranking/grouping separately for each age group: for example, ranking all 40-year-olds as Small, Medium, or Large compared to other 40-year-olds, not the population at large. Ranking separately for each age group would clearly change the Cohort membership, in this case from Large/Large/Large/Medium/Medium/Medium/Small/Small/Small to Large/Medium/Small/Large/Medium/Small/Large/Medium/Small.

If I only had three age groups then I would just run this analysis manually, but I have a much wider age range than this in practice, so I think that I need some sort of looping function, maybe a for loop or one of the apply() functions?

Any help or insights would be greatly appreciated. Thank you very much.

P.S. I am also aware that my method of constructing the "Cohort" column is cumbersome, so if anybody knows of a more elegant approach, I would be very happy to learn about it.

CodePudding user response:

How about this:

library(dplyr)

Data %>% 
  group_by(Age) %>% 
  mutate(size = gtools::quantcut(I(Height + Weight), 
                                 q=3, 
                                 labels=c("Small", "Medium", "Large")))
#> # A tibble: 9 × 5
#> # Groups:   Age [3]
#>     Age Height Weight Cohort size  
#>   <dbl>  <dbl>  <dbl> <chr>  <fct> 
#> 1    40    180     84 Large  Large 
#> 2    40    179     83 Large  Medium
#> 3    40    178     82 Large  Small 
#> 4    41    177     81 Medium Large 
#> 5    41    176     80 Medium Medium
#> 6    41    175     79 Medium Small 
#> 7    42    174     78 Small  Large 
#> 8    42    173     77 Small  Medium
#> 9    42    172     76 Small  Small
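
If you would rather not add gtools as a dependency, something along these lines should also work with dplyr alone. This is only a sketch: it assumes Size is Height + Weight as in the question, and it uses dplyr's ntile(), which breaks ties by row order and may make the groups uneven when an age group's size is not a multiple of three. The name Result is just a placeholder.

```r
library(dplyr)

# Toy dataset from the question
Data <- data.frame(Age    = c(40, 40, 40, 41, 41, 41, 42, 42, 42),
                   Height = c(180, 179, 178, 177, 176, 175, 174, 173, 172),
                   Weight = c(84, 83, 82, 81, 80, 79, 78, 77, 76))

Result <- Data %>%
  group_by(Age) %>%
  # ntile() assigns each row to bucket 1, 2 or 3 within its age group;
  # factor() then maps the bucket numbers straight to labels
  mutate(Cohort = factor(ntile(Height + Weight, 3),
                         levels = 1:3,
                         labels = c("Small", "Medium", "Large"))) %>%
  ungroup()

Result
```

This also covers the P.S. in the question: labelling the buckets with factor() replaces the three ifelse() columns and the coalesce() step.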

CodePudding user response:

I'm not sure I understand the question exactly, but try this.

Note: You need to install the Hmisc package.

Data2 <- Data %>%
      mutate(Size = Age + Height + Weight) %>%
      group_by(Age) %>%
      mutate(Cohort_groups = as.numeric(Hmisc::cut2(Size, g=3))) %>%
      mutate(Cohort = case_when(
        Cohort_groups  == 3 ~ "Large",
        Cohort_groups == 2 ~ "Medium",
        Cohort_groups == 1 ~ "Small")) %>%
      select(-Cohort_groups)
      
Data2
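
Since the question asks about a for loop or the apply() family: in base R, ave() applies a function within each group and returns the results in the original row order, so no explicit loop is needed. The sketch below assumes Size is Height + Weight, and size_tercile is a made-up helper name, not a function from any package.

```r
# Toy dataset from the question
Data <- data.frame(Age    = c(40, 40, 40, 41, 41, 41, 42, 42, 42),
                   Height = c(180, 179, 178, 177, 176, 175, 174, 173, 172),
                   Weight = c(84, 83, 82, 81, 80, 79, 78, 77, 76))
Data$Size <- Data$Height + Data$Weight

# Hypothetical helper: turn within-group ranks into tercile numbers 1, 2, 3
size_tercile <- function(x) {
  ceiling(3 * rank(x, ties.method = "first") / length(x))
}

# ave() runs size_tercile separately for each Age value
Data$Cohort <- factor(ave(Data$Size, Data$Age, FUN = size_tercile),
                      levels = 1:3,
                      labels = c("Small", "Medium", "Large"))

Data
```

Calling size_tercile(Data$Size) without ave() ranks against the whole dataset, which is the behaviour of the original code in the question.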