Programming pattern for joining data from multiple data sets in R

Time:06-14

I was working on an exercise that asked me to compare a plain binomial distribution with a mixture (beta-binomial) distribution. One of the tasks was to compare their means and standard deviations. To present my findings, I wanted a table of summary statistics for each distribution. The following is what I created (I have also included the data it came from):


library(dplyr) # needed for %>%, summarise(), bind_rows(), mutate() and select()
set.seed(300)

binom_guid_obs = rbinom(n = 1000, size = 10, prob = 0.8) # binomial random variable
binom_guid_tbl = data.frame( "success" = binom_guid_obs)

probs_frm_beta = rbeta(n = 1000, shape1 = 4, shape2 = 1)
binom_beta_params_obs = rbinom(n = 1000, size = 10, prob = probs_frm_beta) # beta-binomial random variable
binom_beta_params_tbl = data.frame("success" = binom_beta_params_obs)

plain_binom_summ_stats = binom_guid_tbl %>% summarise("mean" = mean(success), "sd" = sd(success))
binom_beta_params_summ_stats = binom_beta_params_tbl %>%  summarise("mean" = mean(success), "sd" = sd(success))

binded_rows_plain_beta_binom = bind_rows(plain_binom_summ_stats, binom_beta_params_summ_stats)
binded_rows_plain_beta_binom = binded_rows_plain_beta_binom %>% mutate("name" = c("plain_binom", "binom_beta")) %>% select(name, 1:2)

As can be seen, I successfully created the table, but I feel I had to do a lot of unnecessary fiddling to get there. In particular, the awkward part was creating a column for the names of the two data sets. Is there a simpler, cleaner programming pattern for a scenario like this, one that isn't as "clunky" as the current one? It seems there should be, since I'm not doing anything out of the ordinary, just comparing distributions.

Answer:

Try this:

library(dplyr, warn.conflicts = FALSE)
set.seed(300)

binom_guid_obs = rbinom(n = 1000, size = 10, prob = 0.8) # binomial random variable
probs_frm_beta = rbeta(n = 1000, shape1 = 4, shape2 = 1)
binom_beta_params_obs = rbinom(n = 1000, size = 10, prob = probs_frm_beta) 

df <- data.frame(bisuccess = binom_guid_obs , bbsuccess = binom_beta_params_obs)

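# One row per data set: each summary column is a length-2 vector.
# Note: returning more than one row from summarise() is deprecated as of
# dplyr 1.1.0; reframe() is the suggested replacement there.
df %>% summarise(mean = c(mean(bisuccess) , mean(bbsuccess)) ,
                 sd = c(sd(bisuccess) , sd(bbsuccess))) -> df
rownames(df) <- c("plain_binom" , "binom_beta")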

df
#>              mean       sd
#> plain_binom 7.963 1.298968
#> binom_beta  8.131 1.976802
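
If you want the labels in a regular column rather than in row names, another common pattern is to put the data frames in a named list and let bind_rows() create the label column through its .id argument. The following is only a sketch under that assumption; the names samples and summ_stats are made up for illustration and do not appear in the question.

```r
library(dplyr, warn.conflicts = FALSE)
set.seed(300)

binom_guid_obs <- rbinom(n = 1000, size = 10, prob = 0.8)
probs_frm_beta <- rbeta(n = 1000, shape1 = 4, shape2 = 1)
binom_beta_params_obs <- rbinom(n = 1000, size = 10, prob = probs_frm_beta)

# Hypothetical helper object: a named list of data frames.
# The list names become the values of the "name" column below.
samples <- list(
  plain_binom = data.frame(success = binom_guid_obs),
  binom_beta  = data.frame(success = binom_beta_params_obs)
)

# Stack the data frames, tagging each row with its list name,
# then summarise once per group instead of once per data set.
summ_stats <- bind_rows(samples, .id = "name") %>%
  group_by(name) %>%
  summarise(mean = mean(success), sd = sd(success))

summ_stats
```

Adding a third distribution then only means adding one more entry to the list. Note that group_by() orders the groups alphabetically, so the rows may not come out in the order the list was written.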

In base R:

x <- c(binom_guid_obs , binom_beta_params_obs)

y <- gl(2 , 1000 , labels = c("plain_binom" , "binom_beta"))

df <- cbind(tapply(x , y , mean) , tapply(x , y , sd))

colnames(df) <- c("mean" , "sd")

df

Created on 2022-06-14 by the reprex package (v2.0.1)
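
A variation on the base R approach, sketched here as one possible alternative rather than the only way: keep the samples in a named list and apply the summary functions with sapply(). The names samples and stats are hypothetical. Unlike the gl(2, 1000) grouping, this does not rely on both samples having exactly 1000 observations.

```r
set.seed(300)

binom_guid_obs <- rbinom(n = 1000, size = 10, prob = 0.8)
probs_frm_beta <- rbeta(n = 1000, shape1 = 4, shape2 = 1)
binom_beta_params_obs <- rbinom(n = 1000, size = 10, prob = probs_frm_beta)

# Hypothetical helper object: a named list of the raw samples.
samples <- list(
  plain_binom = binom_guid_obs,
  binom_beta  = binom_beta_params_obs
)

# sapply() returns a matrix with one column per sample and one row per
# statistic; t() flips it so that each sample becomes a row.
stats <- t(sapply(samples, function(x) c(mean = mean(x), sd = sd(x))))

stats
```

The result is a numeric matrix whose row names are the list names; wrap it in as.data.frame() if a data frame is needed.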
