Generate individual data distributions using mean and standard deviation data from a data frame in R

Time:05-03

I have a data.frame in R with one row per level of a categorical variable, each with its own mean and standard deviation. For each level, I want to draw values from a normal distribution defined by that mean and standard deviation, and store the draws in a separate data.frame per level.

Here's some dummy data:

dummy_data <- data.frame(VARIABLE = LETTERS[seq( from = 1, to = 10 )],
                         MEAN = runif(10, 5, 10), SD = runif(10, 1, 3))

dummy_data

   VARIABLE     MEAN       SD
1         A 6.278751 1.937093
2         B 6.384247 2.487678
3         C 9.017496 2.003202
4         D 5.125994 1.829517
5         E 9.525213 1.914513
6         F 9.004893 2.734934
7         G 9.780757 2.511341
8         H 5.372160 1.510281
9         I 6.240331 2.796826
10        J 8.478280 2.325139

What I'd like to do from here is generate an individual data.frame for each row, each containing draws from a normal distribution defined by that row's MEAN and SD.

So, for example, I'd have a separate data.frame for A that contained:

A <- subset(dummy_data, VARIABLE == 'A')
A <- data.frame(rnorm(20,  A$MEAN, A$SD))

A

   rnorm.20..A.MEAN..A.SD.
1                 5.131331
2                 9.388104
3                 8.909453
4                 5.813257
5                 5.353137
6                 7.598521
7                 2.693924
8                 5.425703
9                 8.939687
10                9.148066
11                4.528936
12                7.576479
13                8.207456
14                6.838258
15                6.972061
16                7.824283
17                6.283434
18                4.503815
19                2.133388
20                7.472886

The real data I'm working with has far more than ten rows, so I'd rather not subset it one row at a time to generate the individual data.frames if I can help it.

Thanks in advance

CodePudding user response:

What about a solution using dplyr?

library(dplyr)

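# A single long data frame with 20 simulated values per VARIABLE
# (reframe() needs dplyr >= 1.1.0; older versions allowed summarise() to return several rows per group)
Huge_df <- dummy_data %>% group_by(VARIABLE) %>% reframe(SIMULATED = rnorm(20, MEAN, SD))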

# You can then split the data frame into a named list, one element per VARIABLE:
Splitted <- split(Huge_df, Huge_df$VARIABLE)

If you then need to save each individual data frame, or do something else with it, you can pull it out of the Splitted list by name, e.g. Splitted[["A"]], or loop over the list with lapply().
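As a rough sketch of what working with such a list might look like, the snippet below uses only base R. The two-level `long_df` is a made-up stand-in for the long data frame above, and `out_dir` is a hypothetical output location (a temporary directory here); neither comes from the original answer.

```r
# Stand-in for the split result: a named list of data frames (illustrative data only)
set.seed(1)
long_df <- data.frame(VARIABLE  = rep(c("A", "B"), each = 20),
                      SIMULATED = rnorm(40, mean = 7, sd = 2))
Splitted <- split(long_df, long_df$VARIABLE)

# Pull out one data frame by name
head(Splitted[["A"]])

# Apply the same operation to every data frame, e.g. the sample mean
sapply(Splitted, function(d) mean(d$SIMULATED))

# Write each data frame to its own CSV file
out_dir <- tempdir()  # hypothetical location; replace with a real directory
for (nm in names(Splitted)) {
  write.csv(Splitted[[nm]], file.path(out_dir, paste0(nm, ".csv")), row.names = FALSE)
}
```

Keeping the data frames in a list like this usually makes later steps easier than creating one object per level, since the same code then works for any number of levels.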

CodePudding user response:

Using data.table:

library(data.table)
result     <- setDT(dummy_data)[, .(sample=rnorm(20, mean=MEAN, sd=SD)), by=.(VARIABLE)]
list.of.df <- split(result, result$VARIABLE)

CodePudding user response:

You can put everything into a list, then return all the elements in the list to the global environment (if desired, or keep in the list):

set.seed(123)
dummy_data <- data.frame(VARIABLE = LETTERS[seq( from = 1, to = 10 )],
                         MEAN = runif(10, 5, 10), SD = runif(10, 1, 3))

# put all the values into a list
list_dist <- vector(mode = "list", length = nrow(dummy_data))
for (i in seq_len(nrow(dummy_data))) {
  list_dist[[i]] <- data.frame(values = rnorm(20, dummy_data$MEAN[i], dummy_data$SD[i]))
}
# name the list elements 
names(list_dist) <- dummy_data$VARIABLE

# or more detailed names, for instance, 
# names(list_dist) <- paste0(dummy_data$VARIABLE, "_Distribution")

# copy each list element into the global environment as its own object
list2env(list_dist, globalenv())
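For comparison, the same list could probably be built without an explicit loop using base R's Map(), which walks over the MEAN and SD columns in parallel. This is only a sketch of one alternative; `n_draws` is an assumed name for the sample size (20, to match the examples above).

```r
set.seed(123)
dummy_data <- data.frame(VARIABLE = LETTERS[1:10],
                         MEAN = runif(10, 5, 10), SD = runif(10, 1, 3))

n_draws <- 20  # assumed sample size per VARIABLE

# One data frame of normal draws per row of dummy_data
list_dist <- Map(function(m, s) data.frame(values = rnorm(n_draws, mean = m, sd = s)),
                 dummy_data$MEAN, dummy_data$SD)

# Name the list elements after the VARIABLE column
names(list_dist) <- as.character(dummy_data$VARIABLE)

# Inspect one element
str(list_dist[["A"]])
```

Because Map() pairs the two columns element by element, no row index or subsetting is needed, and the code scales to any number of rows.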