Home > Mobile >  Applying a function to a dataframe where another dataframe contains the criteria for filtering and t
Applying a function to a dataframe where another dataframe contains the criteria for filtering and t

Time:02-17

I searched over the weekend for a solution to this problem but was unable to find one. I did end up making a script which I believe is much longer than it needs to be and surely there is a quicker way with a loop or function.

I would like to create a function that loops through each row of data frame 1 (which has been derived from a frequency table). The function will use filter() and sample_n() to select records from data frame 2. Therefore, data frame 1 will act as criteria for filtering and sampling for data frame 2.

Please see my code below which does not return the records I am searching for.

The correct result would be to return 1 record from group A (1910), 3 from group A (1930), 2 from group B (1930), 1 from group C (1940), 1 from group D (1940) and 2 from group C (1930) at random.

Cheers,

Daniel

require(dplyr)

FilterA <- c("A","B","A","B","C","D","C")
FilterB <- c(1910,1920,1930,1930,1940,1940,1930)
Frequency <- c(1,0,3,2,1,1,2)

df1 <- data.frame(FilterA, FilterB, Frequency)

df1$num <- paste(FilterA, FilterB, sep=" ")

ID <- c("A", "A", "A", "A", "A", "A", "A", "A",
        "B", "B", "B", "B",
        "C", "C", "C", "C", "C", "C",
        "D", "D", "D")

Year <- c(1910, 1910, 1920, 1930, 1930, 1930, 1940, 1940,
          1930, 1920, 1930, 1930,
          1910, 1940, 1910, 1910, 1930, 1930,
          1930, 1940, 1940)

df2 <- data.frame(ID, Year)

case.control <- function(datF1, datF2, na.rm=TRUE, ...){
  
  ID_list <- unique(datF1$num)
  
  for (i in seq_along(ID_list)){
    
    func <- filter(datF2, Year == datF1$FilterB & ID == datF1$FilterA) %>% sample_n(datF1$Frequency)

    func
    
    }
  
}

x <- case.control(df1, df2)

CodePudding user response:

Thanks for that reproducible example - if you did want to go row-by-row:

First, I've tidied some of your code, there are no real changes here:

# Your code----
id <- c("A", "B", "A", "B", "C", "D", "C")
year <- c(1910, 1920, 1930, 1930, 1940, 1940, 1930)
frequency <- c(1, 0, 3, 2, 1, 1, 2)

df_1 <- data.frame(id,
  year,
  frequency,
  row.names = NULL
)

df_1$num <- paste(id, year)

df_1 <- df_1 %>%
  filter(frequency != 0)

id <- c(
  "A", "A", "A", "A", "A", "A", "A", "A",
  "B", "B", "B", "B",
  "C", "C", "C", "C", "C", "C",
  "D", "D", "D"
)

year <- c(
  1910, 1910, 1920, 1930, 1930, 1930, 1940, 1940,
  1930, 1920, 1930, 1930,
  1910, 1940, 1910, 1910, 1930, 1930,
  1930, 1940, 1940
)

df_2 <- data.frame(id, year)

Now, moving on to actually obtaining random samples, using lapply(). You could use apply() to iterate over rows of a dataframe, but I personally find working with apply() confusing, so instead I'm turning df_1 into a list, making each row an object in it.

library(dplyr)
list_1 <- split(df_1, seq(nrow(df_1)))

I then use lapply() to iterate over each row with the following function:

# Option 1: lapply()----
random_records <- lapply(list_1, function(x) {
  df_records <- df_2 %>%
   # Matching up the years and id in df_2
    filter(year == x$year & id == x$id) %>%
    # Using the frequency with slice_sample(), sample_n() is also fine
    slice_sample(n = x$frequency)
})

# Then bind the list back together again into a dataframe
random_records <- bind_rows(random_records)

Or, another option that personally I prefer is using purrr's map_df(), because it returns a dataframe straight away.

# Option 2: purrr's map_df()
# I think this option is the neatest, because it returns a df immediately
library(purrr)
random_records <- map_df(list_1, function(x) {
  df_records <- df_2 %>%
    filter(year == x$year & id == x$id) %>%
    slice_sample(n = x$frequency)
})

CodePudding user response:

Instead of dealing with this on row by row basis you can join the two dataframes and randomly select rows based on Frequency column for each unique value of FilterA and FilterB.

library(dplyr)

df1 %>%
  filter(Frequency > 0) %>%
  left_join(df2, by = c('FilterA' = 'ID', 'FilterB' = 'Year')) %>%
  group_by(FilterA, FilterB) %>%
  sample_n(first(Frequency))

If you have all the information needed in final output in df1 itself then you can use uncount directly to expand the df1 dataset.

This works for the example shared because there is no new information in df2.

tidyr::uncount(df1, Frequency)
  • Related