Turning a subset based on multiple parameters into a looping function-CodePudding

I have a dataframe with 3000 items (each row is a new item with the item_id in the first column) with 2500 attributes in each subsequent column.

item_id    attribute_1    attribute_2 ... category_1 ... category_2
x3902      1              5               1              0
x1402      2              0               1              1

I am able to successfully grab a random list of 10 items that fulfil a given attribute and category:

subset_of_items = sample(subset(df$item_id, (df$attribute1 == 3 & df$category_2 == 1 )),10)

I need to repeat this for 30 combinations of attribute category. So I was hoping to create a function that loops and saves the output into a table where each table column is a list of 10 items that fulfils a set of conditions.

This is as far as I could get, but can't get it to successfully run:

df <- read.csv ("data.csv",sep=",")

attribute <- c("attribute_1", "attribute_2", "attribute_3")
value <- c(1, 2, 5)
category <- c("category_1", "category_1", "category_2")

function1 <- function(x, attribute_x, value_x, category_x)
{
  dftemp = subset(x[["item_id"]], (x[[attribute_x]] == value_x & x[[category_x]] == 1 ))
  output <- cbind(output, dftemp)
}

function1(df, attribute, value, category)

The cherry on the cake would be to name each of the columns as attribute_x .

CodePudding user response：

You can do the following (this uses a data.table approach)

Set your df to data.table

library(data.table)
df <- read.csv("data.csv",sep=",")
setDT(df)

Make a revised function, called get_subset() that returns a vector of item_ids that meet the attribute and category requirements, using get(). It also checks if you have size items return, and pads with as.character(NA) to ensure that the returned vector has length = size

get_subset <- function(x, attribute_x, value_x, category_x, size=10)
{
  subset = x[get(attribute_x)==value_x & get(category_x) == 1, .(item_id)] %>% 
    .[sample(1:.N, size=min(size, .N)), item_id]
  if(length(subset)<size) subset=c(subset,rep(as.character(NA),size-length(subset)))
  return(subset)
}

Put your combinations of interest into a small data.table

combinations = data.table(
  attribute = c("attribute_1", "attribute_2"),
  value = c(1, 5),
  category = c("category_1", "category_2")
)

use lapply, to loop over the rows of your combinations, each time feeding the attribute, value, and category value of the combination to the get_subset function

result = lapply(seq(1,nrow(combinations)), function(i) {
    items = data.table(get_subset(
      x=df,
      attribute_x = combinations[i,attribute],
      value_x = combinations[i,value],
      category_x = combinations[i, category],
      size=10))
    setNames(items, combinations[i,attribute])
})

That returns a list of data.tables with a single column, that is named after the attribute for that contribution

Use do.call with cbind to bind them into a single data.table

result = do.call(cbind, result)

If this is your original df

df = data.table(
  item_id = paste0("x", stringr::str_pad(sample(1:1000,100),width = 4,side = "left",pad = "0")),
  attribute_1=sample(1:10, 100, replace=T),
  attribute_2=sample(1:10, 100, replace=T),
  category_1=sample(c(0,1),100, replace=T),
  category_2=sample(c(0,1),100, replace=T)
)

then the above combinations, will return something like the following in result (Note, that my df example uses sample without seed, so your results will differ)

    attribute_1 attribute_2
 1:       x7511       x7302
 2:       x0325       x0956
 3:       x6748       x1893
 4:       x4462       x6958
 5:       x4146        <NA>
 6:       x5407        <NA>
 7:        <NA>        <NA>
 8:        <NA>        <NA>
 9:        <NA>        <NA>
10:        <NA>        <NA>