How to extract a column from filtered dataframe in a function?

Time:07-19

I am trying to write a function that extracts values from a data frame based on given parameters:

> f <- function(df, factor, level1, level2, response) {
      df1 <- df[df$factor == level1, ]$response
      df2 <- df[df$factor == level2, ]$response
      print(df1)
      print(df2)
  }
> f(ToothGrowth, supp, "VC", "OJ", len)
NULL
NULL

In the above example, the data frame is ToothGrowth, and the function f is meant to print two vectors: df1 should contain the len values from the rows where the supp column equals "VC", and df2 the values from the rows where it equals "OJ".

The equivalent Python code that uses pandas is as follows (very straightforward and works well):

import pandas as pd

def f(df: pd.DataFrame, factor: str, level1: str, level2: str, response: str):
    # Boolean-mask the rows, then select the response column by its name
    p1 = df[df[factor] == level1][response]
    p2 = df[df[factor] == level2][response]
    print(p1, p2)

Why are these outputs just NULL in R? What would be the best practice here to achieve the goal?

I am new to R but aware of the tidyverse. However, it seems very strange to me, and I don't really know whether such a heavy library is worth using for this purpose.

CodePudding user response:

I am quite sure that you need something like this:

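library(dplyr)  # provides %>%, filter() and pull()

f <- function(df, factor, level, response) {
  # {{ }} lets the caller pass the column names unquoted
  df <- df %>% 
    dplyr::filter({{factor}} == level) %>% 
    pull({{response}})
  print(df)
}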


f(ToothGrowth, supp, "VC", len)

#gives:
 [1]  4.2 11.5  7.3  5.8  6.4 10.0 11.2 11.2  5.2  7.0 16.5 16.5 15.2 17.3 22.5
[16] 17.3 13.6 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5

Explanation

What you are trying to produce with your function and the function call

f(ToothGrowth, "supp", "VC", "OJ", "len") is:

[1]  4.2 11.5  7.3  5.8  6.4 10.0 11.2 11.2  5.2  7.0 16.5 16.5 15.2 17.3 22.5
[16] 17.3 13.6 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5
 [1] 15.2 21.5 17.6  9.7 14.5 10.0  8.2  9.4 16.5  9.7 19.7 23.3 23.6 26.4 20.0
[16] 25.2 25.8 21.2 14.5 27.3 25.5 26.4 22.4 24.5 24.8 30.9 26.4 27.3 29.4 23.0

If this is your aim (which I doubt), then you could simply do:

ToothGrowth$len gives:

 [1]  4.2 11.5  7.3  5.8  6.4 10.0 11.2 11.2  5.2  7.0 16.5 16.5 15.2 17.3 22.5
[16] 17.3 13.6 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5
[31] 15.2 21.5 17.6  9.7 14.5 10.0  8.2  9.4 16.5  9.7 19.7 23.3 23.6 26.4 20.0
[46] 25.2 25.8 21.2 14.5 27.3 25.5 26.4 22.4 24.5 24.8 30.9 26.4 27.3 29.4 23.0

What makes more sense to me is that you may need something like this:

library(dplyr)  # provides %>% and filter()

f <- function(df, factor, level) {
  df <- df %>% 
    dplyr::filter({{factor}} == level)
  print(df)
}

f(ToothGrowth, supp, "VC")

What this function does is:

  1. From any data frame, identify a factor column, in our case supp (thanks to tidy evaluation, no quotes are needed).

  2. Filter the data frame on that column by the given factor level. I understand you may want to match two levels, but this is easily done by extending the filter line. The level itself does need quotes.

We get:

    len supp dose
1   4.2   VC  0.5
2  11.5   VC  0.5
3   7.3   VC  0.5
4   5.8   VC  0.5
5   6.4   VC  0.5
6  10.0   VC  0.5
7  11.2   VC  0.5
8  11.2   VC  0.5
9   5.2   VC  0.5
10  7.0   VC  0.5
11 16.5   VC  1.0
12 16.5   VC  1.0
13 15.2   VC  1.0
14 17.3   VC  1.0
15 22.5   VC  1.0
16 17.3   VC  1.0
17 13.6   VC  1.0
18 14.5   VC  1.0
19 18.8   VC  1.0
20 15.5   VC  1.0
21 23.6   VC  2.0
22 18.5   VC  2.0
23 33.9   VC  2.0
24 25.5   VC  2.0
25 26.4   VC  2.0
26 32.5   VC  2.0
27 26.7   VC  2.0
28 21.5   VC  2.0
29 23.3   VC  2.0
30 29.5   VC  2.0
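
Extending the filter line to cover two levels, as mentioned in step 2, might look like the following. This is only a sketch: the function name f_levels and the argument name wanted are made up for illustration, and it assumes dplyr is installed.

```r
library(dplyr)  # provides %>% and filter()

# Hypothetical variant of f (f_levels and wanted are illustrative names):
# keeps the rows whose factor column matches any of the wanted levels.
f_levels <- function(df, factor, wanted) {
  df %>%
    dplyr::filter({{ factor }} %in% wanted)
}

# The column is passed unquoted, the levels as a character vector
both <- f_levels(ToothGrowth, supp, c("VC", "OJ"))
print(nrow(both))
```

Using %in% rather than == keeps the function the same whether one level or several are requested.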

This is extendable, as shown at the beginning of my answer. For example, if we want to get the vector of values:

just add pull({{response}}) to the end of the pipeline, after adding response to the function's arguments.

CodePudding user response:

R and Python have similarities and dissimilarities. Your function in R should be written like this:

f <- function(df, factor, level1, level2, response) {
  df1 <- df[df[,factor] == level1, ][,response]
  df2 <- df[df[,factor] == level2, ][,response]
  print(df1)
  print(df2)
}

When you write df$factor, R looks for a column literally named "factor" instead of using the value of the factor argument as the column name. No such column exists in ToothGrowth, so the result is NULL.

And, as @rawr said, you have to quote the arguments in the call:

f(ToothGrowth, "supp", "VC", "OJ", "len")

This is because you are passing the names of the columns (variables) as strings, so they must be enclosed in quotes.
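
The difference between $ and indexing by a string can be checked directly in base R. The sketch below also shows one possible way to return the values instead of printing them; the helper name extract_by_level is hypothetical and not part of the answers above.

```r
# `$` takes the name literally; `[[` uses the value stored in the variable
col <- "len"
print(is.null(ToothGrowth$col))                        # no column is literally named "col"
print(identical(ToothGrowth[[col]], ToothGrowth$len))  # `[[` looks up the column "len"

# Hypothetical helper (illustrative name): returns the response values
# for the rows where the factor column equals the given level.
extract_by_level <- function(df, factor, level, response) {
  keep <- df[[factor]] == level   # logical vector, one entry per row
  df[[response]][keep]
}

vc <- extract_by_level(ToothGrowth, "supp", "VC", "len")
oj <- extract_by_level(ToothGrowth, "supp", "OJ", "len")
print(vc)
print(oj)
```

Returning the vectors rather than printing them inside the function makes it easier to reuse the results, for example to compare the two groups afterwards.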
