Filter columns in a dataframe using a specific word

I have a data frame with 70 columns, most of which are numeric variables holding gender data. I'd like to subset the data frame so that I have only the male numeric variables or only the female numeric variables. The goal is to get the total sum of the male counts and of the female counts in the data frame.

Hypothetically, the data frame looks like this:

Client <- c("A", "B", "C","D","B","D","C")
ContactPerson <- c("Andrew","Mary","John","Barbara","Mary","Barbara","John")
`Rural Male persons` <- c(3, 1, 3, 4, 5, 3, 1)
`Rural female persons` <- c(3, 5, 3, 2, 6,2,1)
`Urban Male persons` <- c(4, 2, 5, 1, 0, 4, 2)
`Urban female persons` <- c(6, 9, 1, 7, 3, 2, 1)

DF <- data.frame(Client, ContactPerson, `Rural Male persons`, `Rural female persons`, `Urban Male persons`, `Urban female persons`)

I've tried subsetting to get only the male numeric variables using these calls:

Male <- DF %>%
  select(matches("Male|Client|ContactPerson"))

and

Male <- DF %>%
  select(contains(c("Partner", "Person", "Male")))

But both still return the male and the female variables. I guess this is because the word "male" is contained in "female". Is there a way to subset these columns explicitly by a specific word, e.g. Male?

CodePudding user response：

Yes, you can do this with grep, for example like this:

Client <- c("A", "B", "C","D","B","D","C")
ContactPerson <- c("Andrew","Mary","John","Barbara","Mary","Barbara","John")
`Rural Male persons` <- c(3, 1, 3, 4, 5, 3, 1)
`Rural female persons` <- c(3, 5, 3, 2, 6,2,1)
`Urban Male persons` <- c(4, 2, 5, 1, 0, 4, 2)
`Urban female persons` <- c(6, 9, 1, 7, 3, 2, 1)

DF <- data.frame(Client, ContactPerson, `Rural Male persons`, `Rural female persons`, `Urban Male persons`, `Urban female persons`)

library(dplyr)

col_names <- colnames(DF)
# \\b marks a word boundary, so only the standalone word "Male" matches
male_cols <- grep("\\bMale\\b", col_names, value = TRUE)

Male <- DF %>%
  select(all_of(male_cols))
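
The question ultimately asks for the male and female totals, so here is a minimal base R sketch of that last step. It rebuilds the example data (without ContactPerson, which is not needed here) and assumes every column whose name contains the standalone word "Male" or "female" is a numeric count. The names male_total and female_total are just illustrative, and ignore.case = TRUE is an assumption in case the capitalisation differs across the real 70 columns.

```r
# Rebuild the example data; data.frame() turns the spaces in the names into dots
DF <- data.frame(
  Client = c("A", "B", "C", "D", "B", "D", "C"),
  `Rural Male persons` = c(3, 1, 3, 4, 5, 3, 1),
  `Rural female persons` = c(3, 5, 3, 2, 6, 2, 1),
  `Urban Male persons` = c(4, 2, 5, 1, 0, 4, 2),
  `Urban female persons` = c(6, 9, 1, 7, 3, 2, 1)
)

# Word boundaries keep "male" from matching inside "female",
# even though the match itself ignores case
male_cols   <- grep("\\bmale\\b", names(DF), value = TRUE, ignore.case = TRUE)
female_cols <- grep("\\bfemale\\b", names(DF), value = TRUE, ignore.case = TRUE)

# Sum each selected column, then add the column sums together
male_total   <- sum(colSums(DF[male_cols]))
female_total <- sum(colSums(DF[female_cols]))

male_total    # 38
female_total  # 51
```

colSums() alone gives the per-column totals if those are more useful than the grand totals.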

CodePudding user response：

contains matches literal substrings, not regular expressions; for a regex you need matches. Note that matches ignores case by default (ignore.case = TRUE), so the pattern "Male" matches both "male" and "Male", and therefore also the "female" columns, whose names contain the string "male". To avoid that, draw word boundaries \\b around "Male":

DF %>%
  select(matches("\\bMale\\b"))
  Rural.Male.persons Urban.Male.persons
1                  3                  4
2                  1                  2
3                  3                  5
4                  4                  1
5                  5                  0
6                  3                  4
7                  1                  2
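
To see the effect of the case handling in isolation, here is a small sketch that compares three patterns on the example data. It assumes dplyr is installed; the variable names loose, strict and bounded are only illustrative.

```r
library(dplyr)

# Rebuild the example data; data.frame() turns the spaces in the names into dots
DF <- data.frame(
  Client = c("A", "B", "C", "D", "B", "D", "C"),
  ContactPerson = c("Andrew", "Mary", "John", "Barbara", "Mary", "Barbara", "John"),
  `Rural Male persons` = c(3, 1, 3, 4, 5, 3, 1),
  `Rural female persons` = c(3, 5, 3, 2, 6, 2, 1),
  `Urban Male persons` = c(4, 2, 5, 1, 0, 4, 2),
  `Urban female persons` = c(6, 9, 1, 7, 3, 2, 1)
)

# Default ignore.case = TRUE: "Male" also hits the "female" columns
loose <- names(select(DF, matches("Male")))

# Case-sensitive match: only names containing a capitalised "Male"
strict <- names(select(DF, matches("Male", ignore.case = FALSE)))

# Word boundaries: works regardless of capitalisation
bounded <- names(select(DF, matches("\\bMale\\b")))

loose    # all four count columns
strict   # "Rural.Male.persons" "Urban.Male.persons"
bounded  # "Rural.Male.persons" "Urban.Male.persons"
```

The case-sensitive variant only works here because the example capitalises "Male" and not "female"; the word-boundary pattern does not rely on that.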