Delete rows not partially matching certain column names-CodePudding

I'm working with a massive dataset in R. I have to delete all the rows not matching or partially matching some columns' names. Let me make a simple example:

df=data.frame(industry = c("C10-c12","C13-C15","C16","C17","C18","C21","C26"),
              country = c("USA"),
              AUS5 = runif(7),
              AUS6 = runif(7),
              AUS7 = runif(7),
              AUS8 = runif(7),
              AUS9 = runif(7),
              AUS10 = runif(7),
              AUS11 = runif(7),
              DEU5= runif(7),
              DEU6 = runif(7),
              DEU7 = runif(7),
              DEU8= runif(7),
              DEU9=runif(7),
              DEU10=runif(7),
              DEU11=runif(7))

How can I keep only those rows matching DEU as a country (so I want to discard those rows matching AUS as country) but for DEU rows I want to keep only DEU 5 , DEU 6, and DEU 10. OF course I'm looking for an efficient routine since I have a dozen of industrial sectors (i.e., the numbers following the country name) but luckily only a few countries to discard. That's why it could be better to discard unmatching countries first and then retain the matching industries.

CodePudding user response：

You are looking for the matches tidy-selector here. matches() selects columns based on pattern matching on column names.

library(dplyr)

df %>% select(c(industry, country, matches('DEU(5|6|(10))')))

  industry country      DEU5      DEU6     DEU10
1  C10-c12     USA 0.2608170 0.1457456 0.4039807
2  C13-C15     USA 0.9305637 0.5463292 0.5370193
3      C16     USA 0.5528665 0.7273499 0.5419586
4      C17     USA 0.5570023 0.1176463 0.6929336
5      C18     USA 0.2887649 0.3265162 0.9069305
6      C21     USA 0.8594316 0.3638107 0.3654471
7      C26     USA 0.1549968 0.7596185 0.6334502

We can also use num_range , which defines both a pattern and a number range

df %>% select(c(industry, country, num_range(prefix = "DEU", range = c(5,6,10))))

  industry country      DEU5      DEU6     DEU10
1  C10-c12     USA 0.2608170 0.1457456 0.4039807
2  C13-C15     USA 0.9305637 0.5463292 0.5370193
3      C16     USA 0.5528665 0.7273499 0.5419586
4      C17     USA 0.5570023 0.1176463 0.6929336
5      C18     USA 0.2887649 0.3265162 0.9069305
6      C21     USA 0.8594316 0.3638107 0.3654471
7      C26     USA 0.1549968 0.7596185 0.6334502

If we want to match just the trailing numbers, we can still use matches():

df %>% select(c(industry, country, matches('5|6|(10)')))

  industry country       AUS5      AUS6      AUS10      DEU5      DEU6     DEU10
1  C10-c12     USA 0.29360716 0.5946399 0.81057924 0.2608170 0.1457456 0.4039807
2  C13-C15     USA 0.39537172 0.6075995 0.38691946 0.9305637 0.5463292 0.5370193
3      C16     USA 0.27181800 0.5429097 0.08827088 0.5528665 0.7273499 0.5419586
4      C17     USA 0.71401203 0.4248934 0.11901394 0.5570023 0.1176463 0.6929336
5      C18     USA 0.38785040 0.9107354 0.23835065 0.2887649 0.3265162 0.9069305
6      C21     USA 0.08476850 0.9222833 0.94125342 0.8594316 0.3638107 0.3654471
7      C26     USA 0.02270776 0.5137409 0.86546239 0.1549968 0.7596185 0.6334502