Home > front end >  Delete rows not partially matching certain column names
Delete rows not partially matching certain column names

Time:09-30

I'm working with a massive dataset in R. I have to delete all the rows not matching or partially matching some columns' names. Let me make a simple example:

df=data.frame(industry = c("C10-c12","C13-C15","C16","C17","C18","C21","C26"),
              country = c("USA"),
              AUS5 = runif(7),
              AUS6 = runif(7),
              AUS7 = runif(7),
              AUS8 = runif(7),
              AUS9 = runif(7),
              AUS10 = runif(7),
              AUS11 = runif(7),
              DEU5= runif(7),
              DEU6 = runif(7),
              DEU7 = runif(7),
              DEU8= runif(7),
              DEU9=runif(7),
              DEU10=runif(7),
              DEU11=runif(7))

How can I keep only those rows matching DEU as a country (so I want to discard those rows matching AUS as country) but for DEU rows I want to keep only DEU 5 , DEU 6, and DEU 10. OF course I'm looking for an efficient routine since I have a dozen of industrial sectors (i.e., the numbers following the country name) but luckily only a few countries to discard. That's why it could be better to discard unmatching countries first and then retain the matching industries.

CodePudding user response:

You are looking for the matches tidy-selector here. matches() selects columns based on pattern matching on column names.

library(dplyr)

df %>% select(c(industry, country, matches('DEU(5|6|(10))')))

  industry country      DEU5      DEU6     DEU10
1  C10-c12     USA 0.2608170 0.1457456 0.4039807
2  C13-C15     USA 0.9305637 0.5463292 0.5370193
3      C16     USA 0.5528665 0.7273499 0.5419586
4      C17     USA 0.5570023 0.1176463 0.6929336
5      C18     USA 0.2887649 0.3265162 0.9069305
6      C21     USA 0.8594316 0.3638107 0.3654471
7      C26     USA 0.1549968 0.7596185 0.6334502

We can also use num_range , which defines both a pattern and a number range

df %>% select(c(industry, country, num_range(prefix = "DEU", range = c(5,6,10))))

  industry country      DEU5      DEU6     DEU10
1  C10-c12     USA 0.2608170 0.1457456 0.4039807
2  C13-C15     USA 0.9305637 0.5463292 0.5370193
3      C16     USA 0.5528665 0.7273499 0.5419586
4      C17     USA 0.5570023 0.1176463 0.6929336
5      C18     USA 0.2887649 0.3265162 0.9069305
6      C21     USA 0.8594316 0.3638107 0.3654471
7      C26     USA 0.1549968 0.7596185 0.6334502

If we want to match just the trailing numbers, we can still use matches():

df %>% select(c(industry, country, matches('5|6|(10)')))

  industry country       AUS5      AUS6      AUS10      DEU5      DEU6     DEU10
1  C10-c12     USA 0.29360716 0.5946399 0.81057924 0.2608170 0.1457456 0.4039807
2  C13-C15     USA 0.39537172 0.6075995 0.38691946 0.9305637 0.5463292 0.5370193
3      C16     USA 0.27181800 0.5429097 0.08827088 0.5528665 0.7273499 0.5419586
4      C17     USA 0.71401203 0.4248934 0.11901394 0.5570023 0.1176463 0.6929336
5      C18     USA 0.38785040 0.9107354 0.23835065 0.2887649 0.3265162 0.9069305
6      C21     USA 0.08476850 0.9222833 0.94125342 0.8594316 0.3638107 0.3654471
7      C26     USA 0.02270776 0.5137409 0.86546239 0.1549968 0.7596185 0.6334502
  • Related