I'm working with a massive dataset in R. I have to delete all the rows not matching or partially matching some columns' names. Let me make a simple example:
df=data.frame(industry = c("C10-c12","C13-C15","C16","C17","C18","C21","C26"),
country = c("USA"),
AUS5 = runif(7),
AUS6 = runif(7),
AUS7 = runif(7),
AUS8 = runif(7),
AUS9 = runif(7),
AUS10 = runif(7),
AUS11 = runif(7),
DEU5= runif(7),
DEU6 = runif(7),
DEU7 = runif(7),
DEU8= runif(7),
DEU9=runif(7),
DEU10=runif(7),
DEU11=runif(7))
How can I keep only those rows matching DEU as a country (so I want to discard those rows matching AUS as country) but for DEU rows I want to keep only DEU 5 , DEU 6, and DEU 10. OF course I'm looking for an efficient routine since I have a dozen of industrial sectors (i.e., the numbers following the country name) but luckily only a few countries to discard. That's why it could be better to discard unmatching countries first and then retain the matching industries.
CodePudding user response:
You are looking for the matches
tidy-selector here.
matches()
selects columns based on pattern matching on column names.
library(dplyr)
df %>% select(c(industry, country, matches('DEU(5|6|(10))')))
industry country DEU5 DEU6 DEU10
1 C10-c12 USA 0.2608170 0.1457456 0.4039807
2 C13-C15 USA 0.9305637 0.5463292 0.5370193
3 C16 USA 0.5528665 0.7273499 0.5419586
4 C17 USA 0.5570023 0.1176463 0.6929336
5 C18 USA 0.2887649 0.3265162 0.9069305
6 C21 USA 0.8594316 0.3638107 0.3654471
7 C26 USA 0.1549968 0.7596185 0.6334502
We can also use num_range
, which defines both a pattern
and a number range
df %>% select(c(industry, country, num_range(prefix = "DEU", range = c(5,6,10))))
industry country DEU5 DEU6 DEU10
1 C10-c12 USA 0.2608170 0.1457456 0.4039807
2 C13-C15 USA 0.9305637 0.5463292 0.5370193
3 C16 USA 0.5528665 0.7273499 0.5419586
4 C17 USA 0.5570023 0.1176463 0.6929336
5 C18 USA 0.2887649 0.3265162 0.9069305
6 C21 USA 0.8594316 0.3638107 0.3654471
7 C26 USA 0.1549968 0.7596185 0.6334502
If we want to match just the trailing numbers, we can still use matches()
:
df %>% select(c(industry, country, matches('5|6|(10)')))
industry country AUS5 AUS6 AUS10 DEU5 DEU6 DEU10
1 C10-c12 USA 0.29360716 0.5946399 0.81057924 0.2608170 0.1457456 0.4039807
2 C13-C15 USA 0.39537172 0.6075995 0.38691946 0.9305637 0.5463292 0.5370193
3 C16 USA 0.27181800 0.5429097 0.08827088 0.5528665 0.7273499 0.5419586
4 C17 USA 0.71401203 0.4248934 0.11901394 0.5570023 0.1176463 0.6929336
5 C18 USA 0.38785040 0.9107354 0.23835065 0.2887649 0.3265162 0.9069305
6 C21 USA 0.08476850 0.9222833 0.94125342 0.8594316 0.3638107 0.3654471
7 C26 USA 0.02270776 0.5137409 0.86546239 0.1549968 0.7596185 0.6334502