Home > Software engineering >  Select columns from a data frame
Select columns from a data frame

Time:09-19

I have a Data Frame made up of several columns, each corresponding to a different industry per country. I have 56 industries and 43 countries and I'd select only industries from 5 to 22 per country (18 industries). The big issue is that each industry per country is named as: AUS1, AUS2 ..., AUS56. What I shall select is AUS5 to AUS22, AUT5 to AUT22 .... A viable solution could be to select columns according to the following algorithm: the first column of interest, i.e., AUS5 corresponds to column 10 and then I select up to AUS22 (corresponding to column 27). Then, I should skip all the remaining column for AUS (i.e. AUS23 to AUS56), and the first 4 columns for the next country (from AUT1 to AUT4). Then, I select, as before, industries from 5 to 22 for AUT. Basically, the algorithm, starting from column 10 should be able to select 18 columns(including column 10) and then skip the next 38 columns, and then select the next 18 columns. This process should be repeated for all the 43 countries. How can I code that?

UPDATE, Example:

df=data.frame(industry = c("C10","C11","C12","C13"),
                   country = c("USA"),
                   AUS3 = runif(4),
                   AUS4 = runif(4),
                   AUS5 = runif(4),
                   AUS6 = runif(4),
                   DEU5 = runif(4),
                   DEU6 = runif(4),
                   DEU7 = runif(4),
                   DEU8 = runif(4))
#I'm interested only in C10-c11:

df_a=df %>% filter(grepl('C10|C11',industry))
df_a

#Thus, how can I select columns AUS10,AUS11, DEU10,DEU11 efficiently, considering that I have a huge dataset?

CodePudding user response:

Demonstrating the paste0 approach.

ctr <- unique(gsub('\\d', '', names(df[-(1:2)])))
# ctr <- c("AUS", "DEU")  ## alternatively hard-coded

ind <- c(10, 11)

subset(df, industry == paste0('C', 10:11), 
       select=c('industry', 'country', paste0(rep(ctr, each=length(ind)), ind)))
#   industry country     AUS10     AUS11     DEU10     DEU11
# 1      C10     USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2      C11     USA 0.7421840 0.6808892 0.9050158 0.3689741

Or, since you appear to like grep you could do.

df[grep('10|11', df$industry), grep('industry|country|[A-Z]{3}1[01]', names(df))]
#   industry country     AUS10     AUS11     DEU10     DEU11
# 1      C10     USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2      C11     USA 0.7421840 0.6808892 0.9050158 0.3689741

CodePudding user response:

If you have a big data set in memory, data.table could be ideal and much faster than alternatives. Something like the following could work, though you will need to play with select_ind and select_ctr as desired on the real dataset.

It might be worth giving us a slightly larger toy example, if possible.

library(data.table)
setDT(df)


select_ind <- paste0(c("C"), c("11","10"))
select_ctr <- paste0(rep(c("AUS", "DEU"), each = 2), c("10","11"))


df[grepl(paste0(select_ind, collapse = "|"), industry), # select rows
   ..select_ctr] # select columns    

       AUS10     AUS11     DEU10     DEU11
1: 0.9040223 0.2638725 0.9779399 0.1672789
2: 0.6162678 0.3095942 0.1527307 0.6270880

For more information, see Introduction to data.table.

  • Related