Home > OS >  Join by value in column A, if value is between column M and N
Join by value in column A, if value is between column M and N

Time:11-26

I want to join "Division" from table2 to "Industry" in table1. To do so I will have to match the HSICCD from tablet1 that is between "from" and "to" in table2.

Just to be clear: If HSICCD in table1 is between from/to in table2, I want to take value from table2$division and add it to table1$industry(Or a new column like in a join).

Is there a join function in R that does this without too much hassle? (dplyr solutions are desired, but I'm glad for every contribution!)

**Table1:**
PERMNO HSICCD Industry
    <dbl>  <dbl> <lgl>   
 1  10000   3990 NA      
 2  10001   4925 NA      
 3  10002   6020 NA      
 4  10003   6020 NA      
 5  10004   5330 NA      
 6  10005   1310 NA      
 7  10006   3743 NA      
 8  10007   7370 NA      
 9  10008   3430 NA      
10  10009   6030 NA  

**Table2:**
        from    to division      
       <dbl> <dbl> <chr>         
     1   100   999 Agriculture   
     2  1000  1499 Mining        
     3  1500  1799 Construction  
     4  1800  1999 Other         
     5  2000  3999 Manufacturing 
     6  4000  4999 Transportation
     7  5000  5199 Wholesale     
     8  5200  5999 Retail        
     9  6000  6799 Finance       
    10  7000  8999 Services      
    11  9100  9729 Public        
    12  9900  9999 Other

My only solution so far is this horrendous code:

Compustat_identifiers$Industry <- NA

for (hsiccd in 1:nrow(Compustat_identifiers)) {
  for (SIC in 1:nrow(sic_table)) {
    if (is.na(Compustat_identifiers$HSICCD[hsiccd]) == T) {
      Compustat_identifiers$Industry[hsiccd] <- "Other"
    } else if (Compustat_identifiers$HSICCD[hsiccd] >= sic_table$from[SIC] &
               Compustat_identifiers$HSICCD[hsiccd] <= sic_table$to[SIC]) {
                Compustat_identifiers$Industry[hsiccd] <- sic_table$division[SIC]
    }
  }
}

CodePudding user response:

With base R you can do this.

First collect HSICCD from tbl1 and match with the condition to tbl2$division. Then simply put back into tbl1$Industry.

tbl1$Industry <- sapply( tbl1$HSICCD, 
     function(x) tbl2[ apply( tbl2, 1,
     function(y) x>y[1]&x<y[2] ),"division"] )

   PERMNO HSICCD       Industry
1   10000   3990  Manufacturing
2   10001   4925 Transportation
3   10002   6020        Finance
4   10003   6020        Finance
5   10004   5330         Retail
6   10005   1310         Mining
7   10006   3743  Manufacturing
8   10007   7370       Services
9   10008   3430  Manufacturing
10  10009   6030        Finance

Data

tbl1 <- structure(list(PERMNO = 10000:10009, HSICCD = c(3990L, 4925L, 
6020L, 6020L, 5330L, 1310L, 3743L, 7370L, 3430L, 6030L), Industry = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c("1", "2", 
"3", "4", "5", "6", "7", "8", "9", "10"), class = "data.frame")

tbl2 <- structure(list(from = c(100L, 1000L, 1500L, 1800L, 2000L, 4000L, 
5000L, 5200L, 6000L, 7000L, 9100L, 9900L), to = c(999L, 1499L, 
1799L, 1999L, 3999L, 4999L, 5199L, 5999L, 6799L, 8999L, 9729L, 
9999L), division = c("Agriculture", "Mining", "Construction", 
"Other", "Manufacturing", "Transportation", "Wholesale", "Retail", 
"Finance", "Services", "Public", "Other")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

CodePudding user response:

I I understand correctly: We could try 2 steps:

  1. Use fuzzy_inner_join to check and join whether HSICCD values are in the range from to to
  2. left_join by HSICCD and clean results
library(dplyr)
library(fuzzyjoin)
result <-  fuzzy_inner_join(Table1, Table2,
                            by = c(HSICCD = 'from', HSICCD = 'to'),
                            match_fun = list(`>=`, `<=`))

left_join(Table1, result, by="HSICCD") %>% 
  select(PERMNO=PERMNO.x, HSICCD, Industry=division)

   PERMNO HSICCD       Industry
1   10000   3990  Manufacturing
2   10001   4925 Transportation
3   10002   6020        Finance
4   10002   6020        Finance
5   10003   6020        Finance
6   10003   6020        Finance
7   10004   5330         Retail
8   10005   1310         Mining
9   10006   3743  Manufacturing
10  10007   7370       Services
11  10008   3430  Manufacturing
12  10009   6030        Finance

CodePudding user response:

You can use ifelse():

for(i in 1:(dim(table2)[1])){
  table1$Industry <- ifelse((table1$HSICCD>table2$from[i] & table1$HSICCD<table2$to[i]), 
                            table2$division[i], 
                            table1$Industry)
}

CodePudding user response:

You could pivot_longer table 2, then complete() so that your Id goes from 1:n and left join to table 1

  • Related