I want to join "Division" from table2 to "Industry" in table1. To do so I will have to match the HSICCD from tablet1 that is between "from" and "to" in table2.
Just to be clear: If HSICCD in table1 is between from/to in table2, I want to take value from table2$division and add it to table1$industry(Or a new column like in a join).
Is there a join function in R that does this without too much hassle? (dplyr solutions are desired, but I'm glad for every contribution!)
**Table1:**
PERMNO HSICCD Industry
<dbl> <dbl> <lgl>
1 10000 3990 NA
2 10001 4925 NA
3 10002 6020 NA
4 10003 6020 NA
5 10004 5330 NA
6 10005 1310 NA
7 10006 3743 NA
8 10007 7370 NA
9 10008 3430 NA
10 10009 6030 NA
**Table2:**
from to division
<dbl> <dbl> <chr>
1 100 999 Agriculture
2 1000 1499 Mining
3 1500 1799 Construction
4 1800 1999 Other
5 2000 3999 Manufacturing
6 4000 4999 Transportation
7 5000 5199 Wholesale
8 5200 5999 Retail
9 6000 6799 Finance
10 7000 8999 Services
11 9100 9729 Public
12 9900 9999 Other
My only solution so far is this horrendous code:
Compustat_identifiers$Industry <- NA
for (hsiccd in 1:nrow(Compustat_identifiers)) {
for (SIC in 1:nrow(sic_table)) {
if (is.na(Compustat_identifiers$HSICCD[hsiccd]) == T) {
Compustat_identifiers$Industry[hsiccd] <- "Other"
} else if (Compustat_identifiers$HSICCD[hsiccd] >= sic_table$from[SIC] &
Compustat_identifiers$HSICCD[hsiccd] <= sic_table$to[SIC]) {
Compustat_identifiers$Industry[hsiccd] <- sic_table$division[SIC]
}
}
}
CodePudding user response:
With base R you can do this.
First collect HSICCD from tbl1
and match with the condition to tbl2$division
. Then simply put back into tbl1$Industry
.
tbl1$Industry <- sapply( tbl1$HSICCD,
function(x) tbl2[ apply( tbl2, 1,
function(y) x>y[1]&x<y[2] ),"division"] )
PERMNO HSICCD Industry
1 10000 3990 Manufacturing
2 10001 4925 Transportation
3 10002 6020 Finance
4 10003 6020 Finance
5 10004 5330 Retail
6 10005 1310 Mining
7 10006 3743 Manufacturing
8 10007 7370 Services
9 10008 3430 Manufacturing
10 10009 6030 Finance
Data
tbl1 <- structure(list(PERMNO = 10000:10009, HSICCD = c(3990L, 4925L,
6020L, 6020L, 5330L, 1310L, 3743L, 7370L, 3430L, 6030L), Industry = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10"), class = "data.frame")
tbl2 <- structure(list(from = c(100L, 1000L, 1500L, 1800L, 2000L, 4000L,
5000L, 5200L, 6000L, 7000L, 9100L, 9900L), to = c(999L, 1499L,
1799L, 1999L, 3999L, 4999L, 5199L, 5999L, 6799L, 8999L, 9729L,
9999L), division = c("Agriculture", "Mining", "Construction",
"Other", "Manufacturing", "Transportation", "Wholesale", "Retail",
"Finance", "Services", "Public", "Other")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
CodePudding user response:
I I understand correctly: We could try 2 steps:
- Use
fuzzy_inner_join
to check and join whetherHSICCD
values are in the rangefrom
toto
left_join
byHSICCD
and clean results
library(dplyr)
library(fuzzyjoin)
result <- fuzzy_inner_join(Table1, Table2,
by = c(HSICCD = 'from', HSICCD = 'to'),
match_fun = list(`>=`, `<=`))
left_join(Table1, result, by="HSICCD") %>%
select(PERMNO=PERMNO.x, HSICCD, Industry=division)
PERMNO HSICCD Industry
1 10000 3990 Manufacturing
2 10001 4925 Transportation
3 10002 6020 Finance
4 10002 6020 Finance
5 10003 6020 Finance
6 10003 6020 Finance
7 10004 5330 Retail
8 10005 1310 Mining
9 10006 3743 Manufacturing
10 10007 7370 Services
11 10008 3430 Manufacturing
12 10009 6030 Finance
CodePudding user response:
You can use ifelse()
:
for(i in 1:(dim(table2)[1])){
table1$Industry <- ifelse((table1$HSICCD>table2$from[i] & table1$HSICCD<table2$to[i]),
table2$division[i],
table1$Industry)
}
CodePudding user response:
You could pivot_longer table 2, then complete() so that your Id goes from 1:n and left join to table 1