I have a project on R as I am learning it, but I am totally lost about this problem : I have two dataframes (crime2020 and crime2021) but in crime2021, the column Primary.Type is missing. Each possible value of Primary.Type is linked with a particular "ID" corresponding to the value in the column IUCR. The idea is as follow : as I have IUCR in both dataframes, I want to compare the IUCR of 2020 and 2021, and when there are equal, I want to put the value of Primary.Type in 2021 in the dataframe of 2020. I don't think it is clear so I will give a simple example :
Dataframe 2020 :
IUCR | Primary.Type |
---|---|
0820 | NA |
0460 | NA |
Dataframe 2021 :
IUCR | Primary.Type |
---|---|
0820 | Theft |
0789 | Robbery |
0460 | Battery |
So here, for the first column of 2020, where IUCR=0820, I want to test if a IUCR in 2021 is equal to 0820 and if yes, replace with the value in Primary.Type for this row, so Theft should be put in the first row of the colum Primary.Type of 2020. And I want to do it for each row of 2020. So I want to compare each row of 2020 with each row of 2021. The problem is that both dataframes have 200.000 rows...
I tried first for just the first row (i=1) of my 2020 dataframe with the function that follows :
Create<-function(i){
for(y in 1:nrow(crime2021)){
ifelse(crime2020[i,5]==crime2021[y,5],
crime2020[i,22]<-crime2021[y,6],
(for(x in 1:(nrow(crime2021)-y)){
ifelse(crime2020[i,5]==crime2021[y x,5], crime2020[i,22]<-crime2021[y x,5], crime2020[i,22]<-"NA")}
))}}
with :
- crime2020[i,5] = IUCR 2020
- crime2021[y,5] = IUCR 2021
- crime2020[i,22] = Primary.Type 2020
- crime2020[y,6] = Primary.Type 2021
I don't know if it works because this function is too "heavy" so when I ran Create(1) it takes too long to and I had to stop it.
If someone have a solution it will really helps me !
Thanks you, and don't hesitate to ask for precisions, I don't know if it is very clear !
CodePudding user response:
Joins are a great way to do this. In this case, we want to use the IUCR
and Primary.type
columns from crime2021
to augment crime2020
, which has IUCR
but is missing Primary.type
. We want crime2020
to keep the same number of rows it has now, so in joining we need to make sure each crime2020$IUCR
is only joined to one crime2021$IUCR
. To do this, we can use dplyr::distinct
to limit crime2021
to unique combinations of those two variables, and to leave out all the other extraneous variables from the join.
library(dplyr)
crime2020 <- crime_2020 %>%
left_join(crime2021 %>% distinct(IUCR, Primary.Type))
One way this might not work as expected is if any crime2021$IUCR
values are paired with more than one crime2021$Primary.Type
s -- the same multiple matchings would then appear in crime2020
after the join. So you might want to run distinct(crime2021, IUCR, Primary.Type) %>% count(IUCR, sort = TRUE)
to see if any ambiguous pairings exist in crime2021
.
CodePudding user response:
We can use full_join
and coalesce
from dplyr
:
library(dpplyr)
full_join(crime_2020, crime_2021, by = 'IUCR') %>%
mutate(Primary.Type = coalesce(Primary.Type.x, Primary.Type.y),
.keep = 'unused')
# A tibble: 3 × 2
IUCR Primary.Type
<dbl> <chr>
1 820 Theft
2 460 Battery
3 789 Robbery
data
crime_2020 <- tibble(IUCR = c(0820, 0460),
Primary.Type = c(NA, NA))
crime_2021 <- tibble(IUCR = c(0820, 0789, 0460),
Primary.Type = c('Theft', 'Robbery', 'Battery'))