Home > Blockchain >  Comparing two columns in two different dataframe and replacing values missing
Comparing two columns in two different dataframe and replacing values missing

Time:11-02

I have a project on R as I am learning it, but I am totally lost about this problem : I have two dataframes (crime2020 and crime2021) but in crime2021, the column Primary.Type is missing. Each possible value of Primary.Type is linked with a particular "ID" corresponding to the value in the column IUCR. The idea is as follow : as I have IUCR in both dataframes, I want to compare the IUCR of 2020 and 2021, and when there are equal, I want to put the value of Primary.Type in 2021 in the dataframe of 2020. I don't think it is clear so I will give a simple example :

Dataframe 2020 :

IUCR Primary.Type
0820 NA
0460 NA

Dataframe 2021 :

IUCR Primary.Type
0820 Theft
0789 Robbery
0460 Battery

So here, for the first column of 2020, where IUCR=0820, I want to test if a IUCR in 2021 is equal to 0820 and if yes, replace with the value in Primary.Type for this row, so Theft should be put in the first row of the colum Primary.Type of 2020. And I want to do it for each row of 2020. So I want to compare each row of 2020 with each row of 2021. The problem is that both dataframes have 200.000 rows...

I tried first for just the first row (i=1) of my 2020 dataframe with the function that follows :

Create<-function(i){
  for(y in 1:nrow(crime2021)){
    ifelse(crime2020[i,5]==crime2021[y,5],
           crime2020[i,22]<-crime2021[y,6],
           (for(x in 1:(nrow(crime2021)-y)){
           ifelse(crime2020[i,5]==crime2021[y x,5], crime2020[i,22]<-crime2021[y x,5], crime2020[i,22]<-"NA")}
             ))}}

with :

  • crime2020[i,5] = IUCR 2020
  • crime2021[y,5] = IUCR 2021
  • crime2020[i,22] = Primary.Type 2020
  • crime2020[y,6] = Primary.Type 2021

I don't know if it works because this function is too "heavy" so when I ran Create(1) it takes too long to and I had to stop it.

If someone have a solution it will really helps me !

Thanks you, and don't hesitate to ask for precisions, I don't know if it is very clear !

CodePudding user response:

Joins are a great way to do this. In this case, we want to use the IUCR and Primary.type columns from crime2021 to augment crime2020, which has IUCR but is missing Primary.type. We want crime2020 to keep the same number of rows it has now, so in joining we need to make sure each crime2020$IUCR is only joined to one crime2021$IUCR. To do this, we can use dplyr::distinct to limit crime2021 to unique combinations of those two variables, and to leave out all the other extraneous variables from the join.

library(dplyr)
crime2020 <- crime_2020 %>% 
  left_join(crime2021 %>% distinct(IUCR, Primary.Type))

One way this might not work as expected is if any crime2021$IUCR values are paired with more than one crime2021$Primary.Types -- the same multiple matchings would then appear in crime2020 after the join. So you might want to run distinct(crime2021, IUCR, Primary.Type) %>% count(IUCR, sort = TRUE) to see if any ambiguous pairings exist in crime2021.

CodePudding user response:

We can use full_join and coalesce from dplyr:

library(dpplyr)

full_join(crime_2020, crime_2021, by = 'IUCR') %>%
    mutate(Primary.Type = coalesce(Primary.Type.x, Primary.Type.y),
           .keep = 'unused')

# A tibble: 3 × 2
   IUCR Primary.Type
  <dbl> <chr>       
1   820 Theft       
2   460 Battery     
3   789 Robbery 

data

crime_2020 <- tibble(IUCR = c(0820, 0460),
                     Primary.Type = c(NA, NA))


crime_2021 <- tibble(IUCR = c(0820, 0789, 0460),
                     Primary.Type = c('Theft', 'Robbery', 'Battery'))
  • Related