I have two data frames: one (called df_persons) with records that have unique person_id's but non-unique stratum_id's, and one (called df_population) with those same stratum_id's, including many duplicate rows of them. Code to recreate them below:
df_persons = data.frame(person_id = c(101, 102, 103), stratum_id = c(1, 2, 1))
df_population = data.frame(stratum_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3))
Now I would like a way to merge the data from df_persons into df_population, so that every row of df_persons gets merged with the first matching row of df_population (key = stratum_id) that has not been matched previously. The desired result is below:
# manual way to merge first available match
df_population$person = c(101, 103, NA, NA, 102, NA, NA, NA, NA, NA)
I wrote a loop for this that works (see below). The problem is that df_persons is 83,000 records long and df_population is 13 million records long, so the loop takes far longer than my PC can handle.
# create empty person column in df_population
df_population$person = NA
# order both df's to speed up
df_population = df_population[order(df_population$stratum_id),]
df_persons = df_persons[order(df_persons$stratum_id),]
# loop through all persons in df_persons, and for each find the first available match
for(i_person in 1:nrow(df_persons))
{
  match = F
  i_pop = 0
  while(!match)
  {
    i_pop = i_pop + 1
    if(df_population$stratum_id[i_pop] == df_persons$stratum_id[i_person] & is.na(df_population$person[i_pop]))
    {
      match = T
      df_population$person[i_pop] = df_persons$person_id[i_person]
    }
  }
}
Any help to make this a lot faster would be much appreciated. I have looked into the data.table package, to no avail so far, but I do think I will need to move away from looping in order to get this to run.
CodePudding user response:
Here is a data.table approach. More explanation is in the code comments.
library(data.table)
# make them data.table
setDT(df_persons)
setDT(df_population)
# create dummy values to join on
df_persons[, id := rowid(stratum_id)]
df_population[, id := rowid(stratum_id)]
# join by reference
df_population[df_persons, person_id := i.person_id, on = .(stratum_id, id)][]
# drop the dummy id column
df_population[, id := NULL][]
# stratum_id person_id
# 1: 1 101
# 2: 1 103
# 3: 1 NA
# 4: 1 NA
# 5: 2 102
# 6: 2 NA
# 7: 2 NA
# 8: 2 NA
# 9: 3 NA
#10: 3 NA
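The key idea is that rowid(stratum_id) numbers the rows within each stratum, so the pair (stratum_id, id) is a unique key in both tables, and the join pairs the n-th person of a stratum with the n-th population row of that stratum. Since := updates df_population by reference, the 13-million-row table is never copied. A minimal illustration of rowid():
# the n-th occurrence of each value gets sequence number n
rowid(c(1, 1, 1, 1, 2, 2))
# [1] 1 2 3 4 1 2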
CodePudding user response:
Simply use pmatch as shown below:
df_population$person_id <- df_persons$person_id[pmatch(df_population$stratum_id, df_persons$stratum_id)]
df_population
stratum_id person_id
1 1 101
2 1 103
3 1 NA
4 1 NA
5 2 102
6 2 NA
7 2 NA
8 2 NA
9 3 NA
10 3 NA
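This works because pmatch uses each element of its second argument at most once (its duplicates.ok argument defaults to FALSE), which is exactly the "first available match" behaviour asked for. A small illustration:
# repeated values in the first vector walk through the remaining
# matches in the second, then fall back to NA once all are used
pmatch(c(1, 1, 1), c(1, 2, 1))
# [1] 1 3 NA
One caveat: pmatch coerces both arguments to character and also attempts partial (prefix) matching, so on 13 million rows it may be much slower than the join-based answers; benchmarking on a sample first is advisable.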
CodePudding user response:
1) dplyr: Using dplyr, add a sequence number to each data frame and then merge them:
library(dplyr)
df_population %>%
group_by(stratum_id) %>%
mutate(seq = 1:n()) %>%
ungroup %>%
left_join(df_persons %>% group_by(stratum_id) %>% mutate(seq = 1:n()))
giving:
Joining, by = c("stratum_id", "seq")
# A tibble: 10 x 3
stratum_id seq person_id
<dbl> <int> <dbl>
1 1 1 101
2 1 2 103
3 1 3 NA
4 1 4 NA
5 2 1 102
6 2 2 NA
7 2 3 NA
8 2 4 NA
9 3 1 NA
10 3 2 NA
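A note on the join: passing by = c("stratum_id", "seq") to left_join() makes the key explicit and silences the Joining, by = ... message; row_number() is an equivalent, more idiomatic spelling of 1:n() inside a grouped mutate().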
2) Base R: The same idea in base R:
p1 <- transform(df_population, seq = ave(stratum_id, stratum_id, FUN = seq_along))
p2 <- transform(df_persons, seq = ave(stratum_id, stratum_id, FUN = seq_along))
merge(p1, p2, all.x = TRUE, all.y = FALSE)
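merge() picks the join columns automatically as the ones the two data frames have in common (here stratum_id and seq), and with the default sort = TRUE the result comes back ordered by those keys, matching the dplyr output above. The key can also be spelled out explicitly:
merge(p1, p2, by = c("stratum_id", "seq"), all.x = TRUE, all.y = FALSE)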
3) sqldf: In SQL we have the following. The dbname= argument causes the processing to be done outside of R; if you have sufficient memory it could be omitted, in which case the processing uses memory within R.
library(sqldf)
seqno <- "sum(1) over (partition by stratum_id rows unbounded preceding)"
fn$sqldf("
with
p1 as (select *, $seqno seq from df_population),
p2 as (select *, $seqno seq from df_persons)
select * from p1 left join p2 using (stratum_id, seq)
", dbname = tempfile())