Home > other >  How can I merge this unbalanced data in r?
How can I merge this unbalanced data in r?

Time:12-20

I have a data set but it's kinda weird. So I'm trying to fix it right.

a_name     b_name         latitude      longitude        min.distance
           kangseo        37.1562        126.9762         73.950928
           kangbook       38.1255        126.1952         128.179185
           jongro         37.8226        127.1724         95.626161
           dongmoon       37.1161        127.1345         12.531519
           jamsil         36.1215        127.9722         175.154161
           nolboo         37.3213        127.3211         85.151616
           ilsan          37.1111        127.1331         16.115151
jongone                   38.1241        126.1215         95.626161
jongtwo                   37.1161        127.1345         12.531519
jongthree                 37.1562        126.9762         73.950928
jongfour                  37.1551        127.6262         17.124115
jongfive                  36.1515        127.9627         175.154161

This data is sepearated into two parts and I want to make it into clean data.

a_name     b_name         latitude      longitude        min.distance
jongthree  kangseo        37.1562        126.9762         73.950928
           kangbook       38.1255        126.1952         128.179185
jongone    jongro         38.1241        126.1215         95.626161
jongtwo   dongmoon        37.1161        127.1345         12.531519
jongfive   jamsil         36.1515        127.9627         175.154161
           nolboo         37.3213        127.3211         85.151616
jongfour   ilsan          37.1515        127.6262         17.124115
                                             

I tried left_join and merge but it didn't work well.

left_join(A,B, by="min.distance")

I need your help.

CodePudding user response:

You can use a full_join() here:

full_join(
  d %>% filter(is.na(b_name)) %>% select(-b_name),
  d %>% filter(is.na(a_name)) %>% select(b_name, min.distance),
) %>% relocate(b_name, .before=latitude)

Another way to get the same result is to combine the filter() steps into a single call to split()

d=split(d,is.na(d$a_name))
full_join(select(d[[1]], -b_name), select(d[[2]], c(b_name, min.distance)))

Output:

     a_name   b_name latitude longitude min.distance
1   jongone   jongro  38.1241  126.1215     95.62616
2   jongtwo dongmoon  37.1161  127.1345     12.53152
3 jongthree  kangseo  37.1562  126.9762     73.95093
4  jongfour     <NA>  37.1551  127.6262     17.12411
5  jongfive   jamsil  36.1515  127.9627    175.15416
6      <NA> kangbook       NA        NA    128.17918
7      <NA>   nolboo       NA        NA     85.15162
8      <NA>    ilsan       NA        NA     16.11515

Input:

d = structure(list(a_name = c(NA, NA, NA, NA, NA, NA, NA, "jongone", 
"jongtwo", "jongthree", "jongfour", "jongfive"), b_name = c("kangseo", 
"kangbook", "jongro", "dongmoon", "jamsil", "nolboo", "ilsan", 
NA, NA, NA, NA, NA), latitude = c(37.1562, 38.1255, 37.8226, 
37.1161, 36.1215, 37.3213, 37.1111, 38.1241, 37.1161, 37.1562, 
37.1551, 36.1515), longitude = c(126.9762, 126.1952, 127.1724, 
127.1345, 127.9722, 127.3211, 127.1331, 126.1215, 127.1345, 126.9762, 
127.6262, 127.9627), min.distance = c(73.950928, 128.179185, 
95.626161, 12.531519, 175.154161, 85.151616, 16.115151, 95.626161, 
12.531519, 73.950928, 17.124115, 175.154161)), class = "data.frame", row.names = c(NA, 
-12L))

CodePudding user response:

We can group by 'min.distance' and modify the NA elements to the non-NA values and get the distinct rows without any need for joining

library(dplyr)
df1 %>% 
   group_by(min.distance) %>%
   mutate(across(c(a_name, b_name), ~ (.x[!is.na(na_if(.x, ""))][1]))) %>% 
   ungroup %>%
   distinct(min.distance, .keep_all = TRUE)

-output

# A tibble: 8 × 5
  a_name    b_name   latitude longitude min.distance
  <chr>     <chr>       <dbl>     <dbl>        <dbl>
1 jongthree kangseo      37.2      127.         74.0
2 <NA>      kangbook     38.1      126.        128. 
3 jongone   jongro       37.8      127.         95.6
4 jongtwo   dongmoon     37.1      127.         12.5
5 jongfive  jamsil       36.1      128.        175. 
6 <NA>      nolboo       37.3      127.         85.2
7 <NA>      ilsan        37.1      127.         16.1
8 jongfour  <NA>         37.2      128.         17.1

data

df1 <- structure(list(a_name = c(NA, NA, NA, NA, NA, NA, NA, "jongone", 
"jongtwo", "jongthree", "jongfour", "jongfive"), b_name = c("kangseo", 
"kangbook", "jongro", "dongmoon", "jamsil", "nolboo", "ilsan", 
NA, NA, NA, NA, NA), latitude = c(37.1562, 38.1255, 37.8226, 
37.1161, 36.1215, 37.3213, 37.1111, 38.1241, 37.1161, 37.1562, 
37.1551, 36.1515), longitude = c(126.9762, 126.1952, 127.1724, 
127.1345, 127.9722, 127.3211, 127.1331, 126.1215, 127.1345, 126.9762, 
127.6262, 127.9627), min.distance = c(73.950928, 128.179185, 
95.626161, 12.531519, 175.154161, 85.151616, 16.115151, 95.626161, 
12.531519, 73.950928, 17.124115, 175.154161)), class = "data.frame", 
row.names = c(NA, 
-12L))

CodePudding user response:

If don't want to rely on exact matching of min.distance values, you can try a true spatial matching (and filter with some tolerance for distance discrepancies) like so:

library(sf) ## library for spatial processing

split df into a list of two dataframes, depending on the is.na() status of a_name, and name the list members as A and B:

df_list <- setNames(
  df1 |>  split(is.na(df1$a_name)),
  c('A', 'B')
)

convert both dataframes into spatial dataframes (containing a geometry column) in one go (by mapping the list):

df_list <- 
  df_list |> 
  map( ~ st_as_sf(.x, coords = c('latitude', 'longitude'))
)

find the nearest feature (row) of B for each row of A, and column-bind this row to A:

df_list$A |>
  bind_cols(
    df_list$B |>
    slice(st_nearest_feature(df_list$A, df_list$B))
  ) |>
## select (only) and rename desired columns):
select('a_name' = 1, 'b_name' = 6,
       'a_min_dist' = 3, 'b_min_dist' = 7
       ) |>
mutate(mutual_distance = abs(a_min_dist - b_min_dist))

output

##      a_name   b_name a_min_dist b_min_dist mutual_distance
## 1   jongone kangbook   95.62616  128.17918        32.55302
## 2   jongtwo dongmoon   12.53152   12.53152         0.00000
## 3 jongthree  kangseo   73.95093   73.95093         0.00000
## 4  jongfour   nolboo   17.12411   85.15162        68.02750
## 5  jongfive   jamsil  175.15416  175.15416         0.00000

Now you can filter by acceptable mutual_distance.

  • Related