I have a data set but it's kinda weird. So I'm trying to fix it right.
a_name b_name latitude longitude min.distance
kangseo 37.1562 126.9762 73.950928
kangbook 38.1255 126.1952 128.179185
jongro 37.8226 127.1724 95.626161
dongmoon 37.1161 127.1345 12.531519
jamsil 36.1215 127.9722 175.154161
nolboo 37.3213 127.3211 85.151616
ilsan 37.1111 127.1331 16.115151
jongone 38.1241 126.1215 95.626161
jongtwo 37.1161 127.1345 12.531519
jongthree 37.1562 126.9762 73.950928
jongfour 37.1551 127.6262 17.124115
jongfive 36.1515 127.9627 175.154161
This data is sepearated into two parts and I want to make it into clean data.
a_name b_name latitude longitude min.distance
jongthree kangseo 37.1562 126.9762 73.950928
kangbook 38.1255 126.1952 128.179185
jongone jongro 38.1241 126.1215 95.626161
jongtwo dongmoon 37.1161 127.1345 12.531519
jongfive jamsil 36.1515 127.9627 175.154161
nolboo 37.3213 127.3211 85.151616
jongfour ilsan 37.1515 127.6262 17.124115
I tried left_join and merge but it didn't work well.
left_join(A,B, by="min.distance")
I need your help.
CodePudding user response:
You can use a full_join()
here:
full_join(
d %>% filter(is.na(b_name)) %>% select(-b_name),
d %>% filter(is.na(a_name)) %>% select(b_name, min.distance),
) %>% relocate(b_name, .before=latitude)
Another way to get the same result is to combine the filter()
steps into a single call to split()
d=split(d,is.na(d$a_name))
full_join(select(d[[1]], -b_name), select(d[[2]], c(b_name, min.distance)))
Output:
a_name b_name latitude longitude min.distance
1 jongone jongro 38.1241 126.1215 95.62616
2 jongtwo dongmoon 37.1161 127.1345 12.53152
3 jongthree kangseo 37.1562 126.9762 73.95093
4 jongfour <NA> 37.1551 127.6262 17.12411
5 jongfive jamsil 36.1515 127.9627 175.15416
6 <NA> kangbook NA NA 128.17918
7 <NA> nolboo NA NA 85.15162
8 <NA> ilsan NA NA 16.11515
Input:
d = structure(list(a_name = c(NA, NA, NA, NA, NA, NA, NA, "jongone",
"jongtwo", "jongthree", "jongfour", "jongfive"), b_name = c("kangseo",
"kangbook", "jongro", "dongmoon", "jamsil", "nolboo", "ilsan",
NA, NA, NA, NA, NA), latitude = c(37.1562, 38.1255, 37.8226,
37.1161, 36.1215, 37.3213, 37.1111, 38.1241, 37.1161, 37.1562,
37.1551, 36.1515), longitude = c(126.9762, 126.1952, 127.1724,
127.1345, 127.9722, 127.3211, 127.1331, 126.1215, 127.1345, 126.9762,
127.6262, 127.9627), min.distance = c(73.950928, 128.179185,
95.626161, 12.531519, 175.154161, 85.151616, 16.115151, 95.626161,
12.531519, 73.950928, 17.124115, 175.154161)), class = "data.frame", row.names = c(NA,
-12L))
CodePudding user response:
We can group by 'min.distance' and modify the NA
elements to the non-NA values and get the distinct
rows without any need for joining
library(dplyr)
df1 %>%
group_by(min.distance) %>%
mutate(across(c(a_name, b_name), ~ (.x[!is.na(na_if(.x, ""))][1]))) %>%
ungroup %>%
distinct(min.distance, .keep_all = TRUE)
-output
# A tibble: 8 × 5
a_name b_name latitude longitude min.distance
<chr> <chr> <dbl> <dbl> <dbl>
1 jongthree kangseo 37.2 127. 74.0
2 <NA> kangbook 38.1 126. 128.
3 jongone jongro 37.8 127. 95.6
4 jongtwo dongmoon 37.1 127. 12.5
5 jongfive jamsil 36.1 128. 175.
6 <NA> nolboo 37.3 127. 85.2
7 <NA> ilsan 37.1 127. 16.1
8 jongfour <NA> 37.2 128. 17.1
data
df1 <- structure(list(a_name = c(NA, NA, NA, NA, NA, NA, NA, "jongone",
"jongtwo", "jongthree", "jongfour", "jongfive"), b_name = c("kangseo",
"kangbook", "jongro", "dongmoon", "jamsil", "nolboo", "ilsan",
NA, NA, NA, NA, NA), latitude = c(37.1562, 38.1255, 37.8226,
37.1161, 36.1215, 37.3213, 37.1111, 38.1241, 37.1161, 37.1562,
37.1551, 36.1515), longitude = c(126.9762, 126.1952, 127.1724,
127.1345, 127.9722, 127.3211, 127.1331, 126.1215, 127.1345, 126.9762,
127.6262, 127.9627), min.distance = c(73.950928, 128.179185,
95.626161, 12.531519, 175.154161, 85.151616, 16.115151, 95.626161,
12.531519, 73.950928, 17.124115, 175.154161)), class = "data.frame",
row.names = c(NA,
-12L))
CodePudding user response:
If don't want to rely on exact matching of min.distance
values, you can try a true spatial matching (and filter with some tolerance for distance discrepancies) like so:
library(sf) ## library for spatial processing
split df into a list of two dataframes, depending on the is.na()
status of a_name
, and name the list members as A and B:
df_list <- setNames(
df1 |> split(is.na(df1$a_name)),
c('A', 'B')
)
convert both dataframes into spatial dataframes (containing a geometry column
) in one go (by mapping the list):
df_list <-
df_list |>
map( ~ st_as_sf(.x, coords = c('latitude', 'longitude'))
)
find the nearest feature (row) of B for each row of A, and column-bind this row to A:
df_list$A |>
bind_cols(
df_list$B |>
slice(st_nearest_feature(df_list$A, df_list$B))
) |>
## select (only) and rename desired columns):
select('a_name' = 1, 'b_name' = 6,
'a_min_dist' = 3, 'b_min_dist' = 7
) |>
mutate(mutual_distance = abs(a_min_dist - b_min_dist))
output
## a_name b_name a_min_dist b_min_dist mutual_distance
## 1 jongone kangbook 95.62616 128.17918 32.55302
## 2 jongtwo dongmoon 12.53152 12.53152 0.00000
## 3 jongthree kangseo 73.95093 73.95093 0.00000
## 4 jongfour nolboo 17.12411 85.15162 68.02750
## 5 jongfive jamsil 175.15416 175.15416 0.00000
Now you can filter by acceptable mutual_distance
.