From one of my previous questions (Creating adjacency matrix with dirty dataset), I was able to clean nearly all of my data. Thank you, you brilliant coders. However, as I am trying to learn how the "playground" works, I continue to run into a comma issue.
Dataset originally looks like -
Species Association Year
<fctr> <chr> <dbl>
1 RC SKS/BW NA
2 BW Sykes, rc NA
3 SKS Babo/bw NA
4 RC baboon, mangabey NA
5 Mang red colobus, bw, sykes NA
6 SKS babo/red duiker NA
11 BW r/c monkeys 12
21 RC b/w colobus 12
31 SKS b/w colobus/R/c monkeys 12
41 BW sykes/R/c monkeys 12
51 RC sykes/b/w colobus 12
61 BABO - 12
7 SKS - 12
8 RC - 12
9 SKS r/c monkeys 12
10 RC sykes monkeys 12
53 BW sykes,b/w colobus 12
57 BW r/c monkeys,bw 12
58 Mang sykes,R/c monkeys 12
Dput -
dat <- structure(list(Species = c("RC", "BW", "SKS", "RC", "Mang", "SKS",
"BW", "RC", "SKS", "BW", "RC", "BABO", "SKS", "RC", "SKS", "RC", "BW", "BW", "Mang"
), Association = c("SKS/BW", "Sykes, rc", "Babo/bw", "baboon, mangabey",
"red colobus, bw, sykes", "babo/red duiker", "r/c monkeys", "b/w colobus",
"b/w colobus/R/c monkeys", "sykes/R/c monkeys", "sykes/b/w colobus",
".", ".", ".", "r/c monkeys", "sykes monkeys", "sykes,b/w colobus", "r/c monkeys,bw", "sykes,R/c monkeys"), year = c(NA, NA, NA, NA, NA, NA, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12)), row.names = c("1", "2", "3", "4", "5", "6", "11", "21", "31", "41", "51", "61", "7", "8", "9", "10", "53", "57", "58"), class = "data.frame")
For cleaning, I created a dictionary and then used regex to capture all of the variation in the association column except the last three rows since they are separated by a ',' and not '/'
dict <- read.table(header=TRUE, text='
from to
"BABO" BABO
"yellow baboon" BABO
"BW" BW
"bw colobus" BW
"Bw" BW
"bw" BW
"Bw colobus" BW
"B/W COLOBUS" BW
"RC" RC
"RED COLOBUS" RC
"rc monkeys" RC
"Red colobus" RC
"R/C MONKEYS" RC
"Rc monkeys" RC
"MANGABEY" MANG
"MANGA" MANG
"mangabeys" MANG
"SKS" SKS
"SYKES" SKS
"SYKES MONKEYS" SKS
"sykes" SKS
"SYKES MONKEY" SKS
"RED DUIKER" RD
"Red duiker" RD
"Red Duiker V . Fresh dung" RD
')
regex <- '(?<=\\w{2})\\/|,\\s'
spf <- "%s"
data.frame(from=
sprintf(spf,
sort(unique(unlist(
strsplit(toupper(dat$Association), regex, perl=TRUE)))))) |>
print(row.names=FALSE)
res <- strsplit(toupper(dat$Association), regex, perl=TRUE) |>
lapply(\(x) dict[match(x, dict$from), ]$to) |>
sapply(toString) |>
{\(.) replace(., . == ".", NA)}() |>
data.frame('Protected', as.factor(toupper(dat$Species)), dat$year) |>
setNames(c('association', 'site', 'species', 'year')) |>
subset(select=c(3, 1, 2, 4))
Giving me a final data frame of -
Species Association Site Year
<fctr> <chr> <chr> <dbl>
1 RC SKS, BW Protected NA
2 BW SKS, RC Protected NA
3 SKS BABO, BW Protected NA
4 RC BABO, MANG Protected NA
5 MANG RC, BW, SKS Protected NA
6 SKS BABO, RD Protected NA
7 BW RC Protected 12
8 RC BW Protected 12
9 SKS BW, RC Protected 12
10 BW SKS, RC Protected 12
11 RC SKS, BW Protected 12
12 BABO NA Protected 12
13 SKS NA Protected 12
14 RC NA Protected 12
15 SKS RC Protected 12
16 RC SKS Protected 12
17 BW NA Protected 12
18 BW NA Protected 12
19 MANG NA Protected 12
I want to include the last three rows to read the correct association (i.e. SKS, BW; RC, BW; SKS, RC), but everything I am reading about regex has comma use as part of the expression and not as part of what you're finding in the string. Is there a way to include it so it will give the correct output? I am still very new to regex and pretty new to R. Any help is greatly appreciated.
CodePudding user response:
The issue lies with your dictionary. Use tidyverse as shown below:
library(tidyverse)
dict1 <- dict %>%
add_row(from = 'BABOON', to = 'BABO') %>%
add_row(from='.', to = NA) %>%
add_row(from = '/', to = ',')
mutate(from = toupper(from))%>%
distinct() %>%
arrange(desc(nchar(from)))
dat %>%
mutate(Association = str_replace_all(toupper(Association),
fixed(setNames(dict1$to, dict1$from))),
Site = 'Protected')
Species Association year Site
1 RC SKS,BW NA Protected
2 BW SKS, RC NA Protected
3 SKS BABO,BW NA Protected
4 RC BABO, MANG NA Protected
5 Mang RC, BW, SKS NA Protected
6 SKS BABO,RD NA Protected
11 BW RC 12 Protected
21 RC BW 12 Protected
31 SKS BW,RC 12 Protected
41 BW SKS,RC 12 Protected
51 RC SKS,BW 12 Protected
61 BABO <NA> 12 Protected
7 SKS <NA> 12 Protected
8 RC <NA> 12 Protected
9 SKS RC 12 Protected
10 RC SKS MONKEYS 12 Protected
53 BW SKS,BW 12 Protected
57 BW RC,BW 12 Protected
58 Mang SKS,RC 12 Protected