Home > OS >  Regex comma use in data cleaning with R
Regex comma use in data cleaning with R

Time:03-10

From one of my previous questions (Creating adjacency matrix with dirty dataset), I was able to clean nearly all of my data. Thank you, you brilliant coders. However, as I am trying to learn how the "playground" works, I continue to run into a comma issue.

Dataset originally looks like -

Species    Association                  Year
<fctr>     <chr>                        <dbl>
1   RC     SKS/BW                       NA  
2   BW     Sykes, rc                    NA
3   SKS    Babo/bw                      NA
4   RC     baboon, mangabey             NA
5   Mang   red colobus, bw, sykes       NA
6   SKS    babo/red duiker              NA
11  BW     r/c monkeys                  12
21  RC     b/w colobus                  12
31  SKS    b/w colobus/R/c monkeys      12
41  BW     sykes/R/c monkeys            12
51  RC     sykes/b/w colobus            12
61  BABO   -                            12
7   SKS    -                            12
8   RC     -                            12
9   SKS    r/c monkeys                  12
10  RC     sykes monkeys                12
53  BW     sykes,b/w colobus            12
57  BW     r/c monkeys,bw               12
58  Mang   sykes,R/c monkeys            12

Dput -

dat <- structure(list(Species = c("RC", "BW", "SKS", "RC", "Mang", "SKS", 
"BW", "RC", "SKS", "BW", "RC", "BABO", "SKS", "RC", "SKS", "RC", "BW", "BW", "Mang"
), Association = c("SKS/BW", "Sykes, rc", "Babo/bw", "baboon, mangabey", 
"red colobus, bw, sykes", "babo/red duiker", "r/c monkeys", "b/w colobus", 
"b/w colobus/R/c monkeys", "sykes/R/c monkeys", "sykes/b/w colobus", 
".", ".", ".", "r/c monkeys", "sykes monkeys", "sykes,b/w colobus", "r/c monkeys,bw", "sykes,R/c monkeys"), year = c(NA, NA, NA, NA, NA, NA, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12)), row.names = c("1", "2", "3", "4", "5", "6", "11", "21", "31", "41", "51", "61", "7", "8", "9", "10", "53", "57", "58"), class = "data.frame")

For cleaning, I created a dictionary and then used regex to capture all of the variation in the association column except the last three rows since they are separated by a ',' and not '/'

dict <- read.table(header=TRUE, text='
from  to
"BABO"  BABO
"yellow baboon"  BABO
"BW"  BW
"bw colobus" BW
"Bw" BW
"bw" BW
"Bw colobus" BW
"B/W COLOBUS" BW
"RC"  RC
"RED COLOBUS"  RC
"rc monkeys" RC
"Red colobus" RC
"R/C MONKEYS" RC
"Rc monkeys" RC
"MANGABEY"  MANG
"MANGA" MANG
"mangabeys" MANG
"SKS"  SKS
"SYKES"  SKS
"SYKES MONKEYS" SKS
"sykes" SKS
"SYKES MONKEY" SKS
"RED DUIKER"  RD
"Red duiker" RD
"Red Duiker   V . Fresh dung" RD
')

regex <- '(?<=\\w{2})\\/|,\\s'

spf <- "%s"

data.frame(from=
             sprintf(spf, 
                     sort(unique(unlist(
                       strsplit(toupper(dat$Association), regex, perl=TRUE)))))) |> 
                       print(row.names=FALSE)

res <- strsplit(toupper(dat$Association), regex, perl=TRUE) |>
  lapply(\(x) dict[match(x, dict$from), ]$to) |>
  sapply(toString) |>
  {\(.) replace(., . == ".", NA)}() |>
  data.frame('Protected', as.factor(toupper(dat$Species)), dat$year) |>
  setNames(c('association', 'site', 'species', 'year')) |>
  subset(select=c(3, 1, 2, 4))

Giving me a final data frame of -

Species    Association       Site           Year
<fctr>     <chr>             <chr>          <dbl>
1   RC     SKS, BW           Protected      NA
2   BW     SKS, RC           Protected      NA
3   SKS    BABO, BW          Protected      NA
4   RC     BABO, MANG        Protected      NA
5   MANG   RC, BW, SKS       Protected      NA
6   SKS    BABO, RD          Protected      NA
7   BW     RC                Protected      12
8   RC     BW                Protected      12
9   SKS    BW, RC            Protected      12
10  BW     SKS, RC           Protected      12
11  RC     SKS, BW           Protected      12
12  BABO   NA                Protected      12
13  SKS    NA                Protected      12
14  RC     NA                Protected      12
15  SKS    RC                Protected      12
16  RC     SKS               Protected      12
17  BW     NA                Protected      12
18  BW     NA                Protected      12
19  MANG   NA                Protected      12

I want to include the last three rows to read the correct association (i.e. SKS, BW; RC, BW; SKS, RC), but everything I am reading about regex has comma use as part of the expression and not as part of what you're finding in the string. Is there a way to include it so it will give the correct output? I am still very new to regex and pretty new to R. Any help is greatly appreciated.

CodePudding user response:

The issue lies with your dictionary. Use tidyverse as shown below:

library(tidyverse)
 dict1 <- dict %>%
  add_row(from = 'BABOON', to = 'BABO') %>%
  add_row(from='.', to = NA) %>%
  add_row(from = '/', to = ',')
  mutate(from = toupper(from))%>%
  distinct() %>%
  arrange(desc(nchar(from)))

dat %>%
  mutate(Association = str_replace_all(toupper(Association), 
                              fixed(setNames(dict1$to, dict1$from))),
         Site = 'Protected')


  Species Association year      Site
1       RC      SKS,BW   NA Protected
2       BW     SKS, RC   NA Protected
3      SKS     BABO,BW   NA Protected
4       RC  BABO, MANG   NA Protected
5     Mang RC, BW, SKS   NA Protected
6      SKS     BABO,RD   NA Protected
11      BW          RC   12 Protected
21      RC          BW   12 Protected
31     SKS       BW,RC   12 Protected
41      BW      SKS,RC   12 Protected
51      RC      SKS,BW   12 Protected
61    BABO        <NA>   12 Protected
7      SKS        <NA>   12 Protected
8       RC        <NA>   12 Protected
9      SKS          RC   12 Protected
10      RC SKS MONKEYS   12 Protected
53      BW      SKS,BW   12 Protected
57      BW       RC,BW   12 Protected
58    Mang      SKS,RC   12 Protected
  • Related