Extract disallowed characters

Time:12-02

I have transcriptions with erroneous encodings, that is, characters that occur but should not occur.

In this toy data, the only allowed characters are this class:

"[)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"

df <- data.frame(
  Utterance = c("~°maybe you (.) >should ¥just¥<",
                "SOME text |<-- pipe¿ and€",            # <--: | and €
                "blah%",                                # <--: %
                "text ^more text",                      # <--: ^
                "£norm(hh)a::l£mal, (1.22)"))

What I need to do is:

  • detect Utterances that contain any wrong encodings
  • extract the wrong characters

I'm doing OK as far as detection is concerned but the extraction fails miserably:

library(stringr)
library(dplyr)
df %>%
  filter(!str_detect(Utterance, "[)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")) %>%
  mutate(WrongChar = str_extract_all(Utterance, "[^)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"))
                  Utterance                                  WrongChar
1 SOME text |<-- pipe¿ and€ SO, ME,  t, ex, |<, --,  p, ip, e¿,  a, nd
2                     blah%                                     bl, ah
3           text ^more text                     te, xt, ^m, or,  t, ex

How can the extraction be improved to obtain this expected result:

                  Utterance WrongChar
1 SOME text |<-- pipe¿ and€      |, €
2                     blah%         %
3           text ^more text         ^

Answer:

You need to

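  • Escape [ and ] inside the character class. An unescaped ] closes the class early, so your pattern is really two consecutive classes and every match is a pair of characters.
  • Add a whitespace pattern (\\s) to both regex checks. Without it, every space counts as a disallowed character.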
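To make the two points concrete, here is a minimal sketch on a single made-up string. The class is deliberately shortened to brackets, slash, letters and digits for readability, and the variable names are only illustrative:

```r
library(stringr)

x <- "blah %"

# Unescaped "]" ends the class early: "[^)(/]" followed by "[A-Za-z0-9]"
# is two classes in a row, so each match is a pair of characters.
two_classes <- str_extract_all(x, "[^)(/][A-Za-z0-9]")[[1]]
two_classes
# "bl" "ah"

# Escaped brackets give one negated class, but the space is still reported
# because whitespace is not part of the allowed set.
no_space <- str_extract_all(x, "[^)(/\\]\\[A-Za-z0-9]")[[1]]
no_space
# " " "%"

# Adding \\s to the class leaves only the genuinely disallowed character.
with_space <- str_extract_all(x, "[^\\s)(/\\]\\[A-Za-z0-9]")[[1]]
with_space
# "%"
```

The first result has the same shape as the pairs in your WrongChar column, which is why the extraction looked so far off.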

So you need to use

df %>%
   filter(str_detect(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")) %>%
   mutate(WrongChar = str_extract_all(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"))

Output:

                  Utterance WrongChar
1 SOME text |<-- pipe¿ and€      |, €
2                     blah%         %
3           text ^more text         ^

Note that I used positive logic in filter(str_detect(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")): the negated class matches any character outside the allowed set, so filter() keeps every row that contains at least one disallowed character.
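If you want to avoid typing the long class twice, one option is to store the allowed characters once and build the negated pattern from them. This is only a sketch: the names allowed, disallowed and result are my own, and collapsing the list column with toString() is an optional convenience so that WrongChar is a plain character column.

```r
library(stringr)
library(dplyr)

df <- data.frame(
  Utterance = c("~°maybe you (.) >should ¥just¥<",
                "SOME text |<-- pipe¿ and€",
                "blah%",
                "text ^more text",
                "£norm(hh)a::l£mal, (1.22)"))

# Allowed characters (whitespace included), written as the body of a class.
allowed <- "\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-"

# Negated class: matches any single character that is NOT allowed.
disallowed <- paste0("[^", allowed, "]")

result <- df %>%
  # keep rows that contain at least one disallowed character
  filter(str_detect(Utterance, disallowed)) %>%
  # extract all offenders and collapse each row's matches into one string
  mutate(WrongChar = vapply(str_extract_all(Utterance, disallowed),
                            toString, character(1)))

result
```

Keeping the class in one variable means detection and extraction cannot drift apart when the allowed set changes.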
