I have transcriptions with erroneous encodings, that is, characters that occur but should not occur.
In this toy data, the only allowed characters are those in this class:
"[)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"
df <- data.frame(
  Utterance = c("~°maybe you (.) >should ¥just¥<",
                "SOME text |<-- pipe¿ and€", # <--: | and €
                "blah%",                     # <--: %
                "text ^more text",           # <--: ^
                "£norm(hh)a::l£mal, (1.22)"))
What I need to do is:
- detect Utterances that contain any wrong encodings
- extract the wrong characters
I'm doing OK as far as detection is concerned but the extraction fails miserably:
library(stringr)
library(dplyr)
df %>%
  filter(!str_detect(Utterance, "[)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")) %>%
  mutate(WrongChar = str_extract_all(Utterance, "[^)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"))
  Utterance                  WrongChar
1 SOME text |<-- pipe¿ and€  SO, ME, t, ex, |<, --, p, ip, e¿, a, nd
2 blah%                      bl, ah
3 text ^more text            te, xt, ^m, or, t, ex
How can the extraction be improved to obtain this expected result:
  Utterance                  WrongChar
1 SOME text |<-- pipe¿ and€  |, €
2 blah%                      %
3 text ^more text            ^
CodePudding user response:
You need to:
- Ensure the [ and ] are escaped inside a character class
- Add a whitespace pattern to both regex checks, as its absence is messing up your results.
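To see why both points matter, here is a small sketch that runs three variants of the pattern on strings from the question. The variable names (two_classes, no_whitespace, fixed) are only illustrative and do not come from the original code.

```r
library(stringr)

# Unescaped brackets: "[^)(/]" ends at the first "]", so the regex is really
# two character classes in a row, and every match is a pair of characters.
two_classes <- str_extract_all("blah%", "[^)(/][A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")
two_classes[[1]]
# "bl" "ah"

# Brackets escaped, but no \\s in the class: spaces count as wrong characters.
no_whitespace <- str_extract_all("text ^more text", "[^)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")
no_whitespace[[1]]
# " " "^" " "

# Brackets escaped and \\s added: only the genuinely disallowed character is left.
fixed <- str_extract_all("text ^more text", "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")
fixed[[1]]
# "^"
```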
So, you need to use:
df %>%
  filter(str_detect(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")) %>%
  mutate(WrongChar = str_extract_all(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"))
Output:
  Utterance                  WrongChar
1 SOME text |<-- pipe¿ and€  |, €
2 blah%                      %
3 text ^more text            ^
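Note that I used positive logic in filter(str_detect(Utterance, "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]")), that is, there is no ! in front of str_detect and the character class itself is negated with ^, so we get all items that contain at least one character other than an allowed one.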
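Since the same negated class is used for both detection and extraction, one option is to store it once and reuse it. This is only a sketch: the variable names disallowed and res are my own, and the data frame is rebuilt from the question so the snippet runs on its own.

```r
library(stringr)
library(dplyr)

df <- data.frame(
  Utterance = c("~°maybe you (.) >should ¥just¥<",
                "SOME text |<-- pipe¿ and€",
                "blah%",
                "text ^more text",
                "£norm(hh)a::l£mal, (1.22)"))

# Hypothetical name: one negated class matching any character that is
# neither whitespace nor one of the allowed characters.
disallowed <- "[^\\s)(/\\]\\[A-Za-z0-9↑↓£¥°!.,:¿?~<>≈=_-]"

res <- df %>%
  filter(str_detect(Utterance, disallowed)) %>%              # keep rows with at least one wrong character
  mutate(WrongChar = str_extract_all(Utterance, disallowed)) # list column of the wrong characters

res
```

Keeping the pattern in a single variable means any later change to the allowed set only has to be made in one place.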