How can I remove strings that are not followed by numbers?
For example, I am working with string data like the following:
String <- c("NA; ab 1917; ajr 69; sb 700; sb 703; scarl m; ab 1672 a",
"ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a",
"ab 3 16 31 41 1134 1206 abuht",
"ab 479 862 984 1626 asc")
df <- data.frame(String)
I would like the output to look like the following:
Output <- c("NA; ab 1917; ajr 69; sb 700; sb 703;; ab 1672",
"ab 18 sb 5 ab 1433 ab 1129 ab 184 ab 473",
"ab 3 16 31 41 1134 1206",
"ab 479 862 984 1626")
df <- data.frame(String, Output)
Thank you so much for your help!
CodePudding user response:
Sorry, I couldn't post a comment, so I am putting my incomplete code here.
I agree with Chris's opinion.
I focused on the first line of "Output" and tried using ";" as the separator.
If you also want to use " " (white space) as a separator, just modify the code.
String <- c("NA; ab 1917; ajr 69; sb 700; sb 703; scarl m; ab 1672 a",
"ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a",
"ab 3 16 31 41 1134 1206 abuht",
"ab 479 862 984 1626 asc")
res<-c()
for(str in String){
hoge<-strsplit(str, ";")[[1]]
res<-c(res, paste(hoge[grep("\\d|NA", hoge)], collapse=";"))
}
# ** this result is insufficient **
data.frame(res)
res
1 NA; ab 1917; ajr 69; sb 700; sb 703; ab 1672 a
2 ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a
3 ab 3 16 31 41 1134 1206 abuht
4 ab 479 862 984 1626 asc
If you edit your question, I think kind contributors will help you.
CodePudding user response:
First let's determine the regex:
succeed_num_regex = "(( )?.+[0-9]+)+"
The meaning:
( )? : we allow (but don't require) a space at the beginning.
.+ : some amount of free text (this is the "string" that is to be succeeded by a number).
[0-9]+ : this is the number.
The whole thing is enclosed in ()+, meaning that we are looking for this pattern to repeat one or more times.
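To see what the pattern does on individual pieces before wiring it into a pipeline, here is a minimal base R sketch. It assumes the pattern reads "(( )?.+[0-9]+)+" (the "+" quantifiers are my reading of it), and the names pieces, hits and extracted are only for illustration.

```r
# Assumed reading of the pattern; the "+" quantifiers are an assumption.
succeed_num_regex <- "(( )?.+[0-9]+)+"

# A few pieces taken from the example data, already split on "; ".
pieces <- c("NA", "ab 1917", "scarl m", "ab 1672 a", "ab 18 sb 5 ab 1433 hdge")

# regexpr() returns -1 where nothing matches; leave those pieces as NA.
hits <- regexpr(succeed_num_regex, pieces, perl = TRUE)
extracted <- rep(NA_character_, length(pieces))
extracted[hits > 0] <- regmatches(pieces, hits)

print(extracted)
```

Because .+ is greedy, each match runs up to the last digit in the piece, so trailing text such as " a" or " hdge" is dropped, and pieces with no digit at all ("NA", "scarl m") do not match.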
Now we can put this in code:
library(tidyverse)
String %>%
str_split("; ") %>%
map(map_chr, str_extract, pattern = succeed_num_regex) %>%
# Strings that did not have this pattern at all will be NA
# We replace them here with ""
map(map_chr, function(x) ifelse(is.na(x), "", x)) %>%
# Put it all back together
map_chr(paste, collapse = "; ")
[1] "; ab 1917; ajr 69; sb 700; sb 703; ; ab 1672"
[2] "ab 18 sb 5 ab 1433; ab 1129 ab 184 ab 473"
[3] "ab 3 16 31 41 1134 1206"
[4] "ab 479 862 984 1626"
Some notes:
In your output, you kept "NA" instead of replacing it with "", which is what happened to "scarl m". This can be added as a rule to the solution, but for now I did not add it because it is not consistent with your requirements.
In your output, the second result "ab 18 sb 5 ab 1433 ab 1129 ab 184 ab 473" is missing a semicolon after 1433. If that was not a mistake, then please explain why.
In your output, we have "sb 703;;" whereas my output has "sb 703; ;". This keeps the results consistently pasted with "; ". Let me know if this is problematic (I left it as is since that isn't a clear requirement either).
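If keeping a literal "NA" piece turns out to be what you want, the extra rule could look roughly like this. This is only a sketch in base R rather than the tidyverse pipeline above; keep_numbered is a hypothetical helper name, and the pattern is assumed to read "(( )?.+[0-9]+)+".

```r
String <- c("NA; ab 1917; ajr 69; sb 700; sb 703; scarl m; ab 1672 a",
            "ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a",
            "ab 3 16 31 41 1134 1206 abuht",
            "ab 479 862 984 1626 asc")

# Assumed reading of the pattern used in this answer.
succeed_num_regex <- "(( )?.+[0-9]+)+"

# Hypothetical helper: clean one "; "-separated string.
keep_numbered <- function(s) {
  pieces <- strsplit(s, "; ", fixed = TRUE)[[1]]
  hits <- regexpr(succeed_num_regex, pieces, perl = TRUE)
  out <- rep("", length(pieces))          # pieces without a match become ""
  out[hits > 0] <- regmatches(pieces, hits)
  out[pieces == "NA"] <- "NA"             # extra rule: keep a literal "NA" piece
  paste(out, collapse = "; ")
}

result <- vapply(String, keep_numbered, character(1), USE.NAMES = FALSE)
print(result)
```

The only change from the approach above is the single line that restores "NA"; everything else is still pasted back together with "; ".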
CodePudding user response:
Using an editor like VS Code or Notepad++, I can match this:
(\s)([a-z][a-z][a-z]+)
and replace it with this:
$1
Your requirement is confusing: according to your output, 'ajr' is not supposed to be matched, while 'asc' is. My hack above matches both 'ajr' and 'asc'.
A breakdown of my hack:
(\s) matches the whitespace before the group of letters. I noticed that you only want to match groups of letters found after a space.
([a-z][a-z][a-z]+) matches groups of three or more letters (as I noticed that you do not want to match two-letter groups).
$1 replaces the whole match with just the captured whitespace, so the letters are removed.
I hope it helps. You can translate this into the programming language you are using.
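As one possible translation into R (the language used in the question), the same find-and-replace could be sketched with gsub(). This assumes the search pattern is whitespace followed by three or more lowercase letters; note that R writes the backreference as \\1 rather than $1, and the variable name cleaned is just for illustration.

```r
String <- c("NA; ab 1917; ajr 69; sb 700; sb 703; scarl m; ab 1672 a",
            "ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a",
            "ab 3 16 31 41 1134 1206 abuht",
            "ab 479 862 984 1626 asc")

# Whitespace followed by 3+ lowercase letters; keep only the whitespace.
cleaned <- gsub("(\\s)([a-z][a-z][a-z]+)", "\\1", String, perl = TRUE)

# The whitespace before each removed word is kept, so trim trailing spaces.
cleaned <- trimws(cleaned)
print(cleaned)
```

As with the editor version, this removes 'ajr' as well as 'asc', and it leaves one-letter leftovers such as the trailing "a" untouched, so it is a starting point rather than an exact match for the requested output.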