Remove a string that does not have a number after it?

How can I remove strings that are not followed by numbers?

For example, I am working with string data like the one below:

String <- c("NA; ab 1917; ajr 69; sb 700; sb 703; scarl m; ab 1672 a",
"ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a",
"ab 3 16 31 41 1134 1206 abuht",
"ab 479 862 984 1626 asc")

df <- data.frame(String)

I would like the output to look like the following

Output <- c("NA; ab 1917; ajr 69; sb 700; sb 703;; ab 1672",
"ab 18 sb 5 ab 1433 ab 1129 ab 184 ab 473",
"ab 3 16 31 41 1134 1206",
"ab 479 862 984 1626")

df <- data.frame(String, Output)

Thank you so much for your help!

CodePudding user response：

Sorry, I couldn't add a comment, so I am posting my incomplete code here as an answer.

I agree with Chris's opinion.

I focused on the first line of "Output" and tried using ";" as the separator.

If you also want to use " " (white space) as a separator, just modify the code.

String <- c("NA; ab 1917; ajr 69; sb 700; sb 703; scarl m; ab 1672 a",
"ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a",
"ab 3 16 31 41 1134 1206 abuht",
"ab 479 862 984 1626 asc")
res<-c()
for(str in String){
    hoge<-strsplit(str, ";")[[1]]
    res<-c(res, paste(hoge[grep("\\d|NA", hoge)], collapse=";"))
}
# ** this result is insufficient **
data.frame(res)
                                               res
1   NA; ab 1917; ajr 69; sb 700; sb 703; ab 1672 a
2 ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a
3                    ab 3 16 31 41 1134 1206 abuht
4                          ab 479 862 984 1626 asc

If you edit your question, I think kind contributors will help you.

CodePudding user response：

First let's determine the regex:

succeed_num_regex = "(( )?.+ [0-9]+)+"

The meaning:

( )?: we allow (but don't require) a space at the beginning
.+: some amount of free text (this is the "string" that is to be succeeded by a number)
" ": there must be a space after the string
[0-9]+: this is the number
The whole thing is enclosed in ()+, meaning that we are looking for this pattern to repeat one or more times.
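
As a quick sanity check of the pattern on a few individual pieces, here is a minimal base R sketch. It assumes that regexpr() with perl = TRUE treats this particular pattern the same way stringr does, and extract_first is a hypothetical helper name used only for illustration.

```r
# Pattern as described above.
succeed_num_regex <- "(( )?.+ [0-9]+)+"

# Hypothetical helper: return the matched part of each element,
# or NA when the pattern does not match at all.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)  # regmatches() only returns the hits
  out
}

pieces <- c("ab 1917", "scarl m", "ab 1672 a", "ab 18 sb 5 ab 1433 hdge")
extract_first(pieces, succeed_num_regex)
# [1] "ab 1917"            NA                   "ab 1672"
# [4] "ab 18 sb 5 ab 1433"
```

The greedy .+ is what lets the last piece keep everything up to the final number and drop only the trailing "hdge".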

Now we can put this in code:

library(tidyverse)
String %>%
  str_split("; ") %>%
  map(map_chr, str_extract, pattern = succeed_num_regex) %>%
  # Strings that did not have this pattern at all will be NA
  # We replace them here with ""
  map(map_chr, function(x) ifelse(is.na(x), "", x)) %>%
  # Put it all back together
  map_chr(paste, collapse = "; ")

[1] "; ab 1917; ajr 69; sb 700; sb 703; ; ab 1672"
[2] "ab 18 sb 5 ab 1433; ab 1129 ab 184 ab 473"   
[3] "ab 3 16 31 41 1134 1206"                     
[4] "ab 479 862 984 1626"

Some notes:

In your output, you kept "NA" instead of it getting replaced with "", which is what later happened to "scarl m". This can be added as a rule to the solution, but for now I did not add it because it is not consistent with your requirements.
In your output, the second result "ab 18 sb 5 ab 1433 ab 1129 ab 184 ab 473" is missing a semi-colon after 1433. If that was not a mistake, then please explain why.
In your output, we have sb 703;; whereas my output has sb 703; ;. This is for consistency, because the results are pasted back together with "; ". Let me know if this is problematic (I left it as is since that isn't a clear requirement either).
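
If keeping "NA" does turn out to be a requirement, the extra rule mentioned in the first note could look roughly like the sketch below. This is only an assumption about what the rule should be (pass through any piece that is exactly "NA"), written in base R rather than tidyverse, and keep_numbered is a hypothetical name.

```r
String <- c("NA; ab 1917; ajr 69; sb 700; sb 703; scarl m; ab 1672 a",
            "ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a",
            "ab 3 16 31 41 1134 1206 abuht",
            "ab 479 862 984 1626 asc")

succeed_num_regex <- "(( )?.+ [0-9]+)+"

# Hypothetical helper: process one string, keeping pieces listed in `keep`.
keep_numbered <- function(s, keep = "NA") {
  pieces <- strsplit(s, "; ", fixed = TRUE)[[1]]
  m <- regexpr(succeed_num_regex, pieces, perl = TRUE)
  out <- rep("", length(pieces))          # unmatched pieces become ""
  out[m != -1] <- regmatches(pieces, m)   # matched part of each hit
  # Assumed extra rule: pieces in `keep` are passed through unchanged.
  is_kept <- pieces %in% keep
  out[is_kept] <- pieces[is_kept]
  paste(out, collapse = "; ")
}

vapply(String, keep_numbered, character(1), USE.NAMES = FALSE)
# [1] "NA; ab 1917; ajr 69; sb 700; sb 703; ; ab 1672"
# [2] "ab 18 sb 5 ab 1433; ab 1129 ab 184 ab 473"
# [3] "ab 3 16 31 41 1134 1206"
# [4] "ab 479 862 984 1626"
```

The other two notes (the missing semicolon after 1433 and ";;" versus "; ;") are left as they are, since the question does not settle them.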

CodePudding user response：

Using an editor like VS Code or Notepad++, I can match this with (\s)([a-z][a-z][a-z]+) and replace it with $1.

Your requirement is confusing: according to your output, 'ajr' is not supposed to be matched, while 'asc' is. My hack above matches both 'ajr' and 'asc'.

A breakdown of my hack is:

(\s) matches the space before the group of letters. I noticed that you want to match only groups of letters found after a space.
([a-z][a-z][a-z]+) matches groups of three or more letters (I noticed that you do not want to match two-letter groups).
$1 replaces the whole match with the captured space only, so the letters are removed.

I hope it helps. You can take this and translate it into the programming language you are using.
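
Since the question uses R, a rough translation could be the sketch below. It assumes base R's gsub() with perl = TRUE, where the backreference is written \\1 instead of $1. The trimws() call is an extra step, not part of the original hack, to drop the space left at the end.

```r
String <- c("NA; ab 1917; ajr 69; sb 700; sb 703; scarl m; ab 1672 a",
            "ab 18 sb 5 ab 1433 hdge; ab 1129 ab 184 ab 473 a",
            "ab 3 16 31 41 1134 1206 abuht",
            "ab 479 862 984 1626 asc")

# Same pattern as in the editor; "\\1" puts back only the captured space.
cleaned <- gsub("(\\s)([a-z][a-z][a-z]+)", "\\1", String, perl = TRUE)

# Extra step (an addition): remove the trailing space left behind.
cleaned <- trimws(cleaned)
cleaned
```

As noted above, this also strips 'ajr' (and 'scarl'), and it leaves the one-letter groups 'm' and 'a' in place, so it does not reproduce the desired output exactly.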