How to extract all strings matching a regex (assuming no overlapping between the matched strings)?

Time:04-25

I want to extract all strings matching a regex from a text. I came up with the following awk code.

I am wondering whether there is a more efficient and more concise way to capture all strings matching a regex in a text.

[0-9] is just a test; the regex can be arbitrary, and so can the input text. Any command-line solution on Linux is acceptable, including other tools such as sed, since I am looking for something better than my current solution. A language like Python or bash is also fine, as long as the code is short enough to be written inline.

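awk -v regex='[0-9]' -e '{
 line = $0
 # Repeatedly find the leftmost match, record it, then continue after it.
 while(match(line, regex)) {
  res[substr(line, RSTART, RLENGTH)] = ""   # array keys keep each match once
  line = substr(line, RSTART + RLENGTH)
 }
}
END {
  for(k in res) print k
}' <<< 'a1b2c3'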

CodePudding user response:

With GNU awk and its FPAT variable/feature:

$ awk -v regex='[0-9]' 'BEGIN {FPAT=regex} {for (i=1;i<=NF;i++) print $i}' <<< 'a1b2c3'
1
2
3

FPAT=... defines the actual field(s) (as opposed to FS, which defines the field delimiter).
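
The same distinction (a regex that describes the fields versus one that describes the separators) can be sketched in Python with the standard re module. This is only an illustration of the idea, not a description of how awk implements it; text and pattern are names chosen for the example.

```python
import re

text = "a1b2c3"
pattern = r"[0-9]"

# FPAT-style: the regex describes the fields themselves.
fields = re.findall(pattern, text)

# FS-style: the regex describes the separators between fields.
pieces = re.split(pattern, text)

print(fields)  # ['1', '2', '3']
print(pieces)  # ['a', 'b', 'c', '']
```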

CodePudding user response:

A Python solution:

python -c "import re, sys; print('\n'.join(re.compile(sys.argv[1]).findall(sys.argv[2])))" "__regx_pattern__" "__text__"

example:
python -c "import re, sys; print('\n'.join(re.compile(sys.argv[1]).findall(sys.argv[2])))" "hi_[0-9]*" "hi_1 hi_2"
output:

hi_1
hi_2
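
One caveat when the regex is arbitrary: if the pattern contains capture groups, re.findall returns the groups rather than the whole match. A minimal sketch that always returns the full matched text, assuming re.finditer is acceptable; all_matches is a hypothetical helper name, not part of the one-liner above:

```python
import re

def all_matches(pattern, text):
    """Return the full text of every non-overlapping match, left to right."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# With a capture group, findall yields only the group contents.
print(re.findall(r"hi_([0-9]*)", "hi_1 hi_2"))   # ['1', '2']
# finditer with group(0) yields the whole match regardless of groups.
print(all_matches(r"hi_([0-9]*)", "hi_1 hi_2"))  # ['hi_1', 'hi_2']
```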

CodePudding user response:

If all you need to do is print every match, then grep should be enough.

$ echo a1b2c3 | rg '[0-9]' --only-matching
1
2
3

rg is ripgrep, an "improved" version of standard grep. You can do the same task with grep as well, but I find rg's syntax more ergonomic than grep's.

CodePudding user response:

How about a perl solution:

perl -lne 'print for /(\d)/g' <<< 'a1b2c3'
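  • -n loops over the input lines without printing them automatically (much like sed -n); -e supplies the script on the command line.
  • -l strips the trailing newline from each input line and appends a newline to each output of print.
  • print for /(regex)/g constructs a loop that prints every matched substring.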

CodePudding user response:

It sounds like grep -o regexp file is all you need.

e.g.:

$ grep -o '[0-9]' <<< 'a1b2c3'
1
2
3
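
Note that the awk script in the question stores its matches as array keys, so each distinct match is printed only once, whereas grep -o prints every occurrence. A minimal Python sketch of the de-duplicating behaviour, assuming first-seen order is wanted (the awk for-in loop does not guarantee any order); unique_matches is a hypothetical helper name:

```python
import re

def unique_matches(pattern, text):
    """Return each distinct full match once, in order of first appearance."""
    found = (m.group(0) for m in re.finditer(pattern, text))
    # dict preserves insertion order, so it doubles as an ordered set.
    return list(dict.fromkeys(found))

print(unique_matches(r"[0-9]", "a1b2c1"))  # ['1', '2']
```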