Why does the regexp that works on regex101.com NOT work with grep or with vim?

Time:01-02

I have an ascii file in the format below:

gc_ab_cd               92641.48   25.2    5.12  9.20   0.00  gc_ht_t_gc_ab_cd
gc_ab_cd/reg              29.24    0.0    0.49  0.01   0.00  gc_ht_t_CHECK1_0
gc_ab_cd/reg/dff_in_gated 17.13    0.0    6.00  11.13  0.00  gc_ht_t_dff_en_in_WIDTH84_0
gc_ab_cd/reg/dff_in_send_gated 0.20  0.0  0.00   0.20  0.00  gc_ht_t_dff_in_WIDTH1_33
gc_ab_cd/reg/rd_rtn       11.42    0.0    4.20   7.22  0.00  gc_ht_t_gfx_2toN_WIDTH32_1
gc_ab_cd/regs          18583.88    5.1 2958.87  25.01  0.00  gc_ht_t_gc_ab_cd_regs
gc_ab_cd/tap_ch          431.51    0.1  144.83 150.05  0.00  gc_ht_t_gc_vm2_qe128
gc_ab_cd/tap_ch/throttle 136.63    0.0   77.33  59.30  0.00  gc_ht_t_gc_vm2__throttle
gc_ab_cd/vm2_dbg          22.79    0.0    0.00   0.00  0.00  gc_ht_t_gfx_dbg_mux_01
gc_ab_cd/vm2_dbg/bg_mux   22.79    0.0    9.90   4.80  0.00  gc_ht_t_gc_dbg_mux_4_1_01
gc_ab_cd/vm2_dbg/bg_mux/clk  0.20  0.0    0.00   0.20  0.00  gc_ht_t_clock
gc_ab_cd/vm2_dbg/bg__mux/flop_mux_flop 5.33 0.0 2.63 2.70 0.00 gc_ht_t_dbg_COUNT4_WIDTH8_0

I need to grep for the lines whose first field has 0 or 1 levels of hierarchy, so that grep prints the following to stdout:

gc_ab_cd               92641.48   25.2    5.12  9.20   0.00  gc_ht_t_gc_ab_cd
gc_ab_cd/reg              29.24    0.0    0.49  0.01   0.00  gc_ht_t_CHECK1_0
gc_ab_cd/regs          18583.88    5.1 2958.87  25.01  0.00  gc_ht_t_gc_ab_cd_regs
gc_ab_cd/tap_ch          431.51    0.1  144.83 150.05  0.00  gc_ht_t_gc_vm2_qe128
gc_ab_cd/vm2_dbg          22.79    0.0    0.00   0.00  0.00  gc_ht_t_gfx_dbg_mux_01

I used the regexp https://regex101.com/r/D92KSP/1

But it gives only 3 matches below (1 level of hierarchy in the first field), as can be seen in https://regex101.com/r/D92KSP/1

gc_ab_cd/reg              29.24    0.0    0.49  0.01   0.00  gc_ht_t_CHECK1_0
gc_ab_cd/regs          18583.88    5.1 2958.87  25.01  0.00  gc_ht_t_gc_ab_cd_regs
gc_ab_cd/tap_ch          431.51    0.1  144.83 150.05  0.00  gc_ht_t_gc_vm2_qe128

Questions:

[1] I'm NOT sure why the below line (0 hierarchy in the first field) is NOT being matched by the regexp in https://regex101.com/r/D92KSP/1

gc_ab_cd               92641.48   25.2    5.12  9.20   0.00  gc_ht_t_gc_ab_cd

[2] What should I do to modify the regexp https://regex101.com/r/D92KSP/1 to match the line below

gc_ab_cd/vm2_dbg          22.79    0.0    0.00   0.00  0.00  gc_ht_t_gfx_dbg_mux_01

[3] I used the above regexp with "grep" and in the vim editor in Linux and it doesn't work there, though it works partially on regex101.com. Why?

CodePudding user response:

regex101 and similar web sites help you create and validate a regexp that works on that web site. Don't assume it will work anywhere else, especially in the mandatory POSIX command-line tools such as sed, grep, and awk. Each tool uses specific regexp variants (BRE, ERE, and/or PCRE) with different arguments (e.g. -E to enable EREs in grep and sed, -P to enable PCREs in grep, with some caveats), extensions (e.g. word boundaries, shortcuts, or back references), and limitations (e.g. delimiter chars). You have to learn which regexp variant, with which extensions and limitations, is supported by the version (e.g. GNU or BSD) of the tool you want to use.
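As a rough illustration of the variant problem (a sketch, not taken from the question's regex101 link): without options grep uses BRE, where + is an ordinary character, while with -E it means "one or more". The sample line is from the question; the variable names bre_count and ere_count are made up for this example.

```shell
# One 0-level line from the question's input.
line='gc_ab_cd               92641.48   25.2    5.12  9.20   0.00  gc_ht_t_gc_ab_cd'

# Default grep (BRE): "+" is a literal plus sign, so the line is not matched.
# "|| true" keeps the script going, since grep exits non-zero on no match.
bre_count=$(printf '%s\n' "$line" | grep -c '^gc_[a-z_]+[[:space:]]' || true)

# grep -E (ERE): "+" is the "one or more" quantifier, so the line is matched.
ere_count=$(printf '%s\n' "$line" | grep -Ec '^gc_[a-z_]+[[:space:]]' || true)

echo "BRE matches: $bre_count, ERE matches: $ere_count"
```

The same pattern text therefore selects different lines depending on which variant the tool is told to use.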

In any case, any time you're talking about fields you should be using awk, not grep (or sed) since awk is the tool that separates input into fields. The following will work using any awk in any shell on every Unix box:

$ awk '$1 ~ "^[^/]*/?[^/]*$"' file
gc_ab_cd               92641.48   25.2    5.12  9.20   0.00  gc_ht_t_gc_ab_cd
gc_ab_cd/reg              29.24    0.0    0.49  0.01   0.00  gc_ht_t_CHECK1_0
gc_ab_cd/regs          18583.88    5.1 2958.87  25.01  0.00  gc_ht_t_gc_ab_cd_regs
gc_ab_cd/tap_ch          431.51    0.1  144.83 150.05  0.00  gc_ht_t_gc_vm2_qe128
gc_ab_cd/vm2_dbg          22.79    0.0    0.00   0.00  0.00  gc_ht_t_gfx_dbg_mux_01

Or, to select lines up to a specific path depth, set a numeric variable on the command line (n is the maximum number of path components allowed in the first field):

$ awk -v n=2 '{key=$1} gsub("/","&",key)<n' file
gc_ab_cd               92641.48   25.2    5.12  9.20   0.00  gc_ht_t_gc_ab_cd
gc_ab_cd/reg              29.24    0.0    0.49  0.01   0.00  gc_ht_t_CHECK1_0
gc_ab_cd/regs          18583.88    5.1 2958.87  25.01  0.00  gc_ht_t_gc_ab_cd_regs
gc_ab_cd/tap_ch          431.51    0.1  144.83 150.05  0.00  gc_ht_t_gc_vm2_qe128
gc_ab_cd/vm2_dbg          22.79    0.0    0.00   0.00  0.00  gc_ht_t_gfx_dbg_mux_01

$ awk -v n=3 '{key=$1} gsub("/","&",key)<n' file
gc_ab_cd               92641.48   25.2    5.12  9.20   0.00  gc_ht_t_gc_ab_cd
gc_ab_cd/reg              29.24    0.0    0.49  0.01   0.00  gc_ht_t_CHECK1_0
gc_ab_cd/reg/dff_in_gated 17.13    0.0    6.00  11.13  0.00  gc_ht_t_dff_en_in_WIDTH84_0
gc_ab_cd/reg/dff_in_send_gated 0.20  0.0  0.00   0.20  0.00  gc_ht_t_dff_in_WIDTH1_33
gc_ab_cd/reg/rd_rtn       11.42    0.0    4.20   7.22  0.00  gc_ht_t_gfx_2toN_WIDTH32_1
gc_ab_cd/regs          18583.88    5.1 2958.87  25.01  0.00  gc_ht_t_gc_ab_cd_regs
gc_ab_cd/tap_ch          431.51    0.1  144.83 150.05  0.00  gc_ht_t_gc_vm2_qe128
gc_ab_cd/tap_ch/throttle 136.63    0.0   77.33  59.30  0.00  gc_ht_t_gc_vm2__throttle
gc_ab_cd/vm2_dbg          22.79    0.0    0.00   0.00  0.00  gc_ht_t_gfx_dbg_mux_01
gc_ab_cd/vm2_dbg/bg_mux   22.79    0.0    9.90   4.80  0.00  gc_ht_t_gc_dbg_mux_4_1_01

CodePudding user response:

Edit: as mentioned by Ed Morton in the comments, this answer assumes the use of GNU grep in multiple places.


Looking only at the start of your regex, you are using

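gc_[a-z_]+[\s+|\/][a-z_]+

which means:

  1. 'gc_' followed by 1 or more a-z_ characters
  2. a single character from the bracket expression: whitespace, or +, or |, or /
  3. 1 or more 'a-z_' characters

0-level rows do not match the 3rd part: after the first whitespace character comes more whitespace, not an a-z_ character.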

What you want is probably something like

gc_[a-z_]+(\s+|\/[a-z_]+)

which is:

  1. 'gc_' followed by 1 or more a-z_ characters
  2. either whitespace OR / followed by 1 or more a-z_ characters (level 1)

For the second question vm2_dbg is not matched since it contains 2, which isn't in [a-z_]. Either use [a-z0-9_], or use \w if uppercase characters are also acceptable (note: this may not work on non-GNU grep).
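A small check of the character-class point, as a sketch using grep -E with POSIX [[:space:]] in place of \s so that it does not depend on GNU extensions. The anchored patterns here are simplified stand-ins for the regex in the question, and the variables narrow and wide are names invented for the example.

```shell
# The level-1 line from the question that was not being matched.
line='gc_ab_cd/vm2_dbg          22.79    0.0    0.00   0.00  0.00  gc_ht_t_gfx_dbg_mux_01'

# [a-z_] contains no digits, so "/vm2_dbg" cannot be consumed up to the whitespace.
if printf '%s\n' "$line" | grep -Eq '^gc_[a-z_]+(/[a-z_]+)?[[:space:]]'; then
    narrow=match
else
    narrow=nomatch
fi

# Adding 0-9 to the bracket expression lets "vm2_dbg" through.
if printf '%s\n' "$line" | grep -Eq '^gc_[a-z0-9_]+(/[a-z0-9_]+)?[[:space:]]'; then
    wide=match
else
    wide=nomatch
fi

echo "[a-z_]: $narrow, [a-z0-9_]: $wide"
```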


The regex won't work with grep by default because grep uses POSIX basic regular expressions (BRE) unless told otherwise, not PCRE. You can use grep -E or grep -P here, though -P is probably best if you're trying out your regex on PCRE engines (note: -P is only available in GNU grep; it won't work on e.g. Mac OS, which uses BSD grep).
Vim uses its own engine, which isn't PCRE compatible. Since your regex is fairly simple, though, using \v (see :help /magic) will make it work in vim too, e.g.

/\vgc_[a-z_]+(\s+|\/[a-z_]+)\s+[0-9]+\.[0-9]+.*

or

:s/\vgc_[a-z_]+(\s+|\/[a-z_]+)\s+[0-9]+\.[0-9]+.*/replacement/

If you only want to print the "0 or 1 level" lines though you can simplify your regex to something like (assuming uppercase characters are ok)

^gc_\w+(\/\w+)?\s

which is

  1. ^gc_\w+ to match the 0th level (^ to match only at the start of a line)
  2. (\/\w+)? to allow for an optional (? is "0 or 1") second level
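For completeness, here is how that simplified pattern might be run against the sample input. This is a sketch under two assumptions: it uses grep -E, and it spells \w as [[:alnum:]_] and \s as [[:space:]] so that it should also work with a non-GNU grep. The temporary file and the variable names sample and matches are only for this example.

```shell
# Write the question's sample input to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
gc_ab_cd               92641.48   25.2    5.12  9.20   0.00  gc_ht_t_gc_ab_cd
gc_ab_cd/reg              29.24    0.0    0.49  0.01   0.00  gc_ht_t_CHECK1_0
gc_ab_cd/reg/dff_in_gated 17.13    0.0    6.00  11.13  0.00  gc_ht_t_dff_en_in_WIDTH84_0
gc_ab_cd/reg/dff_in_send_gated 0.20  0.0  0.00   0.20  0.00  gc_ht_t_dff_in_WIDTH1_33
gc_ab_cd/reg/rd_rtn       11.42    0.0    4.20   7.22  0.00  gc_ht_t_gfx_2toN_WIDTH32_1
gc_ab_cd/regs          18583.88    5.1 2958.87  25.01  0.00  gc_ht_t_gc_ab_cd_regs
gc_ab_cd/tap_ch          431.51    0.1  144.83 150.05  0.00  gc_ht_t_gc_vm2_qe128
gc_ab_cd/tap_ch/throttle 136.63    0.0   77.33  59.30  0.00  gc_ht_t_gc_vm2__throttle
gc_ab_cd/vm2_dbg          22.79    0.0    0.00   0.00  0.00  gc_ht_t_gfx_dbg_mux_01
gc_ab_cd/vm2_dbg/bg_mux   22.79    0.0    9.90   4.80  0.00  gc_ht_t_gc_dbg_mux_4_1_01
gc_ab_cd/vm2_dbg/bg_mux/clk  0.20  0.0    0.00   0.20  0.00  gc_ht_t_clock
gc_ab_cd/vm2_dbg/bg__mux/flop_mux_flop 5.33 0.0 2.63 2.70 0.00 gc_ht_t_dbg_COUNT4_WIDTH8_0
EOF

# Level 0, optionally followed by one "/level1" part, then whitespace.
# The trailing [[:space:]] rejects deeper paths, whose next character is "/".
matches=$(grep -E '^gc_[[:alnum:]_]+(/[[:alnum:]_]+)?[[:space:]]' "$sample")

printf '%s\n' "$matches"
rm -f "$sample"
```

With GNU grep the same filter could presumably be written with \w and \s directly, either with -E or with -P.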