How to extract two separate groups in regex?


In this text, I'm trying to extract all of the titles and page numbers of lines that:

  1. Begin with -
  2. Followed by whitespace
  3. Then a chapter title
  4. Then the page #, like so: #page=9&

Here's an example of the raw text I'm working with:

|   "Principles of Microeconomics 2e"   (null)
-   "Preface"   #page=9&zoom=0,0,58
|       "1. About OpenStax" #page=9&zoom=0,0,150
|       "2. About OpenStax resources"   #page=9&zoom=0,0,248
|       "3. About Principles of Microeconomics 2e"  #page=9&zoom=0,0,544
|       "4. Additional resources"   #page=13&zoom=0,0,367
|       "5. About the authors"  #page=14&zoom=0,0,58
-   "Chapter 1. Welcome to Economics!"  #page=17&zoom=0,0,58
|       "1.1. What Is Economics, and Why Is It Important?*" #page=18&zoom=0,0,338
|       "1.2. Microeconomics and Macroeconomics*"   #page=22&zoom=0,0,448
|       "1.3. How Economists Use Theories and Models to Understand Economic Issues*"    #page=23&zoom=0,0,565

My current regex: (?:\-\s)(["'])(?:(?=(\\?))\2.)*?(?:page=)(?<=\=)(.*?)(?=\&)

Currently, this regex matches the entire line but doesn't put the desired elements into separate groups, and I'm having trouble separating them.


desired output:

Match 1: "Preface" #page=9&
Group 1: Preface
Group 2: 9

Match 2: "Chapter 1. Welcome to Economics!"  #page=17&
Group 1: Chapter 1. Welcome to Economics!
Group 2: 17

I am trying to extract the title in one group, and the page number in another. How can I do this?

CodePudding user response:

Start with

sed -En 's/^-.*"([^"]+)".*page=([[:digit:]]+).*/\1\n\2/p' file
Preface
9
Chapter 1. Welcome to Economics!
17

sed does not use PCRE, so non-capturing parentheses, lookarounds, and similar constructs are not available.
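Since lookarounds are off the table, one way to get the same effect is to match the surrounding literal text (the quotes, `#page=`, the trailing `&`) outside the capturing parentheses, so it is consumed but not kept. The following is only a sketch under a few assumptions: the function name `extract_titles` is made up for illustration, the `|` separator is an arbitrary choice (pick another if titles can contain `|`), and the input is assumed to look like the sample in the question.

```shell
# Hypothetical helper: prints "title|page" for every line that starts with "-".
# Reads the files given as arguments, or standard input if there are none.
extract_titles() {
    # -E: extended regex, -n plus the p flag: print only lines where s/// succeeded.
    # Group 1 is the text between the double quotes, group 2 the digits after "#page=".
    sed -En 's/^-[[:space:]]+"([^"]+)"[[:space:]]+#page=([0-9]+)&.*/\1|\2/p' "$@"
}

# Demo on a few lines taken from the question.
extract_titles <<'EOF'
|   "Principles of Microeconomics 2e"   (null)
-   "Preface"   #page=9&zoom=0,0,58
|       "1. About OpenStax" #page=9&zoom=0,0,150
-   "Chapter 1. Welcome to Economics!"  #page=17&zoom=0,0,58
EOF
```

On that input the demo prints `Preface|9` and `Chapter 1. Welcome to Economics!|17`, one per line. Keeping both groups on one line also avoids `\n` in the replacement, which not every sed handles the same way.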

Update to address the new desired output.

sed, not having arbitrary variables, will have a hard time counting the matches. Using GNU awk with the same regex:

gawk '
    match($0, /^-.*"([^"]+)".*page=([[:digit:]]+)/, m) {
        printf "Match %d: %s\n", ++n, m[0]
        printf "Group 1: %s\n", m[1]
        printf "Group 2: %s\n\n", m[2]
    }
' file
Match 1: -   "Preface"   #page=9
Group 1: Preface
Group 2: 9

Match 2: -   "Chapter 1. Welcome to Economics!"  #page=17
Group 1: Chapter 1. Welcome to Economics!
Group 2: 17

GNU awk is required for the 3-argument match() function.
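If gawk is not available, roughly the same result can be had with the two-argument match() that POSIX awk provides, using RSTART and RLENGTH to cut out each piece. This is a sketch rather than a drop-in replacement: the function name `number_matches` is hypothetical, and the "Match" line is rebuilt from the two groups with single spaces instead of echoing the original text.

```shell
# Hypothetical helper: numbers each "-" line and prints title and page as groups.
# Reads the files given as arguments, or standard input if there are none.
number_matches() {
    awk '
        /^-/ {
            # Title: the first double-quoted string on the line, quotes stripped.
            if (!match($0, /"[^"]+"/)) next
            title = substr($0, RSTART + 1, RLENGTH - 2)

            # Page: the digits after "page=" in whatever follows the title.
            rest = substr($0, RSTART + RLENGTH)
            if (!match(rest, /page=[0-9]+/)) next
            page = substr(rest, RSTART + 5, RLENGTH - 5)   # 5 = length("page=")

            printf "Match %d: \"%s\" #page=%s&\n", ++n, title, page
            printf "Group 1: %s\n", title
            printf "Group 2: %s\n\n", page
        }
    ' "$@"
}

# Demo on a few lines taken from the question.
number_matches <<'EOF'
-   "Preface"   #page=9&zoom=0,0,58
|       "1. About OpenStax" #page=9&zoom=0,0,150
-   "Chapter 1. Welcome to Economics!"  #page=17&zoom=0,0,58
EOF
```

Each match() call overwrites RSTART and RLENGTH, which is why the title is saved with substr() before the second call.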

CodePudding user response:

With GNU awk:

awk 'BEGIN{ FPAT="\".*\"|#page=[0-9]+&" }
     /^- /{
       print "Match", ++m ":", $1, $2;

       gsub(/"/, "", $1);               # remove all "
       print "Group 1:", $1;

       gsub(/[^0-9]+/, "", $2);         # remove all but numbers
       print "Group 2:", $2;

       print ""
     }' file


Match 1: "Preface" #page=9&
Group 1: Preface
Group 2: 9

Match 2: "Chapter 1. Welcome to Economics!" #page=17&
Group 1: Chapter 1. Welcome to Economics!
Group 2: 17

FPAT: A regular expression describing the contents of the fields in a record.

CodePudding user response:

Using sed

$ sed -n '/[^0-9]*.\./s/-[^"]*\("\)\([^#]*\)\("\)\([^0-9]*\([0-9]*\)&\).*/Match 2: \1\2\3\4\nGroup 1: \2\nGroup 2: \5/p
;/-/s/-[^"]*\("\)\([^#]*\)\("\)\([^0-9]*\([0-9]*\)&\).*/Match 1: \1\2\3\4\nGroup 1: \2\nGroup 2: \5\n/p' input_file


Match 1: "Preface"   #page=9&
Group 1: Preface
Group 2: 9

Match 2: "Chapter 1. Welcome to Economics!"  #page=17&
Group 1: Chapter 1. Welcome to Economics!
Group 2: 17