How to extract two separate groups in regex?

In this text, I'm trying to extract all of the titles and page numbers of lines that:

Begin with -
Followed by whitespace
Then a chapter title
Then the page #, like so: #page=9&

Here's an example of the raw text I'm working with:

|   "Principles of Microeconomics 2e"   (null)
-   "Preface"   #page=9&zoom=0,0,58
|       "1. About OpenStax" #page=9&zoom=0,0,150
|       "2. About OpenStax resources"   #page=9&zoom=0,0,248
|       "3. About Principles of Microeconomics 2e"  #page=9&zoom=0,0,544
|       "4. Additional resources"   #page=13&zoom=0,0,367
|       "5. About the authors"  #page=14&zoom=0,0,58
-   "Chapter 1. Welcome to Economics!"  #page=17&zoom=0,0,58
|       "1.1. What Is Economics, and Why Is It Important?*" #page=18&zoom=0,0,338
|       "1.2. Microeconomics and Macroeconomics*"   #page=22&zoom=0,0,448
|       "1.3. How Economists Use Theories and Models to Understand Economic Issues*"    #page=23&zoom=0,0,565

My current regex: (?:\-\s)(["'])(?:(?=(\\?))\2.)*?(?:page=)(?<=\=)(.*?)(?=\&)

This regex currently matches the entire line, but it doesn't put the desired elements into separate groups, and I'm having trouble doing that separation.

desired output:

Match 1: "Preface" #page=9&
Group 1: Preface
Group 2: 9

Match 2: "Chapter 1. Welcome to Economics!"  #page=17&
Group 1: Chapter 1. Welcome to Economics!
Group 2: 17

I am trying to extract the title in one group, and the page number in another. How can I do this?

CodePudding user response：

Start with

sed -En 's/^-.*"([^"]+)".*page=([[:digit:]]+).*/\1\n\2/p' file

Preface
9
Chapter 1. Welcome to Economics!
17

sed does not use PCRE, so you don't have non-capturing parentheses, lookarounds, etc.

Update to address the new desired output.

sed, not having arbitrary variables, will have a hard time counting the matches. Using GNU awk with the same regex:

gawk '
    match($0, /^-.*"([^"]+)".*page=([[:digit:]]+)/, m) {
        printf "Match %d: %s\n", ++n, m[0]
        printf "Group 1: %s\n", m[1]
        printf "Group 2: %s\n\n", m[2]
    }
' file

Match 1: -   "Preface"   #page=9
Group 1: Preface
Group 2: 9

Match 2: -   "Chapter 1. Welcome to Economics!"  #page=17
Group 1: Chapter 1. Welcome to Economics!
Group 2: 17

GNU awk is required for the 3-argument match() function.
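Where gawk is not available, a rough equivalent in POSIX awk might look like the sketch below. It splits each line on `"` instead of using capture groups, so it assumes the title itself contains no double quotes. The function name `extract_with_posix_awk` is made up for illustration.

```shell
# Hypothetical helper: POSIX awk has no 3-argument match(), so use the
# double quote as the field separator and RSTART/RLENGTH for the page number.
extract_with_posix_awk() {
    awk -F'"' '
        # $2 is the text between the first pair of quotes, $3 is what follows
        /^-/ && match($3, /#page=[0-9]+&/) {
            # skip the 6 characters of "#page=" and drop the trailing "&"
            page = substr($3, RSTART + 6, RLENGTH - 7)
            printf "Match %d: \"%s\" #page=%s&\n", ++n, $2, page
            printf "Group 1: %s\n", $2
            printf "Group 2: %s\n\n", page
        }
    ' "$@"
}

# Usage on a few lines of the sample text
printf '%s\n' \
    '-   "Preface"   #page=9&zoom=0,0,58' \
    '|       "1. About OpenStax" #page=9&zoom=0,0,150' \
    '-   "Chapter 1. Welcome to Economics!"  #page=17&zoom=0,0,58' |
    extract_with_posix_awk
```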

CodePudding user response：

With GNU awk:

awk 'BEGIN{ FPAT="\".*\"|#page=[0-9]+&" }
     /^- /{
       print "Match", ++m ":", $1, $2;

       gsub(/"/, "", $1);               # remove all "
       print "Group 1:", $1;

       gsub(/[^0-9]+/, "", $2);         # remove all but numbers
       print "Group 2:", $2;

       print ""
     }' file

Output:

Match 1: "Preface" #page=9&
Group 1: Preface
Group 2: 9

Match 2: "Chapter 1. Welcome to Economics!" #page=17&
Group 1: Chapter 1. Welcome to Economics!
Group 2: 17

FPAT: A regular expression describing the contents of the fields in a record.

CodePudding user response：

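Using sed (the match numbers are hard-coded, so this only suits input with exactly these two matching lines):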

$ sed -n '/[^0-9]*.\./s/-[^"]*\("\)\([^#]*\)\("\)\([^0-9]*\([0-9]*\)&\).*/Match 2: \1\2\3\4\nGroup 1: \2\nGroup 2: \5/p
;/-/s/-[^"]*\("\)\([^#]*\)\("\)\([^0-9]*\([0-9]*\)&\).*/Match 1: \1\2\3\4\nGroup 1: \2\nGroup 2: \5\n/p' input_file

Output

Match 1: "Preface"   #page=9&
Group 1: Preface
Group 2: 9

Match 2: "Chapter 1. Welcome to Economics!"  #page=17&
Group 1: Chapter 1. Welcome to Economics!
Group 2: 17