In this text, I'm trying to extract all of the titles and page numbers of lines that:
- Begin with -
- Followed by whitespace
- Then a chapter title
- Then the page #, like so: #page=9&
Here's an example of the raw text I'm working with:
| "Principles of Microeconomics 2e" (null)
- "Preface" #page=9&zoom=0,0,58
| "1. About OpenStax" #page=9&zoom=0,0,150
| "2. About OpenStax resources" #page=9&zoom=0,0,248
| "3. About Principles of Microeconomics 2e" #page=9&zoom=0,0,544
| "4. Additional resources" #page=13&zoom=0,0,367
| "5. About the authors" #page=14&zoom=0,0,58
- "Chapter 1. Welcome to Economics!" #page=17&zoom=0,0,58
| "1.1. What Is Economics, and Why Is It Important?*" #page=18&zoom=0,0,338
| "1.2. Microeconomics and Macroeconomics*" #page=22&zoom=0,0,448
| "1.3. How Economists Use Theories and Models to Understand Economic Issues*" #page=23&zoom=0,0,565
My current regex:
(?:\-\s)(["'])(?:(?=(\\?))\2.)*?(?:page=)(?<=\=)(.*?)(?=\&)
Currently this regex matches the entire line, but it doesn't put the desired elements into separate groups, and I'm having trouble doing this separation.
desired output:
Match 1: "Preface" #page=9&
Group 1: Preface
Group 2: 9
Match 2: "Chapter 1. Welcome to Economics!" #page=17&
Group 1: Chapter 1. Welcome to Economics!
Group 2: 17
I am trying to extract the title in one group, and the page number in another. How can I do this?
CodePudding user response:
Start with
sed -En 's/^-.*"([^"]+)".*page=([[:digit:]]+).*/\1\n\2/p' file
Preface
9
Chapter 1. Welcome to Economics!
17
sed does not use PCRE, so you don't have non-capturing parentheses, lookarounds, etc.
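For comparison, in a flavor that does support capture groups the way the question's regex tester does, the same idea needs no lookarounds at all. The following is only a sketch using Python's `re` module, which is an assumption on my part since the question does not name a host language; `raw` is a shortened copy of the sample text and `pattern` is an illustrative name.

```python
import re

# Shortened copy of the sample text from the question.
raw = '''| "Principles of Microeconomics 2e" (null)
- "Preface" #page=9&zoom=0,0,58
| "1. About OpenStax" #page=9&zoom=0,0,150
- "Chapter 1. Welcome to Economics!" #page=17&zoom=0,0,58
| "1.1. What Is Economics, and Why Is It Important?*" #page=18&zoom=0,0,338
'''

# Line must start with "-", then whitespace, then a quoted title (group 1),
# then "#page=" followed by digits (group 2) and a literal "&".
pattern = re.compile(r'^-\s+"([^"]+)"\s+#page=(\d+)&', re.MULTILINE)

pairs = [(m.group(1), int(m.group(2))) for m in pattern.finditer(raw)]
print(pairs)
```

Anchoring with `^` and `re.MULTILINE` is what keeps the `|` lines out, so the title and page number can be plain capture groups.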
Update to address the new desired output.
sed, not having arbitrary variables, will have a hard time counting the matches. Using GNU awk with the same regex:
gawk '
match($0, /^-.*"([^"]+)".*page=([[:digit:]]+)/, m) {
printf "Match %d: %s\n", ++n, m[0]
printf "Group 1: %s\n", m[1]
printf "Group 2: %s\n\n", m[2]
}
' file
Match 1: - "Preface" #page=9
Group 1: Preface
Group 2: 9
Match 2: - "Chapter 1. Welcome to Economics!" #page=17
Group 1: Chapter 1. Welcome to Economics!
Group 2: 17
GNU awk is required for the 3-argument match() function.
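The counting step is the part sed struggles with: a counter is incremented only when a line matches. As a rough sketch of the same logic outside awk (Python is my assumption here, and `LINE`, `report` and `sample` are illustrative names), with an extra outer group so the "Match" line omits the leading "- " as in the desired output:

```python
import re

# Group 1: the part to display, group 2: the title, group 3: the page number.
LINE = re.compile(r'^-\s+("([^"]+)"\s+#page=(\d+)&)')

def report(lines):
    """Return the Match/Group lines for every line that starts with '- '."""
    out = []
    n = 0  # running match counter, bumped only on matching lines
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue
        n += 1
        out.append(f"Match {n}: {m.group(1)}")
        out.append(f"Group 1: {m.group(2)}")
        out.append(f"Group 2: {m.group(3)}")
    return out

sample = [
    '| "Principles of Microeconomics 2e" (null)',
    '- "Preface" #page=9&zoom=0,0,58',
    '| "5. About the authors" #page=14&zoom=0,0,58',
    '- "Chapter 1. Welcome to Economics!" #page=17&zoom=0,0,58',
]
print("\n".join(report(sample)))
```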
CodePudding user response:
With GNU awk:
awk 'BEGIN{ FPAT="\".*\"|#page=[0-9]+&" }
/^- /{
print "Match", ++m ":", $1, $2;
gsub(/"/, "", $1); # remove all "
print "Group 1:", $1;
gsub(/[^0-9]+/, "", $2); # remove all but numbers
print "Group 2:", $2;
print ""
}' file
Output:
Match 1: "Preface" #page=9&
Group 1: Preface
Group 2: 9
Match 2: "Chapter 1. Welcome to Economics!" #page=17&
Group 1: Chapter 1. Welcome to Economics!
Group 2: 17
FPAT: A regular expression describing the contents of the fields in a record.
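In other words, fields are defined by what they look like rather than by what separates them. A loose analogue of that idea, sketched in Python as an assumption (this is not how gawk implements FPAT, and gawk picks the longest match where Python tries alternatives in order, which gives the same result on this input); `split_fields` is a hypothetical helper name:

```python
import re

# Same field pattern as the FPAT above: a quoted string, or "#page=<digits>&".
FIELD = re.compile(r'".*"|#page=[0-9]+&')

def split_fields(line):
    """Return the pieces of the line that look like fields, in order."""
    return FIELD.findall(line)

line = '- "Chapter 1. Welcome to Economics!" #page=17&zoom=0,0,58'
fields = split_fields(line)

title = fields[0].replace('"', '')        # remove all "
page = re.sub(r'[^0-9]+', '', fields[1])  # remove all but numbers
print(fields, title, page)
```

Everything that does not match the pattern (the leading "- " and the trailing zoom values) is simply not part of any field.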
CodePudding user response:
Using sed
$ sed -n '/[^0-9]*.\./s/-[^"]*\("\)\([^#]*\)\("\)\([^0-9]*\([0-9]*\)&\).*/Match 2: \1\2\3\4\nGroup 1: \2\nGroup 2: \5/p
;/-/s/-[^"]*\("\)\([^#]*\)\("\)\([^0-9]*\([0-9]*\)&\).*/Match 1: \1\2\3\4\nGroup 1: \2\nGroup 2: \5\n/p' input_file
Output
Match 1: "Preface" #page=9&
Group 1: Preface
Group 2: 9
Match 2: "Chapter 1. Welcome to Economics!" #page=17&
Group 1: Chapter 1. Welcome to Economics!
Group 2: 17