Processing a specific part of a text according to pattern from AWK script

Time:10-22

I'm developing an awk script to convert a TeX document into HTML, according to my preferences.

#!/bin/awk -f

BEGIN {
    FS="\n";
    print "<html><body>"
}
# Function to print a row with one argument to handle either a 'th' tag or 'td' tag
function printRow(tag) {
    for(i=1; i<=NF; i++) print "<"tag">"$i"</"tag">";
}

NR>1 {
   [conditions]
   printRow("p")
}

END {
    print "</body></html>"
}

As you can see, it's at a very early stage of development. The input looks like this:

\documentclass[a4paper, 11pt, titlepage]{article}
\usepackage{fancyhdr}
\usepackage{graphicx}
\usepackage{imakeidx}
[...]

\begin{document}

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla placerat lectus sit amet augue facilisis, eget viverra sem pellentesque. Nulla vehicula metus risus, vel condimentum nunc dignissim eget. Vivamus quis sagittis tellus, eget ullamcorper libero. Nulla vitae fringilla nunc. Vivamus id suscipit mi. Phasellus porta lacinia dolor, at congue eros rhoncus vitae. Donec vel condimentum sapien. Curabitur est massa, finibus vel iaculis id, dignissim nec nisl. Sed non justo orci. Morbi quis orci efficitur sem porttitor pulvinar. Duis consectetur rhoncus posuere. Duis cursus neque semper lectus fermentum rhoncus.

\end{document}

I want the script to process only the lines between \begin{document} and \end{document}. Everything before that is package imports, variable definitions and so on, which don't interest me at the moment.

How do I make it so that it only processes the text within that pattern?

CodePudding user response:

AWK (GNU AWK included) has a feature called a range pattern: when you provide two conditions separated by a comma, the action is applied to every line from one that matches the first condition through the next one that matches the second, including both of those lines. Consider the following simple example. Let the content of file.txt be

junk
\begin{document}
desired text
more desired text
\end{document}
more junk

then

awk '$0=="\\begin{document}",$0=="\\end{document}"{print}' file.txt

gives output

\begin{document}
desired text
more desired text
\end{document}

(tested in gawk 4.2.1)
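
The range above also prints the two marker lines. If those are not wanted, one possible variation is to keep the range and skip the markers inside the action. This is only a sketch: the temporary file and the variable names `input` and `result` are made up for illustration, and the sample data is the same file.txt content as above.

```shell
# Recreate the sample input from above in a temporary file (illustrative only).
input=$(mktemp)
cat > "$input" <<'EOF'
junk
\begin{document}
desired text
more desired text
\end{document}
more junk
EOF

# The range selects the block including both marker lines;
# the test inside the action then drops the two marker lines themselves.
result=$(awk '
$0 == "\\begin{document}", $0 == "\\end{document}" {
    if ($0 != "\\begin{document}" && $0 != "\\end{document}") print
}' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

With the sample input this should leave only the two "desired text" lines.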

CodePudding user response:

Use a regex to set a flag and then print based on that flag:

awk '/^\\begin{document}/{flag=1} 
flag
/^\\end{document}/{flag=0}' file

That prints everything between the start and end strings, inclusive:

\begin{document}

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla placerat lectus sit amet augue facilisis, eget viverra sem pellentesque. Nulla vehicula metus risus, vel condimentum nunc dignissim eget. Vivamus quis sagittis tellus, eget ullamcorper libero. Nulla vitae fringilla nunc. Vivamus id suscipit mi. Phasellus porta lacinia dolor, at congue eros rhoncus vitae. Donec vel condimentum sapien. Curabitur est massa, finibus vel iaculis id, dignissim nec nisl. Sed non justo orci. Morbi quis orci efficitur sem porttitor pulvinar. Duis consectetur rhoncus posuere. Duis cursus neque semper lectus fermentum rhoncus.

\end{document}

If you only want the text between and not including the start and end strings:

awk '
/^\\begin{document}/{flag=1; next} 
/^\\end{document}/{flag=0}
flag' file

Prints:

# leading blank line printed...
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla placerat lectus sit amet augue facilisis, eget viverra sem pellentesque. Nulla vehicula metus risus, vel condimentum nunc dignissim eget. Vivamus quis sagittis tellus, eget ullamcorper libero. Nulla vitae fringilla nunc. Vivamus id suscipit mi. Phasellus porta lacinia dolor, at congue eros rhoncus vitae. Donec vel condimentum sapien. Curabitur est massa, finibus vel iaculis id, dignissim nec nisl. Sed non justo orci. Morbi quis orci efficitur sem porttitor pulvinar. Duis consectetur rhoncus posuere. Duis cursus neque semper lectus fermentum rhoncus.
# ending blank line printed...
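
To tie this back to the script in the question, the flag can guard the rule that calls printRow. The following is a rough sketch, not a definitive version. It assumes that blank lines should be skipped (hence the extra `NF` test), it escapes the braces in the regexes as a portability precaution, and it declares `i` as an extra function parameter so that it stays local. The shortened sample input and the shell variable names `input` and `html` are made up for illustration.

```shell
# Shortened sample document (illustrative only).
input=$(mktemp)
cat > "$input" <<'EOF'
\documentclass{article}
\usepackage{graphicx}
\begin{document}

Lorem ipsum dolor sit amet.

\end{document}
EOF

html=$(awk '
BEGIN { FS = "\n"; print "<html><body>" }

# Same helper as in the question; i is an extra parameter so it stays local.
function printRow(tag,    i) {
    for (i = 1; i <= NF; i++) print "<" tag ">" $i "</" tag ">"
}

/^\\begin\{document\}/ { flag = 1; next }   # start after this line
/^\\end\{document\}/   { flag = 0 }         # stop at this line
flag && NF { printRow("p") }                # assumption: skip blank lines

END { print "</body></html>" }
' "$input")

printf '%s\n' "$html"
rm -f "$input"
```

The preamble lines never reach printRow because the flag is still unset when they are read, so the `NR>1` test from the original script is no longer needed for that purpose.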