Can someone please help me set up a regular expression?
I have a large LaTeX3 TeXDoc file. LaTeX3 TeXDoc defines the macro \TestFiles{}, which is used to list the names of files that serve as unit tests. You can name more than one file between the braces, so \TestFiles{foo-bar} and \TestFiles{foo-bar, bar baz,foo_bar_baz} are both syntactically correct uses of this macro.
I would like to write a bash script that extracts all the unit test files named in the \TestFiles{} macros, compiles them with pdflatex, and checks whether pdflatex is able to produce an output file successfully.
I have something like this in my script:
function get_filenames () {
    ## This regex works but is not sensible enough
    # regex='\\TestFiles{(.*)}'
    ## This also works, but is again not precise enough
    regex='\\TestFiles{([0-9a-zA-Z -_, ]*)}'
    ## This should give more than one matching group
    ## (separated by ", " or ","), but this regex doesn't
    ## work. I have no idea why, or how to modify it to get
    ## it working
    while read -r line ; do
        if [[ $line =~ $regex ]] ; then
            i=1
            while [ $i -le 3 ]; do
                echo "Match $i: \"${BASH_REMATCH[$i]}\""
                i=$(( i + 1 ))
            done
            echo
        fi
    done < mystyle.dtx
}
Here is an excerpt of the DTX file
\TestFiles{foo-bar}
\TestFiles{foo-bar, bar baz,foo_bar_baz}
(You can store this as mystyle.dtx, in order to reproduce the next example.)
Using the above noted examples, my script gives me the following results:
get_filenames
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar, bar baz,foo_bar_baz"
Match 2: ""
Match 3: ""
I was not able to modify my regex so that it splits the content of the last example, \TestFiles{foo-bar, bar baz,foo_bar_baz}, into three matching results.
I tried a regular expression like this: regex='\\TestFiles{([[:alnum:] -_]*)[,]+[ ]*}'. I thought that [[:alnum:] -_]* would match the file names. As far as I understand regular expressions, the (...) should form a group that is afterwards listed in the bash array BASH_REMATCH[$i].
The part [,]+ should reflect that file names must be separated by at least one comma. Between the file names there may be some white space, so something like [[:space:]]* or at least [ ]* should represent this. The quantifier * means any number of repetitions, from 0 upwards, while + means one or more repetitions.
But that regular expression did not work at all; it produced no matches.
How must regex be defined to store each file name as a matching group? I am looking for the regular expression that gives this result:
get_filenames
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar"
Match 2: "bar baz"
Match 3: "foo_bar_baz"
EDIT: in my real-world files, there may be (and are) more than three test files.
Thanks in advance.
CodePudding user response:
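You can set IFS=',' and have bash do the splitting for you. Split on the comma only: with a space in IFS, the name "bar baz" would be split in two as well, so the blank after each comma is trimmed in a separate step.
line='\TestFiles{foo-bar, bar baz,foo_bar_baz}'
[[ $line = \\TestFiles{* ]] && {
    # Remove leading '\TestFiles{'
    line=${line#*\{}
    # Remove trailing '}' (the brace must be escaped inside the expansion)
    line=${line%\}*}
    IFS=',' read -r -a filenames <<< "$line"
    # Trim the blank that may follow a comma
    for i in "${!filenames[@]}"; do
        filenames[i]=${filenames[i]#"${filenames[i]%%[![:space:]]*}"}
    done
    declare -p filenames
}
declare -a filenames=([0]="foo-bar" [1]="bar baz" [2]="foo_bar_baz")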
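Since the real files list more than three names, the same comma-splitting idea can be applied to a whole file. The following is only a sketch under assumptions: the function name list_test_files is hypothetical, each \TestFiles{...} macro is assumed to start its line and to fit on one line, and names are assumed not to contain commas or braces.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the answer above): print every name listed
# in the \TestFiles{...} lines of the file given as $1, one name per line.
list_test_files() {
    local line name
    local -a names
    # "|| [[ -n $line ]]" also handles a last line without a trailing newline
    while IFS= read -r line || [[ -n $line ]]; do
        # Skip lines that do not start with \TestFiles{
        [[ $line == \\TestFiles\{* ]] || continue
        line=${line#*\{}     # drop everything up to the opening brace
        line=${line%%\}*}    # drop the closing brace and anything after it
        # Split on commas only, so names containing spaces stay intact
        IFS=',' read -r -a names <<< "$line"
        for name in "${names[@]}"; do
            name=${name#"${name%%[![:space:]]*}"}   # trim leading whitespace
            name=${name%"${name##*[![:space:]]}"}   # trim trailing whitespace
            if [[ -n $name ]]; then
                printf '%s\n' "$name"
            fi
        done
    done < "$1"
}
```

Called as list_test_files mystyle.dtx on the two example lines, this should print foo-bar, foo-bar, bar baz and foo_bar_baz, each on its own line, however many names a macro contains.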
CodePudding user response:
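I believe this is the regular expression you're looking for:
(?<=\\TestFiles{.*)([\w\d\-\ _]+)[, }]
You can see it working, modify it, and read an explanation of what it does at the following link: https://regex101.com/r/0W8PBi/1
Note that it relies on a lookbehind, which the POSIX extended regular expressions used by bash's =~ operator do not support, so it needs a different regex engine.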
CodePudding user response:
Use set with IFS to split each line into new positional parameters. Assign $@ to an array so that elements can be accessed by index. Trying this with $@ directly results in a bad substitution error.
get-filenames.sh
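#!/usr/bin/env bash
get_filenames() {
    # No space in IFS, so that names containing spaces stay intact
    local IFS='{},'
    local -a names
    local line name i
    while read -r line; do
        set -- $line
        names=("$@")
        test "${names[0]}" == '\TestFiles' && {
            for i in {1..3}; do
                # Strip the blank that may follow a comma
                name=${names[i]}
                name=${name#"${name%%[![:space:]]*}"}
                printf 'Match %i: "%s"\n' "$i" "$name"
            done
        }
        echo
    done < 'mystyle.dtx'
}
get_filenames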
mystyle.dtx
\TestFiles{foo-bar}
\TestFiles{foo-bar, bar baz,foo_bar_baz}
output
Match 1: "foo-bar"
Match 2: ""
Match 3: ""

Match 1: "foo-bar"
Match 2: "bar baz"
Match 3: "foo_bar_baz"
CodePudding user response:
EDIT (without external programs, though it's rather impractical, and tied to exactly three matches)
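function get_filenames () {
    # A name may contain spaces, so stop only at a comma or the closing brace
    p='([^,}]*),? *'
    # Single quotes keep the double backslash, which matches a literal '\'
    regex='\\TestFiles\{'"$p$p$p"
    while read -r line ; do
        if [[ $line =~ $regex ]] ; then
            i=1
            while [ $i -le 3 ]; do
                echo "Match $i: \"${BASH_REMATCH[$i]}\""
                i=$(( i + 1 ))
            done
            echo
        fi
    done < mystyle.dtx
}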
If you really need to output exactly three file names (even empty ones) for each \TestFiles row, then here's the code.
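function get_filenames () {
    local MAX_FILES_CNT=3
    # Split command output on newlines only, so names with spaces survive
    local IFS=$'\n'
    local line filename i
    local -a filenames
    for line in $(grep -oP '\\TestFiles\{\K[^}]*' < mystyle.dtx); do
        filenames=()
        # One name per match: it starts and ends with a character
        # that is neither a comma nor whitespace
        for filename in $(grep -oP '[^,\s]([^,]*[^,\s])?' <<< "$line"); do
            filenames+=("$filename")
        done
        i=0
        while [ $i -lt $MAX_FILES_CNT ]; do
            echo "Match $((i + 1)): \"${filenames[i]}\""
            i=$(( i + 1 ))
        done
        echo ""
    done
}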
Match 1: "foo-bar"
Match 2: ""
Match 3: ""
Match 1: "foo-bar"
Match 2: "bar baz"
Match 3: "foo_bar_baz"
By the way, BASH_REMATCH is not well suited to this task, because a repeated capture group keeps only its last iteration. Look:
[[ "asdf" =~ (.)* ]]
echo "${BASH_REMATCH[@]}"
asdf f
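One way around the "last iteration only" behaviour is to match a single name per pass and loop over the remainder of the string. The sketch below is an illustration only: the function name split_test_files and the global array matches are hypothetical, and it assumes the macro fits on one line with no empty entries between commas.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fill the global array "matches" with the names found in
# the \TestFiles{...} macro of the line passed as $1.
split_test_files() {
    local rest
    # Isolates the text between the braces
    local outer='\\TestFiles\{([^}]*)\}'
    # One name per pass: group 1 is the trimmed name, group 4 is whatever
    # follows the next comma (empty once the last name has been consumed)
    local item='^[[:space:]]*([^,[:space:]]([^,]*[^,[:space:]])?)[[:space:]]*(,(.*))?$'
    matches=()
    [[ $1 =~ $outer ]] || return 1
    rest=${BASH_REMATCH[1]}
    while [[ $rest =~ $item ]]; do
        matches+=("${BASH_REMATCH[1]}")
        rest=${BASH_REMATCH[4]}
    done
}
```

After split_test_files '\TestFiles{foo-bar, bar baz,foo_bar_baz}', the array matches should hold foo-bar, bar baz and foo_bar_baz, and the loop is not tied to a fixed number of names.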
I would also recommend reading this question: https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice
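In the spirit of that recommendation, the extraction can also be left to standard text tools instead of a shell loop. This is a minimal sketch, assuming at most one \TestFiles{...} macro per line and names without commas or braces; the function name list_test_files_pipeline is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one test file name per line for the file in $1,
# using only sed and tr.
list_test_files_pipeline() {
    # 1. keep only the text between the braces of each \TestFiles{...} line
    # 2. put every comma-separated entry on its own line
    # 3. trim surrounding whitespace and drop empty lines
    sed -n 's/.*\\TestFiles{\([^}]*\)}.*/\1/p' "$1" |
        tr ',' '\n' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}
```

For the example mystyle.dtx this should print foo-bar, foo-bar, bar baz and foo_bar_baz on separate lines; the list can then be read line by line by whatever step runs pdflatex.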