Extracting file names in bash with regex

Time:02-14

Can someone please help me set up a regular expression?

I have a large LaTeX3 TeXDoc file. LaTeX3 TeXDoc defines the macro \TestFiles{}, which is used to list the names of files that should themselves be used as unit tests. You can name more than one file between the braces, so \TestFiles{foo-bar} and \TestFiles{foo-bar, bar baz,foo_bar_baz} are both syntactically correct uses of this macro.

I would like to write a bash script that extracts all the unit test files named in the \TestFiles{} macros, compiles them with pdflatex, and checks whether pdflatex is able to produce an output file successfully.

I have something like this in my script:

function get_filenames () {
  ## This regex works but is not sensible enough
  # regex='\\TestFiles{(.*)}'
  ## This works also, but is again not precise enough
  regex='\\TestFiles{([0-9a-zA-Z -_, ]*)}'
  ## This should give more than one matching group 
  ## (separated by ", " or ","), but this regex doesn't 
  ## work.  I have no idea why or how to modify, to get 
  ## it working
  
  while read -r line ; do
    if [[ $line =~ $regex ]] ; then
      i=1
      while [ $i -le 3 ]; do
        echo "Match $i: \"${BASH_REMATCH[$i]}\""
        i=$(( i + 1 ))
      done
      echo
    fi
  done < mystyle.dtx
}

Here is an excerpt of the DTX file

\TestFiles{foo-bar}

\TestFiles{foo-bar, bar baz,foo_bar_baz}

(You can store this as mystyle.dtx, in order to reproduce the next example.)

Using the above noted examples, my script gives me the following results:

get_filenames
Match 1: "foo-bar"
Match 2: ""
Match 3: ""

Match 1: "foo-bar, bar baz,foo_bar_baz"
Match 2: ""
Match 3: ""

I wasn't able to modify my regular expression so that it splits the content of the last example, \TestFiles{foo-bar, bar baz,foo_bar_baz}, into three matching results.

I tried a regular expression like this: regex='\\TestFiles{([[:alnum:] -_]*)[,]+[ ]*}'. I thought the [[:alnum:] -_]* would match the file names. As far as I understand regular expressions, the (...) should form a group that is listed afterwards in the bash array BASH_REMATCH[$i].

The part [,]+ should reflect that file names must be separated by at least one comma. Between the file names there might be some white space, so something like [[:space:]]* or at least [ ]* should represent this. The quantifier * means any number of repetitions, ranging from 0 upwards, while + means the item must appear one or more times.

But that regular expression did not work at all; it had no matching results.

How must regex be defined to store each file name as a matching group? I am searching for the correct regular expression to get this result:

get_filenames
Match 1: "foo-bar"
Match 2: ""
Match 3: ""

Match 1: "foo-bar"
Match 2: "bar baz"
Match 3: "foo_bar_baz"

EDIT: in my real-world files, there may be (and are) more than three test files.

Thanks in advance.

CodePudding user response:

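You can set IFS=',' and have bash do the splitting for you. (With IFS=', ' the space would also split "bar baz" into two names, so split on commas only and trim the spaces around each name afterwards.)

line='\TestFiles{foo-bar, bar baz,foo_bar_baz}'

[[ $line = \\TestFiles{* ]] && {
    # Remove leading '\TestFiles{'
    line=${line#*\{}
    # Remove trailing '}' (the brace must be escaped inside ${...})
    line=${line%\}}

    IFS=',' read -r -a filenames <<< "$line"

    # Trim leading and trailing spaces from each name
    for i in "${!filenames[@]}"; do
        name=${filenames[i]}
        name=${name#"${name%%[! ]*}"}
        name=${name%"${name##*[! ]}"}
        filenames[i]=$name
    done

    declare -p filenames
}
declare -a filenames=([0]="foo-bar" [1]="bar baz" [2]="foo_bar_baz")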
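
With the names in an array, the compile-and-check step mentioned in the question could be sketched roughly as follows. This is only a sketch under assumptions: compile_tests and the LATEX override variable are hypothetical names that do not appear in the original script, and it assumes each name can be passed to pdflatex as-is (pdflatex appends .tex when the extension is missing).

```bash
#!/usr/bin/env bash

# compile_tests is a hypothetical helper. LATEX defaults to pdflatex but can
# be overridden (for example with a stub) to try the loop without TeX.
compile_tests() {
    local latex=${LATEX:-pdflatex}
    local name failed=0

    for name in "$@"; do
        # -interaction=nonstopmode: never stop to prompt the user
        # -halt-on-error: give up at the first error
        if "$latex" -interaction=nonstopmode -halt-on-error "$name" > /dev/null; then
            printf 'OK:   %s\n' "$name"
        else
            printf 'FAIL: %s\n' "$name"
            failed=1
        fi
    done

    # Non-zero if at least one file failed to compile
    return "$failed"
}
```

It would be called as compile_tests "${filenames[@]}"; quoting the expansion keeps a name such as "bar baz" in one piece.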

CodePudding user response:

I believe this is the regular expression you're looking for:

(?<=\\TestFiles{.*)([\w\d\-\ _]+)[, }]+

You can see it working, modify it and have an explanation on what it does in the following link: https://regex101.com/r/0W8PBi/1

CodePudding user response:

Use set with IFS to split each line into new positional parameters. Assign $@ to an array so that elements can be accessed by index. Trying this with $@ directly results in a bad substitution error.

get-filenames.sh

#!/usr/bin/env bash

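get_filenames() {
    # Split on braces and commas only; a space in IFS would also
    # split "bar baz" into two names.
    local IFS='{},'
    local -a names
    local line name i

    while IFS= read -r line; do
        set -- $line
        names=("$@")
        test "${names[0]}" == '\TestFiles' && {
            for i in {1..3}; do
                name=${names[$i]}
                name=${name#"${name%%[! ]*}"}   # trim leading spaces
                name=${name%"${name##*[! ]}"}   # trim trailing spaces
                printf 'Match %i: "%s"\n' "$i" "$name"
            done
            echo
        }
    done < 'mystyle.dtx'
}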

get_filenames

mystyle.dtx

\TestFiles{foo-bar}
\TestFiles{foo-bar, bar baz,foo_bar_baz}

output

Match 1: "foo-bar"
Match 2: ""
Match 3: ""

Match 1: "foo-bar"
Match 2: "bar baz"
Match 3: "foo_bar_baz"

CodePudding user response:

EDIT (without external programs, though it's rather impractical, and tied to exactly three matches)

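function get_filenames () {
    # Optional leading spaces, a name that may contain inner spaces
    # (trailing spaces before a comma are kept), then an optional comma
    p=' *([^,} ][^,}]*)?,?'
    regex='\\TestFiles\{'"$p$p$p"

    while read -r line ; do
        if [[ $line =~ $regex ]] ; then
            i=1
            while [ $i -le 3 ]; do
                echo "Match $i: \"${BASH_REMATCH[$i]}\""
                i=$(( i + 1 ))
            done
            echo
        fi
    done < mystyle.dtx
}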

If you really need to output exactly three file names (even empty ones) for each \TestFiles row, then here's the code. It needs a grep that supports -P (Perl-compatible regular expressions), such as GNU grep.

function get_filenames () {
    MAX_FILES_CNT=3
    IFS=$'\n'
    for line in $(grep -oP '\\TestFiles\{\K[^}]*' < mystyle.dtx); do
        filenames=()
        # A name starts and ends with a character that is neither a comma
        # nor whitespace, so "bar baz" stays in one piece
        for filename in $(grep -m $MAX_FILES_CNT -oP '[^,\s](?:[^,]*[^,\s])?' <<< "$line"); do
            filenames+=("$filename")
        done
        i=0
        while [ $i -lt $MAX_FILES_CNT ]; do
            echo "Match $((i + 1)): \"${filenames[i]}\""
            i=$(( i + 1 ))
        done
        echo ""
    done
    unset IFS
}

Match 1: "foo-bar"
Match 2: ""
Match 3: ""

Match 1: "foo-bar"
Match 2: "bar baz"
Match 3: "foo_bar_baz"

By the way, BASH_REMATCH is not well suited to this task, because a repeated capture group keeps only its last match. Look:

[[ "asdf" =~ (.)* ]]
echo "${BASH_REMATCH[@]}"

asdf f
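
One way around this limitation, for lists of any length, is to apply a regex repeatedly and peel one name off the front of the list on each pass. The following is a minimal sketch of that idea, not a definitive implementation: extract_names is a hypothetical helper name, and the sketch assumes there are no empty entries (it stops at the first one, as in "a,,b").

```bash
#!/usr/bin/env bash

# Print every file name listed in a \TestFiles{...} line, one per line.
# extract_names is a hypothetical helper, not part of the original script.
extract_names() {
    local line=$1 rest
    # Group 1: everything between the braces
    local list_re='\\TestFiles\{([^}]*)\}'
    # Group 1: one name that starts and ends with a non-space, non-comma
    # character; group 3: whatever follows the next comma, if any
    local name_re='^ *([^, ][^,]*[^, ]|[^, ]) *(,(.*))?$'

    [[ $line =~ $list_re ]] || return 1
    rest=${BASH_REMATCH[1]}

    # Each pass consumes exactly one name, so any number of names works
    while [[ $rest =~ $name_re ]]; do
        printf '%s\n' "${BASH_REMATCH[1]}"
        rest=${BASH_REMATCH[3]}
    done
}

extract_names '\TestFiles{foo-bar, bar baz,foo_bar_baz}'
# prints:
# foo-bar
# bar baz
# foo_bar_baz
```

Inside the while read -r line loop from the question, the names could then be collected with something like mapfile -t filenames < <(extract_names "$line") (bash 4 or later).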

Also, I would recommend reading this question: https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice
