Bash regex to check file extensions

Time:02-05

I am trying to check whether a given file has one of the extensions I expect: .fa, .fasta or .fasta.gz. Looking at other questions, I thought this would be trivial, but none of the suggestions I tried work for me.

This is what I have tried, none of which match:

#!/bin/bash

test1="abcdef.fa"
test2="ghijkl.fasta"
test3="mnopqr.fasta.gz"
echo "test1: $test1"
echo "test2: $test2"
echo "test3: $test3"

# Attempt 1
if [[ $test1 =~ *.fa|*.fasta|*.fasta.gz ]] &> /dev/null; then printf "Attempt1: Match with $test1\n"; fi
if [[ $test2 =~ *.fa|*.fasta|*.fasta.gz ]] &> /dev/null; then printf "Attempt1: Match with $test2\n"; fi
if [[ $test3 =~ *.fa|*.fasta|*.fasta.gz ]] &> /dev/null; then printf "Attempt1: Match with $test3\n"; fi

# Attempt 2 - do I need to quote the string?
if [[ "$test1" =~ *.fa|*.fasta|*.fasta.gz ]] &> /dev/null; then printf "Attempt2: Match with $test1\n"; fi
if [[ "$test2" =~ *.fa|*.fasta|*.fasta.gz ]] &> /dev/null; then printf "Attempt2: Match with $test2\n"; fi
if [[ "$test3" =~ *.fa|*.fasta|*.fasta.gz ]] &> /dev/null; then printf "Attempt2: Match with $test3\n"; fi

# Attempt 3 - alternative regex
if [[ $test1 =~ .\*.(fa|fasta|fasta.gz) ]] &> /dev/null; then printf "Attempt3: Match with $test1\n"; fi
if [[ $test2 =~ .\*.(fa|fasta|fasta.gz) ]] &> /dev/null; then printf "Attempt3: Match with $test2\n"; fi
if [[ $test3 =~ .\*.(fa|fasta|fasta.gz) ]] &> /dev/null; then printf "Attempt3: Match with $test3\n"; fi

# Attempt 4 - again with the quoted string
if [[ "$test1" =~ .\*.(fa|fasta|fasta.gz) ]] &> /dev/null; then printf "Attempt4: Match with $test1\n"; fi
if [[ "$test2" =~ .\*.(fa|fasta|fasta.gz) ]] &> /dev/null; then printf "Attempt4: Match with $test2\n"; fi
if [[ "$test3" =~ .\*.(fa|fasta|fasta.gz) ]] &> /dev/null; then printf "Attempt4: Match with $test3\n"; fi

# Attempt 5 - put $ on end of regex
if [[ $test1 =~ .\*.(fa|fasta|fasta.gz)$ ]] &> /dev/null; then printf "Attempt5: Match with $test1\n"; fi
if [[ $test2 =~ .\*.(fa|fasta|fasta.gz)$ ]] &> /dev/null; then printf "Attempt5: Match with $test2\n"; fi
if [[ $test3 =~ .\*.(fa|fasta|fasta.gz)$ ]] &> /dev/null; then printf "Attempt5: Match with $test3\n"; fi

# Attempt 6 - again with the quoted string
if [[ "$test1" =~ .\*.(fa|fasta|fasta.gz)$ ]] &> /dev/null; then printf "Attempt6: Match with $test1\n"; fi
if [[ "$test2" =~ .\*.(fa|fasta|fasta.gz)$ ]] &> /dev/null; then printf "Attempt6: Match with $test2\n"; fi
if [[ "$test3" =~ .\*.(fa|fasta|fasta.gz)$ ]] &> /dev/null; then printf "Attempt6: Match with $test3\n"; fi

# Attempt 7 - use double ||
if [[ $test1 =~ .\*.(fa||fasta||fasta.gz) ]] &> /dev/null; then printf "Attempt7: Match with $test1\n"; fi
if [[ $test2 =~ .\*.(fa||fasta||fasta.gz) ]] &> /dev/null; then printf "Attempt7: Match with $test2\n"; fi
if [[ $test3 =~ .\*.(fa||fasta||fasta.gz) ]] &> /dev/null; then printf "Attempt7: Match with $test3\n"; fi

I am close with this:

# Attempt 8 - escape parentheses
if [[ $test1 =~ .\*.\(fa|fasta|fasta.gz\) ]] &> /dev/null; then printf "Attempt8: Match with $test1\n"; fi
if [[ $test2 =~ .\*.\(fa|fasta|fasta.gz\) ]] &> /dev/null; then printf "Attempt8: Match with $test2\n"; fi
if [[ $test3 =~ .\*.\(fa|fasta|fasta.gz\) ]] &> /dev/null; then printf "Attempt8: Match with $test3\n"; fi

However, the first test does not match, and the output looks like this:

test1: abcdef.fa
test2: ghijkl.fasta
test3: mnopqr.fasta.gz
Attempt8: Match with ghijkl.fasta
Attempt8: Match with mnopqr.fasta.gz

What am I missing?

CodePudding user response:

You could try a case statement, something like:

case "$test1" in
  *.fa|*.fasta|*.fasta.gz) printf 'Attempt1: Match with %s\n' "$test1";;
esac

case "$test2" in
  *.fa|*.fasta|*.fasta.gz) printf 'Attempt1: Match with %s\n' "$test2";;
esac

case "$test3" in
  *.fa|*.fasta|*.fasta.gz) printf 'Attempt1: Match with %s\n' "$test3";;
esac

  • See help case

  • See LESS='+/case word in' man bash
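
If the check is needed in more than one place, the case statement above could be wrapped in a small function. This is only a sketch: has_fasta_ext and the extra notes.txt test name are made up for illustration and are not part of the answer.

```bash
#!/usr/bin/env bash

# has_fasta_ext is a hypothetical helper name, not from the original answer.
# Returns 0 if the name ends in .fa, .fasta or .fasta.gz, and 1 otherwise.
has_fasta_ext() {
  case "$1" in
    *.fa|*.fasta|*.fasta.gz) return 0 ;;
    *) return 1 ;;
  esac
}

# notes.txt is an extra name added here to show a non-match.
for f in abcdef.fa ghijkl.fasta mnopqr.fasta.gz notes.txt; do
  if has_fasta_ext "$f"; then
    printf 'Match: %s\n' "$f"
  else
    printf 'No match: %s\n' "$f"
  fi
done
```

The patterns in a case statement are globs, so *.fa means "anything followed by .fa" and no regex escaping is needed.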

CodePudding user response:

The =~ operator takes a regular expression (POSIX ERE), not a glob pattern. Try \.(fa|fasta|fasta\.gz)$.

You can also use extended pattern matching with ==: [[ $test1 == *.@(fa|fasta|fasta.gz) ]]
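
This also explains the odd result of Attempt 8. Inside [[ ... =~ ... ]] a backslash makes the next character literal, so \( and \) no longer group anything and the | applies to the whole pattern. The regex then reads "a literal (fa after some characters, or fasta, or fasta.gz followed by a literal )", and any name that merely contains fasta matches. A rough sketch comparing the three forms follows; the function names and the my_fasta_notes.txt test name are made up for illustration.

```bash
#!/usr/bin/env bash

# All function names below are hypothetical, for illustration only.

# Anchored ERE: a literal dot, one of the three extensions, end of string.
matches_regex() {
  local re='\.(fa|fasta|fasta\.gz)$'
  [[ $1 =~ $re ]]
}

# Extended glob: @(a|b|c) matches exactly one of the alternatives.
# Recent bash versions enable this inside [[ ]] on their own; the shopt
# call is only there as a precaution for older ones.
shopt -s extglob
matches_extglob() {
  [[ $1 == *.@(fa|fasta|fasta.gz) ]]
}

# The Attempt 8 pattern from the question, unchanged. With the parentheses
# escaped, the alternation is top-level, so "fasta" anywhere is enough.
attempt8() {
  [[ $1 =~ .\*.\(fa|fasta|fasta.gz\) ]]
}

for f in abcdef.fa ghijkl.fasta mnopqr.fasta.gz my_fasta_notes.txt; do
  for check in matches_regex matches_extglob attempt8; do
    if "$check" "$f"; then
      printf '%-16s matches %s\n' "$check" "$f"
    fi
  done
done
```

Here attempt8 accepts my_fasta_notes.txt and rejects abcdef.fa, while the other two checks accept exactly the three intended names.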

CodePudding user response:

It's much easier to define the regex in a variable:

#!/usr/bin/env bash

test1="abcdef.fa"
test2="ghijkl.fasta"
test3="mnopqr.fasta.gz"
echo "test1: $test1"
echo "test2: $test2"
echo "test3: $test3"

# Escape both dots so each matches a literal "." rather than any character
pattern='\.(fa|fasta|fasta\.gz)$'
# Attempt 1
if [[ $test1 =~ $pattern ]]; then printf 'Attempt1: Match with %s\n' "$test1"; fi
if [[ $test2 =~ $pattern ]]; then printf 'Attempt1: Match with %s\n' "$test2"; fi
if [[ $test3 =~ $pattern ]]; then printf 'Attempt1: Match with %s\n' "$test3"; fi
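
One thing to watch with the variable approach: the variable must stay unquoted on the right-hand side of =~, because a quoted right-hand side is matched as a literal string. A small sketch of that, plus reading the matched extension back from BASH_REMATCH, assuming bash 3.2 or later; fasta_ext and quoted_match are made-up names.

```bash
#!/usr/bin/env bash

pattern='\.(fa|fasta|fasta\.gz)$'

# fasta_ext is a hypothetical helper: prints the matched extension (without
# the leading dot) and returns 0, or returns 1 if the name does not match.
fasta_ext() {
  if [[ $1 =~ $pattern ]]; then
    # BASH_REMATCH[1] holds the text matched by the first (...) group.
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

# Quoting the right-hand side makes bash match the regex text literally,
# so this fails for ordinary file names.
quoted_match() {
  [[ $1 =~ "$pattern" ]]
}

for f in abcdef.fa ghijkl.fasta mnopqr.fasta.gz; do
  if ext=$(fasta_ext "$f"); then
    printf '%s has extension %s\n' "$f" "$ext"
  fi
  if ! quoted_match "$f"; then
    printf '%s does not match the quoted pattern\n' "$f"
  fi
done
```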