Extract string between underscores and dot

Time:03-01

I have strings like these:

/my/directory/file1_AAA_123_k.txt 
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt

So basically, the number of underscores is not fixed. I would like to extract the string between the first underscore and the dot. So the output should be something like this:

AAA_123_k
CCC
KK_45

I found this solution that works:

string='/my/directory/file1_AAA_123_k.txt'
tmp="${string%.*}"
echo "$tmp" | sed 's/^[^_:]*[_:]//'

But I am wondering if there is a more 'elegant' solution (e.g. a one-liner).

CodePudding user response:

With bash version >= 3.0 and a regex:

[[ "$string" =~ _(.+)\. ]] && echo "${BASH_REMATCH[1]}"
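
As a rough sketch of how that one-liner might be reused, the helper below runs it against the sample paths from the question. The function name is made up for illustration. The pattern is kept in a variable, which is a common way to avoid quoting surprises with =~. Since .+ is greedy, the capture runs up to the last dot, and the sketch assumes the directory part contains no underscore.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the text between the first "_" and the last "."
between_underscore_and_dot() {
    local re='_(.+)\.'          # greedy: the capture ends at the last dot
    [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

for string in /my/directory/file1_AAA_123_k.txt \
              /my/directory/file2_CCC.txt \
              /my/directory/file2_KK_45.txt; do
    between_underscore_and_dot "$string"
done
```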

CodePudding user response:

You can use a single sed command like

sed -n 's~^.*/[^_/]*_\([^/]*\)\.[^./]*$~\1~p' <<< "$string"
sed -nE 's~^.*/[^_/]*_([^/]*)\.[^./]*$~\1~p' <<< "$string"

Details:

  • ^ - start of string
  • .* - any text
  • / - a / char
  • [^_/]* - zero or more chars other than / and _
  • _ - a _ char
  • \([^/]*\) (POSIX BRE) / ([^/]*) (POSIX ERE, enabled with the -E option) - Group 1: any zero or more chars other than /
  • \. - a dot
  • [^./]* - zero or more chars other than . and /
  • $ - end of string.

With -n, default line output is suppressed and p only prints the result of successful substitution.
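
To see the pattern in action, here is a small sketch that wraps the ERE variant in a function (the name extract_with_sed is an assumption, not part of the answer). Because the pattern anchors on the last /, an underscore in a directory name is skipped, and a file name without an underscore prints nothing.

```shell
# Hypothetical wrapper around the ERE variant of the sed command.
extract_with_sed() {
    printf '%s\n' "$1" | sed -nE 's~^.*/[^_/]*_([^/]*)\.[^./]*$~\1~p'
}

extract_with_sed '/my/directory/file1_AAA_123_k.txt'   # AAA_123_k
extract_with_sed '/my/directory/file2_CCC.txt'         # CCC
extract_with_sed '/my_dir/file2_KK_45.txt'             # KK_45 (the "_" in my_dir is ignored)
extract_with_sed '/my/directory/plain.txt'             # no underscore: prints nothing
```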

CodePudding user response:

If you need to process the file names one at a time (e.g., within a while read loop), you can perform two parameter expansions:

$ string='/my/directory/file1_AAA_123_k.txt.2'
$ tmp="${string#*_}"
$ tmp="${tmp%%.*}"
$ echo "${tmp}"
AAA_123_k
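
A minimal sketch of the while read loop mentioned above, assuming the names arrive one per line. The helper name strip_to_tag is hypothetical, and the extra ${1##*/} step is an addition that drops the directory first so an underscore in a directory name does not interfere. Note that %%.* cuts at the first dot, whereas the %.* used in the question cuts at the last one.

```shell
# Hypothetical helper built from the parameter expansions above.
strip_to_tag() {
    name=${1##*/}                 # drop the directory part (added assumption)
    name=${name#*_}               # drop everything up to the first "_"
    printf '%s\n' "${name%%.*}"   # drop everything from the first "." onward
}

while IFS= read -r path; do
    strip_to_tag "$path"
done <<'EOF'
/my/directory/file1_AAA_123_k.txt.2
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
EOF
```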

One idea to parse a list of file names at the same time:

$ cat file.list
/my/directory/file1_AAA_123_k.txt.2
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt

$ sed -En 's/[^_]*_([^.]+).*/\1/p' file.list
AAA_123_k
CCC
KK_45

CodePudding user response:

Using sed

$ sed 's/[^_]*_//;s/\..*//' input_file
AAA_123_k
CCC
KK_45

CodePudding user response:

With your shown samples and GNU grep, you could try the following code.

grep -oP '.*?_\K([^.]*)' Input_file

Explanation: GNU grep's -o option prints only the matched part of each line, and -P enables PCRE. The regex .*?_\K([^.]*) captures the value between the first _ and the first occurrence of a dot. It breaks down as follows:

.*?_     ##Lazily match from the start of the line up to the first occurrence of _
\K       ##\K discards everything matched so far, so that only the needed value is printed.
([^.]*)  ##Match everything up to the first occurrence of a dot.
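
The breakdown above can be tried on the sample names with a sketch like the following. It assumes a GNU grep built with PCRE support, and the function name extract_with_grep is made up for illustration. The capture group is dropped because \K already limits what is printed.

```shell
# Hypothetical wrapper; requires GNU grep with -P (PCRE) support.
extract_with_grep() {
    printf '%s\n' "$1" | grep -oP '.*?_\K[^.]*'
}

extract_with_grep '/my/directory/file1_AAA_123_k.txt'   # AAA_123_k
extract_with_grep '/my/directory/file2_CCC.txt'         # CCC
extract_with_grep '/my/directory/file2_KK_45.txt'       # KK_45
```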

CodePudding user response:

This is easy, except that it includes the initial underscore:

ls | grep -o "_[^.]*"
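
One possible way to drop that leading underscore is to pipe the match through cut, as in this sketch (the function name is hypothetical). Since grep -o prints every match, a name with another underscore after the first dot would produce an extra line.

```shell
# Hypothetical helper: same grep as above, then remove the first character.
strip_leading_underscore() {
    printf '%s\n' "$1" | grep -o '_[^.]*' | cut -c2-
}

strip_leading_underscore 'file1_AAA_123_k.txt'   # AAA_123_k
strip_leading_underscore 'file2_CCC.txt'         # CCC
strip_leading_underscore 'file2_KK_45.txt'       # KK_45
```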