Linux - Get Substring from 1st occurrence of character

FILE1.TXT

0020220101

01 20220101

Need to extract the date part from the file, where the text starts from 2.

Options tried:

t_FILE_DT1='awk -F"2" '{PRINT $NF}' FILE1.TXT'
t_FILE_DT2='cut -d'2' -f2- FILE1.TXT'

echo "$t_FILE_DT1"
echo "$t_FILE_DT2"

1st output : 0101

2nd output : 0220101

Expected Output: 20220101

I'm new to Linux scripting. Could someone help me see where I'm going wrong?

CodePudding user response：

Use grep like so:

printf '0020220101\n01 20220101\n' | grep -P -o '\d{8}\b'
20220101
20220101

Here, GNU grep uses the following options:
-P : Use Perl regexes.
-o : Print only the matched parts, one match per output line, not the entire lines.
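
As a side note, -P is specific to GNU grep. Below is a minimal sketch of a variant that avoids it, assuming the date is always the last eight characters of the line (an assumption based on the sample input) and that your grep supports -o, which GNU and BSD grep both do. The temporary file only stands in for FILE1.TXT:

```shell
# Sample input from the question, written to a temporary file.
tmpfile=$(mktemp)
printf '0020220101\n01 20220101\n' > "$tmpfile"

# -E: extended regular expressions; -o: print only the matched part.
# [0-9]{8}$ matches exactly eight digits anchored at the end of the line.
dates=$(grep -Eo '[0-9]{8}$' "$tmpfile")

printf '%s\n' "$dates"
rm -f "$tmpfile"
```

Anchoring with $ does the job that \b does in the Perl regex for this input, since both sample lines end with the date.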

CodePudding user response：

Using any awk:

$ awk '{print substr($0,length()-7)}' file
20220101
20220101

The above was run on this input file:

$ cat file
0020220101
01 20220101

Regarding PRINT $NF in your question: PRINT != print. Get out of the habit of using all caps unless you're writing Cobol. See the Stack Overflow question "Correct Bash and shell script variable capitalization" for some reasons.

The 2 in your scripts is telling awk and cut to use the character 2 as the field separator, so each will carve up the input into substrings everywhere a 2 occurs.

The 's in your question are single quotes, which make strings literal. You were intending to use backticks, `cmd`, but those are deprecated in favor of $(cmd) anyway.
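
To make the quoting difference concrete, here is a small sketch. The temporary directory and the variable names not_run and file_dt are only for illustration and are not from the question:

```shell
# Recreate the question's sample file in a scratch directory.
workdir=$(mktemp -d)
printf '0020220101\n01 20220101\n' > "$workdir/FILE1.TXT"

# Single quotes: the variable holds the literal text, nothing is executed.
not_run='wc -l FILE1.TXT'

# $(...): the command runs and its standard output is captured.
file_dt=$(awk '{print substr($0, length($0) - 7)}' "$workdir/FILE1.TXT")

echo "$not_run"   # prints the command text itself
echo "$file_dt"   # prints the captured output, one date per line

rm -rf "$workdir"
```

Quoting "$file_dt" in the echo matters too: without the double quotes the shell would join the two lines into one.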

CodePudding user response：

Instead of looking for what comes "after" the 2 (and having to worry about whether a space is involved as well), I would think about extracting the last 8 characters, which you know for a fact is your date.

input="/path/to/txt/file/FILE1.TXT"
while IFS= read -r line
do
   # Take the last 8 characters of $line; you know this is the date,
   # so there is no need to worry about exact matching or spaces.

   myDate=${line: -8}
   echo "$myDate"
done < "$input"
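
Note that ${line: -8} is a bash feature. If the script has to run under a plain POSIX sh, the same idea can be sketched with prefix and suffix stripping. last8 is a hypothetical helper name, not something from the loop above:

```shell
# last8: print the last 8 characters of its first argument.
last8() {
    # ${1%????????} strips the final 8 characters, leaving the leading part;
    # removing that leading part from the front leaves only the last 8.
    printf '%s\n' "${1#"${1%????????}"}"
}

last8 "0020220101"
last8 "01 20220101"
```

Like the bash expansion, this yields an empty string when the input is shorter than 8 characters, so you may want to check the length first if short lines are possible.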

CodePudding user response：

About the cut and awk commands that you tried:

Using awk -F"2" '{PRINT $NF}' file sets the field separator to 2, and $NF is the last field, so printing the last field gives 0101

Using cut -d'2' -f2- file also uses 2 as the delimiter and prints all fields from the second one onward, with the delimiters between them, which gives 0220101
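
To see the splitting for yourself, here is a small sketch that prints every field awk produces for the first sample line when 2 is the separator. The "field N:" labels and brackets are only for illustration:

```shell
# Show each field awk sees when "2" is the field separator.
fields=$(printf '0020220101\n' |
    awk -F'2' '{ for (i = 1; i <= NF; i++) printf "field %d: [%s]\n", i, $i }')

printf '%s\n' "$fields"
# field 1: [00]
# field 2: [0]
# field 3: []
# field 4: [0101]
```

The last field is 0101, which is what $NF refers to. Fields 2 through 4 joined back together with the 2 delimiter give 0220101, which is what cut printed.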

If you want to match the 2 followed by 7 digits until the end of the string:

awk '
match ($0, /2[0-9]{7}$/) {
  print substr($0, RSTART, RLENGTH)
}
' file

Output

20220101
20220101

CodePudding user response：

The accepted answer shows how to extract the first eight digits, but that's not what you asked.

grep -o '2.*' file

will extract from the first occurrence of 2, and

grep -o '2[0-9]*' file

will extract each 2 together with the run of digits that follows it. If you specifically want eight digits, try

grep -Eo '2[0-9]{7}' file

maybe also with a -w option if you want to only accept a match between two word boundaries. If you specifically want only digits after the first occurrence of 2, maybe try

sed -n 's/[^2]*\(2[0-9]*\).*/\1/p' file
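
If you would rather not start an external process, extracting "from the first 2 to the end of the string" can also be sketched with POSIX parameter expansion. from_first_2 is a hypothetical helper name, and unlike the sed command it keeps everything after the 2, not only digits:

```shell
# from_first_2: print the argument starting at its first "2".
from_first_2() {
    case $1 in
        *2*)
            # ${1%%2*} is everything before the first 2;
            # stripping that prefix leaves the text from the first 2 onward.
            printf '%s\n' "${1#"${1%%2*}"}"
            ;;
        *)
            # No 2 anywhere in the input: print nothing, signal failure.
            return 1
            ;;
    esac
}

from_first_2 "0020220101"
from_first_2 "01 20220101"
```

Both calls print 20220101 for the sample input. The case guard is there because without it an input containing no 2 would silently produce an empty string.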