Extract a substring when there is no clear pattern in Linux

Time:09-26

I am not super advanced in coding and have been struggling with this problem. I need to extract a substring from a .txt file, but there is no clear pattern that would let me use the awk or cut commands. I need to extract the value of AF in each line in the picture below (circled in blue). However, the number of characters in this string varies from line to line, and so does its position within the line. I tried using grep, but it only returns "AF=", not the numeric values that follow. I also thought about using re.findall in Python, but the Python environment I have in Ubuntu isn't letting me use it.

[image]

I would greatly appreciate any guidance, thank you!!!

CodePudding user response:

Since the example text is provided as an image rather than as text, here is my own example text (generated by randomly tapping the keyboard):

AF=32435.42235;dw=234;324f3rg;3frg4;3gr4w;g4rw5
j6u;5ju65e;t42r;g5b5;AF=32.43542235;dw=234;324f3rg;3frg4;3gr4w;g4rw5
3f4gh5y4bt4h5;g4;3h;4j64g;y;AF=32435.42235;dw=234;324f3rg;3frg4;3gr4w;g4rw5

What I noticed is that it's like a table, with fields separated by semicolons (;) and each value defined as KEY=VALUE.

To get just the value of the AF field, you can use grep with this pattern: AF=[0-9.]+

Explanation: [0-9.] matches any one of the characters 0123456789., and the trailing + means that character must occur one or more times.

Here is an example terminal session:

$ cat /tmp/a
AF=32435.42235;dw=234;324f3rg;3frg4;3gr4w;g4rw5
j6u;5ju65e;t42r;g5b5;AF=32.43542235;dw=234;324f3rg;3frg4;3gr4w;g4rw5
3f4gh5y4bt4h5;g4;3h;4j64g;y;AF=32435.42235;dw=234;324f3rg;3frg4;3gr4w;g4rw5

$ grep -o -E 'AF=[0-9.]+' /tmp/a
AF=32435.42235
AF=32.43542235
AF=32435.42235

Now, if you want only the numbers (without the AF= prefix), you can pipe the output into a second grep command:

$ grep -o -E 'AF=[0-9.]+' /tmp/a | grep -o -E '[0-9.]+'
32435.42235
32.43542235
32435.42235

Grep flag explanation: -E enables extended regular expressions; -o prints only the matched text instead of the whole line.
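
Once grep has reduced each match to the fixed shape AF=VALUE, cut becomes usable for the second step as well. A minimal sketch, assuming a grep that supports -o (GNU and BSD grep do); the function name extract_af is made up for illustration, and input is read from standard input:

```shell
# Hypothetical helper: print only the AF values found on stdin.
extract_af() {
  # Step 1: keep only the AF=<digits and dots> matches, one per line.
  # Step 2: split each match at "=" and keep the second field (the number).
  grep -o -E 'AF=[0-9.]+' | cut -d= -f2
}

printf '%s\n' \
  'AF=32435.42235;dw=234;324f3rg' \
  'j6u;5ju65e;AF=32.43542235;dw=234' | extract_af
# prints:
# 32435.42235
# 32.43542235
```

To run it against a file, redirect the file into the function, for example extract_af < /tmp/a.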

CodePudding user response:

You can use grep to match everything from AF= up to but not including the first semicolon:

grep -o 'AF=[^;]*'

To guard against spurious matches when AF= appears elsewhere in a line, the following will match only when AF= begins on a word boundary:

grep -o '\bAF=[^;]*'
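
To see what the word boundary changes, here is a small sketch. The sample line and its MAF key are invented for illustration, and \b is a GNU grep extension, so it may not behave the same in every grep:

```shell
# Made-up sample line: "MAF=" contains "AF=" but is a different key.
line='MAF=0.1;AF=0.5;dw=234'

# Without the boundary, the "AF=" inside "MAF=" also matches.
printf '%s\n' "$line" | grep -o 'AF=[^;]*'
# prints:
# AF=0.1
# AF=0.5

# With \b, only the field whose key really starts with AF is reported.
printf '%s\n' "$line" | grep -o '\bAF=[^;]*'
# prints:
# AF=0.5
```

Note that \b only checks for a transition between word and non-word characters, so a key such as X-AF= would still match.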

CodePudding user response:

Grep is probably the best tool for this, but here is an awk version:

echo "test;AF=342435.34234;yes=3434" | awk -F'AF=' '{split($2,a,";");print FS a[1]}'
AF=342435.34234

It splits each line at the AF= tag, then takes the rest of the text up to the next ;
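
One caveat with the command above: a line that contains no AF= at all still prints a bare "AF=". A possible variant, sketched here under the assumption that fields are always separated by semicolons, splits on ; instead and prints only the fields that start with AF=. The function name extract_af_awk is hypothetical:

```shell
# Hypothetical helper: print the value of every AF field found on stdin.
extract_af_awk() {
  # Split each line on ";" and check every field.
  # For a field that begins with "AF=", print what follows the 3-character prefix.
  awk -F';' '{ for (i = 1; i <= NF; i++) if ($i ~ /^AF=/) print substr($i, 4) }'
}

echo "test;AF=342435.34234;yes=3434" | extract_af_awk
# prints:
# 342435.34234
```

Because the test is anchored with ^, a field such as MAF=0.1 is skipped, and lines without an AF field produce no output.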
