How can I select a string by reading a field name and use it to place it in a different field in bash

I have a huge text file (thousands of lines) containing chunks of information (previously filtered), which are separated by an empty line. Example with a few of them:

Name: CAR 8:1
Precursor type: [M] 
Formula: C15H28NO4
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M H] 
Formula: C69H131NO10
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M H] 
Formula: C69H131NO10
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

I would like to read the name contained in the "Name" field and add a new field named "Compound ID" after "Formula". Expected output:

Name: CAR 8:1
Precursor type: [M] 
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M H] 
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M H] 
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

So far I've tried:

#the script needs to receive the name of the file to be formatted
original_file=$1

#negative match to retain only the fields of interest, writing the result to a tmp file
grep -vE 'MZ|SMILES|TIME|KEY|CCS|MODE|CLASS|Comment' "$original_file" > formatted.txt

#getting Name index
index=($(grep -nE 'Name. ' MSDIAL-TandemMassSpectralAtlas-VS69-Pos_PQI.msp | cut -d ":" -f 1))

#getting the name 
name=($(grep -nE 'Name. ' MSDIAL-TandemMassSpectralAtlas-VS69-Pos_PQI.msp | cut -d " " -f 2,3,4))

#reading and adding the new field
for i in ${index[*]}; do
    for x in ${name[$i]}; do
        p=4
        pos=`expr $i + $p`
        sed -i "$pos i Compound ID: ${x}" formatted.txt
    done
done

The inner for loop is giving me problems: the names contain spaces, so each name is split into several array elements, and my strategy of matching the index number with the index of the names does not work.

I don't know whether there's a way to do it in bash or awk.
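The word splitting described in the question can be reproduced in isolation. The following is a minimal sketch, not part of the original script: the temporary sample file and the helper name count_words are assumptions for illustration. It shows that an unquoted $(...) turns two names into four words, while reading line by line keeps each name whole.

```shell
# Hypothetical two-record sample written to a temporary file (names taken from the question).
sample=$(mktemp)
printf '%s\n' 'Name: CAR 8:1' 'Formula: C15H28NO4' '' \
    'Name: AHexC (O-8:1)18:1;2O/17:0;O' 'Formula: C69H131NO10' > "$sample"

# Hypothetical helper that reports how many words it received.
count_words() { echo "$#"; }

# Unquoted $(...) is split on whitespace, just as in name=($(...)):
# the two names become four separate words.
word_count=$(count_words $(sed -n 's/^Name: //p' "$sample"))

# Reading line by line with IFS= and -r keeps each name intact.
line_count=0
while IFS= read -r line; do
    case $line in
        'Name: '*) line_count=$((line_count + 1)); last_name=${line#Name: } ;;
    esac
done < "$sample"

echo "words: $word_count, names: $line_count, last name: $last_name"
rm -f "$sample"
```

With this sample, the word count is 4 while the number of names is 2, which is why the positions in the index array and the name array drift apart.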

CodePudding user response：

If sed is an option, you can try this:

$ sed '/Name:/{p;s/[^:]*\(.*\)/Compound ID\1/;h;d};/Formula:/{G}' input_file
Name: CAR 8:1
Precursor type: [M] 
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M H] 
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M H] 
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

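/Name:/{p - Match lines containing Name: and print them immediately. The line is still in the pattern space, so there is now a second copy to work on.
s/[^:]*\(.*\)/Compound ID\1/;h;d} - In that copy, replace everything before the first colon (Name) with Compound ID, keeping the colon and the rest of the line. Then copy the result into the hold space (h) and delete the pattern space (d) so the modified line is not printed at this point.
/Formula:/{G} - Match lines containing Formula: and append the contents of the hold space (the Compound ID line) after them.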
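The same hold-space idea can be written with anchored patterns, so that only lines that start with Name: or Formula: are affected. This is a variant sketch rather than the command above: the function name add_compound_id and the temporary sample file are assumptions for illustration. Here h saves the Name line, s turns the pattern space into the Compound ID line, and x swaps the two so the original line is printed and the new one waits in the hold space.

```shell
# Hypothetical helper: print the file named by $1 with a "Compound ID:" line
# appended after each line that starts with "Formula:".
add_compound_id() {
    sed -e '/^Name:/{h;s/^Name:/Compound ID:/;x;}' -e '/^Formula:/G' "$1"
}

# Small sample (first record from the question) in a temporary file.
sample=$(mktemp)
printf '%s\n' 'Name: CAR 8:1' 'Precursor type: [M]' 'Formula: C15H28NO4' \
    'Num Peaks: 2' '85.02841 800' '286.2013 999' > "$sample"

result=$(add_compound_id "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Redirecting the output to a new file (for example add_compound_id input_file > output_file) leaves the original untouched, which may be preferable to editing a large file in place.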

CodePudding user response：

Assumptions:

all 'chunks' contain a Name: line and a Formula: line

One awk idea (gensub() is a GNU awk extension):

awk '
{ print $0                                       # print current line
  if ($1 == "Name:")
     compid=gensub(/Name:/,"Compound ID:",1)     # save current line, replacing "Name:" with "Compound ID:"
  if ($1 == "Formula:") 
     print compid                                # print "Compound ID:" line
}
' input.txt

This generates:

Name: CAR 8:1
Precursor type: [M] 
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M H] 
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M H] 
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
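If GNU awk is not available, or some chunks might lack a Name: line, a variant along these lines may help. It is only a sketch under those assumptions: the function name add_compound_id_awk and the sample data are hypothetical. It avoids gensub() by using substr(), and it forgets the saved name at each empty line so that a name from one chunk is never reused in the next.

```shell
# Hypothetical helper: same idea as the awk above, without gensub(),
# and with the saved name cleared at the end of every chunk.
add_compound_id_awk() {
    awk '
    { print }                                                 # print every line unchanged
    /^Name: /   { compid = "Compound ID: " substr($0, 7) }    # keep the text after "Name: "
    /^Formula:/ { if (compid != "") print compid }            # add the new field after Formula
    /^$/        { compid = "" }                               # empty line ends the chunk
    ' "$1"
}

# Sample with one complete chunk and one chunk that has no Name: line.
sample=$(mktemp)
printf '%s\n' 'Name: CAR 8:1' 'Formula: C15H28NO4' 'Num Peaks: 2' '' \
    'Formula: C69H131NO10' 'Num Peaks: 10' > "$sample"

result=$(add_compound_id_awk "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

In this sample only the first chunk gets a Compound ID: line; the second chunk is printed unchanged because it has no name to copy.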