I have a huge text file (thousands of lines) containing chunks of information (previously filtered), which are separated by an empty line. Example with a few of them:
Name: CAR 8:1
Precursor type: [M]
Formula: C15H28NO4
Num Peaks: 2
85.02841 800
286.2013 999
Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M H]
Formula: C69H131NO10
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M H]
Formula: C69H131NO10
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
I would like to read the name contained in the "Name" field and add a new field named "Compound ID" after "Formula". Expected output:
Name: CAR 8:1
Precursor type: [M]
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999
Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M H]
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M H]
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
So far I've tried:
#the script needs to receive the name of the file to be formatted
original_file=$1
#negative match to only retain fields of interest and creating a tmp file
cat $original_file | grep -vE 'MZ|SMILES|TIME|KEY|CCS|MODE|CLASS|Comment' > formatted.txt
#getting Name index
index=($(grep -nE 'Name. ' MSDIAL-TandemMassSpectralAtlas-VS69-Pos_PQI.msp | cut -d ":" -f 1))
#getting the name
name=($(grep -nE 'Name. ' MSDIAL-TandemMassSpectralAtlas-VS69-Pos_PQI.msp | cut -d " " -f 2,3,4))
#reading and adding the new field
for i in ${index[*]}; do
for x in ${name[$i]}; do
p=4
pos=`expr $i + $p`
sed -i "$pos i Compound ID: ${x}" formatted.txt
done
done
The inner for loop is giving me problems: the names contain spaces, so my strategy of matching the index number with the index of the names does not work.
I don't know whether there's a way to do it in bash or awk.
CodePudding user response:
If sed is an option, you can try this:
$ sed '/Name:/{p;s/[^:]*\(.*\)/Compound ID\1/;h;d};/Formula:/{G}' input_file
Name: CAR 8:1
Precursor type: [M]
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999
Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M H]
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M H]
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
/Name:/{p
- Match lines with Name: and print them. This creates a duplicate line.
s/[^:]*\(.*\)/Compound ID\1/;h;d}
- Transform the newly created duplicate line by replacing Name with Compound ID while retaining everything else, put it into hold space, then delete it.
/Formula:/{G}
- Match lines with Formula: then append the contents of the hold space.
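The steps described above can also be written one command per line, which may make the use of the hold space easier to follow. This is only a sketch: the function name add_compound_id is hypothetical, the patterns are anchored with ^ on the assumption that every field starts at the beginning of its line, and the substitution is simplified to s/^Name:/Compound ID:/ rather than the capture group used in the one-liner.

```shell
# Hypothetical helper: reads chunks from stdin (or from the files given as
# arguments) and adds a "Compound ID:" line after each "Formula:" line.
add_compound_id() {
  sed '
    /^Name:/{
      # print the Name line unchanged
      p
      # turn the copy left in pattern space into "Compound ID: <name>"
      s/^Name:/Compound ID:/
      # save it in hold space, then drop it so it is not printed here
      h
      d
    }
    /^Formula:/{
      # append the saved line after the Formula line
      G
    }
  ' "$@"
}
```

It could then be called as add_compound_id input_file > output_file. Writing to a new file rather than editing in place keeps the original available for comparison.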
CodePudding user response:
Assumptions:
- all 'chunks' contain a Name: line and a Formula: line
One awk idea:
awk '
{ print $0 # print current line
if ($1 == "Name:")
compid=gensub(/Name:/,"Compound ID:",1) # save current line, replacing "Name:" with "Compound ID:"
if ($1 == "Formula:")
print compid # print "Compound ID:" line
}
' input.txt
This generates:
Name: CAR 8:1
Precursor type: [M]
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999
Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M H]
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M H]
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
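Note that gensub() is a GNU awk extension. If gawk is not available, a variant along the following lines might work with other awk implementations, since it relies only on the standard sub() function. It is a sketch under a few assumptions: the function name add_compound_id_awk is hypothetical, field names start at the beginning of the line, and chunks are separated by empty lines as described in the question. It also relaxes the assumption above by forgetting the saved name at each empty line, so a chunk without a Name: line gets no Compound ID: line instead of inheriting the previous one.

```shell
# Hypothetical helper: same transformation as the gawk version, using only
# POSIX awk features.
add_compound_id_awk() {
  awk '
    { print }                                  # print every line unchanged
    /^Name:/ {
      compid = $0                              # copy the Name line
      sub(/^Name:/, "Compound ID:", compid)    # rename the field in the copy
    }
    /^Formula:/ && compid != "" {
      print compid                             # emit the ID right after Formula
    }
    NF == 0 { compid = "" }                    # empty line: chunk ended, forget the name
  ' "$@"
}
```

Usage would be add_compound_id_awk input.txt > output.txt. Because the name is kept as a whole line rather than split into array elements, spaces inside names are preserved.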