Add characters to a column if the row starts with a specific character, and do this for odd and even

Time:07-18

I have multiple alignment format (MAF) files that look like this:

##maf version=1
a       score=-1274
s       Chr10                                            34972197            2927               190919061         AACCTTGGGG
s       Chr11                                            36777315            2442               244384623         AACCTTGGGG

a       score=-60687
s       Chr1                                             81897274           61972               159217232          CGTTTTCCCGG
s       Chr1                                             33997294           32248               200980605          CGTTTTCCCGG

I would like to modify the second column of these files for lines that start with "s", to have something like this:

##maf version=1
a       score=-1274
s       species1.Chr10                                            34972197            2927               190919061         AACCTTGGGG
s       species2.Chr11                                            36777315            2442               244384623         AACCTTGGGG

a       score=-60687
s       species1.Chr1                                             81897274           61972               159217232          CGTTTTCCCGG
s       species2.Chr1                                             33997294           32248               200980605          CGTTTTCCCGG   

Using this idea https://unix.stackexchange.com/questions/154220/adding-a-character-to-every-other-text-line

I am trying things like this:

awk '$1 == "s" {print ((NR%2)? "species1.":"") $0}'

But I am still far from reaching my objective. Do you know how I could achieve this?

CodePudding user response:

Assumptions:

  • distance between fields is to be maintained

One awk idea:

awk '
!/^s/ { print; sfx=0 }                   # line does not start with "s": print it unchanged and reset the sfx counter
 /^s/ { n=split($0,a,FS,seps)            # line starts with "s": split it, saving each separator in seps[] (4-argument split() requires GNU awk)
        a[2]="species" (++sfx) "." a[2]  # increment the counter and add the prefix to the 2nd field
        for (i=1;i<=n;i++)               # loop through all field/separator pairs
            printf "%s%s", a[i], seps[i] # print each field followed by its original separator
        print ""                         # terminate line
      }
' maf.dat

This generates:

##maf version=1
a       score=-1274
s       species1.Chr10                                            34972197            2927               190919061         AACCTTGGGG
s       species2.Chr11                                            36777315            2442               244384623         AACCTTGGGG

a       score=-60687
s       species1.Chr1                                             81897274           61972               159217232          CGTTTTCCCGG
s       species2.Chr1                                             33997294           32248               200980605          CGTTTTCCCGG
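
If GNU awk is not available, the same per-block counter can probably be done in POSIX awk by letting sub() insert the prefix right after the leading "s" and its whitespace, which leaves all other spacing untouched. This is only a minimal sketch and not the method used above: the file name maf_sample.maf, the shortened sample rows, and the prefix_species helper are assumptions made for illustration.

```shell
# Hypothetical sample file (name and shortened columns are assumptions for this sketch).
printf '%s\n' \
  '##maf version=1' \
  'a       score=-1274' \
  's       Chr10      34972197   2927  190919061  AACCTTGGGG' \
  's       Chr11      36777315   2442  244384623  AACCTTGGGG' \
  '' \
  'a       score=-60687' \
  's       Chr1       81897274  61972  159217232  CGTTTTCCCGG' \
  's       Chr1       33997294  32248  200980605  CGTTTTCCCGG' > maf_sample.maf

# prefix_species is a hypothetical helper name.
# On "s" lines: bump the per-block counter n and insert "species<n>." right
# after the leading "s" and its whitespace ("&" stands for the matched text,
# so the original spacing is kept).
# On any other line: reset n so each alignment block starts again at species1.
prefix_species() {
  awk '
    /^s[ \t]/ { sub(/^s[ \t]+/, "&species" (++n) "."); print; next }
              { n = 0; print }
  ' "$1"
}

prefix_species maf_sample.maf
```

Because the counter is reset on every non-"s" line, the numbering does not depend on whether a sequence line falls on an odd or even line number.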

CodePudding user response:

Perl to the rescue!

perl -pe 'if (/^s/) { s/Chr/species$x.Chr/; $x++ } else { $x = 1 }' file.maf
  • -p reads the input line by line and outputs each line after processing;
  • If the line starts with s, it prefixes Chr with species and the current number stored in $x, incrementing it;
  • Otherwise, it sets $x to 1.

CodePudding user response:

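 awk '
    { out = $0 }
    # NR parity picks the prefix, so this relies on "s" lines alternating between odd and even line numbers
    /^s / { s = "species" ((NR % 2) ? 1 : 2) "." $2; sub($2, s, out) }
    { print out }
' file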

##maf version=1
a       score=-1274
s       species1.Chr10                                            34972197            2927               190919061         AACCTTGGGG
s       species2.Chr11                                            36777315            2442               244384623         AACCTTGGGG

a       score=-60687
s       species1.Chr1                                             81897274           61972               159217232         CGTTTTCCCGG
s       species2.Chr1                                             33997294           32248               200980605         CGTTTTCCCGG 