Removing everything after the second "_" but keeping other columns

Time:07-21

I'm trying to format the family IDs on a fam file whose sample and family IDs are the same, and coded in the following way:

Continent_Breed_Ind-ID

The idea is to transform column 1 so that it contains only the continent and breed, while keeping the other columns unchanged.

Mock dataset:

Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2_Ind-ID2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1_Ind-ID1 Continent2_Breed1_Ind-ID1 0 0 0 -9

Desired outcome:

Continent1_Breed1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1 Continent2_Breed1_Ind-ID1 0 0 0 -9

I have tried using sed as follows:

sed -r 's/_[^_]*//2g' file.fam

But that only gives me the first column.

Any ideas?

CodePudding user response:

You may use this simple sed command:

sed 's/_[^_]* / /' file

Continent1_Breed1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1 Continent2_Breed1_Ind-ID1 0 0 0 -9


Here:

  • "_[^_]* " (note the trailing space): Matches _ followed by 0 or more non-_ characters, followed by a space
  • We replace this match with a single space to restore the separator between the first and second columns

PS: Note that there is no global flag used here, so only the first match on each line is replaced.
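To see why leaving out the global flag matters, here is a small sketch that runs the command with and without g on one line of the mock dataset. The variable names are only illustrative.

```shell
# One line of the mock dataset from the question.
line='Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9'

# No g flag: only the first "_..." run that ends at a space is replaced,
# which is the Ind-ID part of column 1.
first_only=$(printf '%s\n' "$line" | sed 's/_[^_]* / /')

# With g: the pattern matches again further right and also eats into
# column 2 and the numeric columns, because [^_]* can match spaces.
with_g=$(printf '%s\n' "$line" | sed 's/_[^_]* / /g')

printf 'without g: %s\n' "$first_only"
printf 'with g:    %s\n' "$with_g"
```

The second result loses data from the later columns, which is the same kind of over-matching that made the 2g attempt in the question collapse the line.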

CodePudding user response:

1st solution: With your shown samples, please try the following sed command. The -E option enables ERE (extended regular expressions).

sed -E 's/^([^_]*)(_[^_]*)_[^[:space:]]+(.*$)/\1\2\3/' Input_file


2nd solution: With GNU awk, using its match function with capture-group support (the third, array argument), try the following:

awk 'match($0,/^([^_]*)(_[^_]*)_[^[:space:]]+(.*$)/,arr){print arr[1] arr[2] arr[3]}' Input_file
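As a rough illustration of what the three capture groups hold, the sketch below wraps each group in square brackets instead of joining them. It assumes the third element of the pattern is [^[:space:]]+ (one or more non-space characters); the brackets and the variable names are only for display.

```shell
# One line of the mock dataset from the question.
line='Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9'

# Same regex as the 1st solution, but the replacement brackets each group
# so the boundaries are visible. The unbracketed Ind-ID part is dropped.
groups=$(printf '%s\n' "$line" |
  sed -E 's/^([^_]*)(_[^_]*)_[^[:space:]]+(.*$)/[\1][\2][\3]/')

printf '%s\n' "$groups"
```

Group 3 starts with the space that separates columns 1 and 2, which is why \1\2\3 can be concatenated without adding a separator.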

CodePudding user response:

gawk 'sub("_[^_]+$",_,$!_)_'
mawk 'sub("_[^_]+ "," ")_'
Continent1_Breed1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1 Continent2_Breed1_Ind-ID1 0 0 0 -9
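The one-liners above are deliberately terse. As a hedged reading, assuming the quantifier in the pattern is + (one or more non-_ characters): _ is an unset variable, so it acts as an empty string and !_ is 1, making $!_ the first field; the trailing _ concatenates the result of sub with an empty string, which yields a non-empty string and therefore an always-true pattern. A more conventional spelling of the first command might look like this:

```shell
# One line of the mock dataset from the question.
line='Continent2_Breed1_Ind-ID1 Continent2_Breed1_Ind-ID1 0 0 0 -9'

# Strip the last "_..." run from field 1 only, then print every record.
readable=$(printf '%s\n' "$line" |
  awk '{ sub(/_[^_]+$/, "", $1); print }')

printf '%s\n' "$readable"
```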

CodePudding user response:

You can use

awk '{sub(/_[^_]*$/, "", $1)}1' file > newfile
sed 's/^\([^_ ]*_[^_ ]*\)_[^_ ]*/\1/' file > newfile


Details:

  • The awk solution uses sub(/_[^_]*$/, "", $1) to remove, in the first field only, the last _ together with the zero or more non-_ chars that follow it up to the end of the field; the trailing 1 is an always-true pattern that prints the (modified) record
  • The sed solution finds:
    • ^ - start of string
    • \([^_ ]*_[^_ ]*\) - Group 1 (\1 in RHS refers to this value): zero or more chars other than space and _, an underscore, and then again zero or more chars other than space and _
    • _ - an underscore
    • [^_ ]* - zero or more chars other than space and _.

And the match is replaced with Group 1 value.
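A quick way to check that both commands agree is to run them on one line of the mock dataset and compare the results. This is only a sketch; the variable names are illustrative.

```shell
# One line of the mock dataset from the question.
line='Continent1_Breed2_Ind-ID2 Continent1_Breed2_Ind-ID1 0 0 0 -0'

# awk: edit field 1 only; the trailing 1 prints the rebuilt record.
awk_out=$(printf '%s\n' "$line" | awk '{sub(/_[^_]*$/, "", $1)}1')

# sed: keep "Continent_Breed" at the start of the line, drop the third part.
sed_out=$(printf '%s\n' "$line" | sed 's/^\([^_ ]*_[^_ ]*\)_[^_ ]*/\1/')

printf 'awk: %s\nsed: %s\n' "$awk_out" "$sed_out"
```

One difference to keep in mind: because awk assigns to $1, it rebuilds the record and joins the fields with single spaces (the default OFS), whereas sed leaves the rest of the line untouched.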

CodePudding user response:

This might work for you (GNU sed):

sed 's/_/\n/2;s/\n\S*//' file

Replace the second _ by a newline and then remove the newline and any non-white space following it.
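The two steps can be looked at separately. The sketch below assumes GNU sed, since \n in the replacement and \S are GNU extensions; the variable names are illustrative.

```shell
# One line of the mock dataset from the question.
line='Continent2_Breed1_Ind-ID1 Continent2_Breed1_Ind-ID1 0 0 0 -9'

# Step 1 only: the second _ becomes a newline, so the line is split in two.
step1=$(printf '%s\n' "$line" | sed 's/_/\n/2')

# Both steps: the newline and the non-space run after it are removed,
# leaving the space that separates columns 1 and 2 in place.
final=$(printf '%s\n' "$line" | sed 's/_/\n/2;s/\n\S*//')

printf 'after step 1:\n%s\n\nfinal:\n%s\n' "$step1" "$final"
```

The newline works as a marker because it cannot otherwise occur inside a line that sed has just read.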
