I'm trying to format the family IDs on a fam file whose sample and family IDs are the same, and coded in the following way:
Continent_Breed_Ind-ID
The idea is to transform column 1 so that it contains only Continent_Breed, while keeping the other columns unchanged.
Mock dataset:
Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2_Ind-ID2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1_Ind-ID1 Continent2_Breed1_Ind-ID1 0 0 0 -9
Desired outcome:
Continent1_Breed1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1 Continent2_Breed1_Ind-ID1 0 0 0 -9
I have tried using sed as follows:
sed -r 's/_[^_]*//2g' file.fam
But that only gives me the first column.
Any ideas?
CodePudding user response:
You may use this simple sed command:
sed 's/_[^_]* / /' file
Continent1_Breed1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1 Continent2_Breed1_Ind-ID1 0 0 0 -9
Here:
- _[^_]* followed by a space: matches _ followed by 0 or more non-_ characters, followed by a space
- We replace this match with a space to get the space between the first and second columns back
PS: Note that there is no global flag used here.
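To check this against the mock dataset from the question, a minimal sketch follows. The variable names mock and result are only illustrative, and the data is piped in with printf instead of being read from a file.

```shell
# Mock dataset from the question, one record per line (space-separated).
mock='Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2_Ind-ID2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1_Ind-ID1 Continent2_Breed1_Ind-ID1 0 0 0 -9'

# Without the g flag, only the first match on each line is replaced:
# the underscore-led chunk that ends in a space, i.e. the _Ind-ID part
# of column 1. Column 2 is left alone.
result=$(printf '%s\n' "$mock" | sed 's/_[^_]* / /')
printf '%s\n' "$result"
```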
CodePudding user response:
1st solution: With your shown samples, please try the following sed command. It uses the -E option to enable ERE (extended regular expressions).
sed -E 's/^([^_]*)(_[^_]*)_[^[:space:]]+(.*$)/\1\2\3/' Input_file
2nd solution: With GNU awk, using its match function with capturing-group support, try the following:
awk 'match($0,/^([^_]*)(_[^_]*)_[^[:space:]]+(.*$)/,arr){print arr[1] arr[2] arr[3]}' Input_file
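A minimal sketch of the capture-group approach on one record from the question. It assumes a + quantifier after the [^[:space:]] bracket expression, so that the whole Ind-ID part of column 1 is consumed; the variable names line and result are illustrative.

```shell
# One record from the mock dataset in the question.
line='Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9'

# \1 = "Continent1", \2 = "_Breed1"; the third underscore-led chunk is
# matched but not captured, and \3 is the rest of the line, starting with
# the space that separates columns 1 and 2.
result=$(printf '%s\n' "$line" |
  sed -E 's/^([^_]*)(_[^_]*)_[^[:space:]]+(.*$)/\1\2\3/')
printf '%s\n' "$result"
```

Because \3 begins at the first whitespace character, the separator between the columns is preserved without adding a literal space to the replacement.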
CodePudding user response:
gawk 'sub("_[^_]+$",_,$!_)_'
mawk 'sub("_[^_]+ "," ")_'
Continent1_Breed1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1 Continent2_Breed1_Ind-ID1 0 0 0 -9
CodePudding user response:
You can use
awk '{sub(/_[^_]*$/, "", $1)}1' file > newfile
sed 's/^\([^_ ]*_[^_ ]*\)_[^_ ]*/\1/' file > newfile
See the online demo #1 and demo #2.
Details:
- The awk solution finds and removes the first occurrence of a _ char and then zero or more chars other than _ till end of string (with sub(/_[^_]*$/, "", $1)) in the first field, and 1 prints the result
- The sed solution finds:
  - ^ - start of string
  - \([^_ ]*_[^_ ]*\) - Group 1 (\1 in RHS refers to this value): zero or more chars other than space and _, an underscore, and then again zero or more chars other than space and _
  - _ - an underscore
  - [^_ ]* - zero or more chars other than space and _.
And the match is replaced with Group 1 value.
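As a quick check that the two commands agree, here is a sketch that runs both on the mock dataset from the question. The variable names mock, awk_out and sed_out are illustrative, and the input is assumed to be space-separated.

```shell
# Mock dataset from the question, one record per line (space-separated).
mock='Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2_Ind-ID2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1_Ind-ID1 Continent2_Breed1_Ind-ID1 0 0 0 -9'

# awk: strip the last underscore-led chunk from field 1 only, then print.
awk_out=$(printf '%s\n' "$mock" | awk '{sub(/_[^_]*$/, "", $1)}1')

# sed: keep the first two underscore-separated parts (Group 1) and drop
# the third part of the first column.
sed_out=$(printf '%s\n' "$mock" | sed 's/^\([^_ ]*_[^_ ]*\)_[^_ ]*/\1/')

printf '%s\n' "$awk_out"
[ "$awk_out" = "$sed_out" ] && echo "awk and sed agree"
```

One difference to keep in mind: assigning to $1 makes awk rebuild the record with its output field separator (a single space by default), so a tab-delimited file would come out space-delimited unless FS and OFS are set. The sed command leaves the rest of the line untouched.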
CodePudding user response:
This might work for you (GNU sed):
sed 's/_/\n/2;s/\n\S*//' file
Replace the second _ with a newline, then remove the newline and any non-whitespace following it.
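The same "act on the second underscore" idea can also be sketched without GNU extensions by using the numeric occurrence flag on its own. This is a variant, not the command above: it assumes columns are separated by spaces, and it excludes the space from the bracket expression so that a match cannot run into column 2. The variable names mock and result are illustrative.

```shell
# Mock dataset from the question, one record per line (space-separated).
mock='Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2_Ind-ID2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1_Ind-ID1 Continent2_Breed1_Ind-ID1 0 0 0 -9'

# The trailing 2 makes sed act on the second match only: the first match
# is _BreedN, the second is the _Ind-ID chunk of column 1, which is deleted.
result=$(printf '%s\n' "$mock" | sed 's/_[^_ ]*//2')
printf '%s\n' "$result"
```

Compared with the attempt in the question, the two changes are the space inside the bracket expression and dropping the g flag, so later matches in column 2 are not removed.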