I have a file that contains sample IDs. I want to generate a sample-to-participant lookup table with two tab-separated columns: the first column should hold the sample ID (e.g. GTEX-1117F-0226-SM-5GZZ7) and the second the participant ID (GTEX-1117F). I was able to get the IDs for the first column from the file:
grep "GTEX" gene_tpm_2017-06-05_v8_brain_cortex.gct | awk '{$1=$2=$3=$4=""; printf $0 }' | xargs -n1 > ids_bed.txt
Now my ids_bed.txt file looks like this:
GTEX-1117F-3226-SM-5N9CT
GTEX-111FC-3126-SM-5GZZ2
GTEX-1128S-2726-SM-5H12C
GTEX-117XS-3026-SM-5N9CA
GTEX-1192X-3126-SM-5N9BY
GTEX-11DXW-1126-SM-5H12Q
I want to add the participant ID (GTEX-1117F and so on) as the second column. I tried this:
sed -re 's/(GTEX-[[:alnum:]]+)_\1/\1/g' ids_bed.txt > ids_bed_1.txt
but it doesn't generate the second column. I want my final file to look like this, with the two columns separated by a tab:
GTEX-1117F-3226-SM-5N9CT GTEX-1117F
GTEX-111FC-3126-SM-5GZZ2 GTEX-111FC
CodePudding user response:
If the samples shown last are the final output you need, then the following awk should work.
awk 'BEGIN{FS=OFS="-"} {print $0 "\t" $1,$2}' Input_file
Explanation: FS and OFS are set to - in the BEGIN section. The main program prints the current line followed by a tab, then the 1st field, OFS, and the 2nd field.
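To make the field split concrete, here is a minimal sketch run on one sample ID from the question. The helper name show_fields is hypothetical and exists only for this illustration; it prints the number of fields awk sees and then the first two fields rejoined with OFS.

```shell
# show_fields is a hypothetical helper, not part of the answer above.
# With FS set to "-", awk splits each ID into five fields;
# "print $1, $2" rejoins the first two using OFS, which is also "-".
show_fields() {
  awk 'BEGIN { FS = OFS = "-" } { print NF; print $1, $2 }' "$@"
}

printf '%s\n' 'GTEX-1117F-3226-SM-5N9CT' | show_fields
# prints:
# 5
# GTEX-1117F
```

Because the participant ID is always the first two dash-separated fields in the sample data, no regular expression is needed for this approach.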
CodePudding user response:
I would use GNU sed for this task in the following way. Let the content of file.txt be
GTEX-1117F-3226-SM-5N9CT
GTEX-111FC-3126-SM-5GZZ2
GTEX-1128S-2726-SM-5H12C
GTEX-117XS-3026-SM-5N9CA
GTEX-1192X-3126-SM-5N9BY
GTEX-11DXW-1126-SM-5H12Q
then
sed 's/\(GTEX-[^-]*\)\(.*\)/\1\2\t\1/' file.txt
gives output
GTEX-1117F-3226-SM-5N9CT GTEX-1117F
GTEX-111FC-3126-SM-5GZZ2 GTEX-111FC
GTEX-1128S-2726-SM-5H12C GTEX-1128S
GTEX-117XS-3026-SM-5N9CA GTEX-117XS
GTEX-1192X-3126-SM-5N9BY GTEX-1192X
GTEX-11DXW-1126-SM-5H12Q GTEX-11DXW
Explanation: I use 2 capturing groups, one for GTEX-(anything but -) and one for the rest of the line. I replace the whole line with \1\2, which is the whole line, then a TAB, then the content of the 1st group.
(tested in GNU sed 4.7)
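The \t escape in the replacement is a GNU extension, and other sed implementations (BSD/macOS sed, for example) may not turn it into a tab. A minimal sketch of a more portable variant, assuming a POSIX shell, passes a literal tab into the sed script instead. The function name add_participant_column is hypothetical.

```shell
# add_participant_column is a hypothetical name used only for this sketch.
add_participant_column() {
  # Literal tab character; command substitution strips only trailing newlines.
  tab=$(printf '\t')
  # & is the whole matched line, \1 is the "GTEX-<participant>" prefix.
  sed "s/^\(GTEX-[^-]*\).*/&${tab}\1/" "$@"
}

printf '%s\n' 'GTEX-1117F-3226-SM-5N9CT' 'GTEX-111FC-3126-SM-5GZZ2' |
  add_participant_column
```

With no arguments the function reads standard input; with a file name (e.g. file.txt) it reads that file, just like the sed command above.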
CodePudding user response:
Using sed
$ sed -E 's/(.*)([^-]*-){3}.*/&\t\1/' input_file
Using awk
$ awk -F'-' '{s=$1FS$2;$0=$0"\t"s}1' OFS="-" input_file
Output
GTEX-1117F-3226-SM-5N9CT GTEX-1117F
GTEX-111FC-3126-SM-5GZZ2 GTEX-111FC
GTEX-1128S-2726-SM-5H12C GTEX-1128S
GTEX-117XS-3026-SM-5N9CA GTEX-117XS
GTEX-1192X-3126-SM-5N9BY GTEX-1192X
GTEX-11DXW-1126-SM-5H12Q GTEX-11DXW
CodePudding user response:
$ awk -F'-' -v OFS='\t' '{print $0, $1 FS $2}' ids_bed.txt
GTEX-1117F-3226-SM-5N9CT GTEX-1117F
GTEX-111FC-3126-SM-5GZZ2 GTEX-111FC
GTEX-1128S-2726-SM-5H12C GTEX-1128S
GTEX-117XS-3026-SM-5N9CA GTEX-117XS
GTEX-1192X-3126-SM-5N9BY GTEX-1192X
GTEX-11DXW-1126-SM-5H12Q GTEX-11DXW
CodePudding user response:
Another option with awk, using your pattern to match GTEX- followed by 1 or more alphanumeric chars at the start of the string.
If there is a match, print the whole line plus the match, separated by a tab.
awk -v OFS='\t' 'match($0, /^GTEX-[[:alnum:]]+/) {
  print $0, substr($0, RSTART, RLENGTH)
}' file
Output
GTEX-1117F-3226-SM-5N9CT GTEX-1117F
GTEX-111FC-3126-SM-5GZZ2 GTEX-111FC
GTEX-1128S-2726-SM-5H12C GTEX-1128S
GTEX-117XS-3026-SM-5N9CA GTEX-117XS
GTEX-1192X-3126-SM-5N9BY GTEX-1192X
GTEX-11DXW-1126-SM-5H12Q GTEX-11DXW
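Since spaces and tabs look alike on screen, it may be worth checking the result before using it. The following is only a sketch under the assumption that each line should contain exactly two tab-separated fields, with the second field plus a dash forming the prefix of the first. The function name check_lookup is hypothetical.

```shell
# check_lookup is a hypothetical helper: it prints every offending line
# and exits non-zero if any line is not "<sample ID><TAB><participant ID>".
check_lookup() {
  awk -F'\t' '
    NF != 2 || index($1, $2 "-") != 1 { bad++; print "line " NR ": " $0 }
    END { exit (bad > 0) }
  ' "$@"
}

# A correctly tab-separated line passes silently.
printf 'GTEX-1117F-3226-SM-5N9CT\tGTEX-1117F\n' | check_lookup && echo "ok"
```

A line separated by a space instead of a tab is reported as a single field, so the check fails for it.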