How to edit the sample_ids using sed

I have a file that contains sample IDs. I want to generate a sample-participant lookup table with two columns separated by a tab: the full sample ID in the first column and the participant ID in the second, for example GTEX-1117F-0226-SM-5GZZ7 GTEX-1117F. I was able to get the first column of IDs from the file:

grep "GTEX" gene_tpm_2017-06-05_v8_brain_cortex.gct | awk '{$1=$2=$3=$4=""; printf $0 }' | xargs -n1 > ids_bed.txt

Now my ids_bed.txt file looks like this:

GTEX-1117F-3226-SM-5N9CT
GTEX-111FC-3126-SM-5GZZ2
GTEX-1128S-2726-SM-5H12C
GTEX-117XS-3026-SM-5N9CA
GTEX-1192X-3126-SM-5N9BY
GTEX-11DXW-1126-SM-5H12Q

I want to add GTEX-1117F as the second column, and so on for each line. I tried this:

sed -re 's/(GTEX-[[:alnum:]]+)_\1/\1/g' ids_bed.txt > ids_bed_1.txt

but it doesn't generate the second column. I want my final file to look like this, with the two columns separated by a tab:

GTEX-1117F-3226-SM-5N9CT GTEX-1117F
GTEX-111FC-3126-SM-5GZZ2 GTEX-111FC

CodePudding user response：

If the last samples you showed are the final output you need, the following awk should work.

awk 'BEGIN{FS=OFS="-"} {print $0" "$1,$2}'  Input_file

Explanation: FS and OFS are both set to - in the BEGIN section. The main block prints the current line followed by a space, then the 1st field, OFS, and the 2nd field.
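
The command above separates the columns with a space. If a tab is needed, as the question states, the same FS/OFS idea can be sketched as follows. The file names demo_ids.txt and lookup.tsv are hypothetical and only used for illustration.

```shell
# Hypothetical demo input: one sample ID per line, as in the question.
printf '%s\n' GTEX-1117F-3226-SM-5N9CT GTEX-111FC-3126-SM-5GZZ2 > demo_ids.txt

# Split on "-"; print the whole line, a tab, then fields 1 and 2 rejoined by OFS ("-").
awk 'BEGIN{FS=OFS="-"} {print $0 "\t" $1, $2}' demo_ids.txt > lookup.tsv
```

Here $0 "\t" $1 is a single concatenated argument to print, so only the comma before $2 inserts OFS.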

CodePudding user response：

I would use GNU sed for this task in the following way. Let the content of file.txt be

GTEX-1117F-3226-SM-5N9CT
GTEX-111FC-3126-SM-5GZZ2
GTEX-1128S-2726-SM-5H12C
GTEX-117XS-3026-SM-5N9CA
GTEX-1192X-3126-SM-5N9BY
GTEX-11DXW-1126-SM-5H12Q

then

sed 's/\(GTEX-[^-]*\)\(.*\)/\1\2\t\1/' file.txt

gives output

GTEX-1117F-3226-SM-5N9CT    GTEX-1117F
GTEX-111FC-3126-SM-5GZZ2    GTEX-111FC
GTEX-1128S-2726-SM-5H12C    GTEX-1128S
GTEX-117XS-3026-SM-5N9CA    GTEX-117XS
GTEX-1192X-3126-SM-5N9BY    GTEX-1192X
GTEX-11DXW-1126-SM-5H12Q    GTEX-11DXW

Explanation: I use 2 capturing groups, one for GTEX-(anything but -) and one for the rest of the line. I replace the whole line with \1\2, which is the whole line, then a TAB, then the content of the 1st group.

(tested in GNU sed 4.7)
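
Note that \t in the replacement is a GNU sed extension, so other sed implementations may print a literal t instead. A minimal sketch, assuming a POSIX shell, that passes a real tab character to sed (demo_ids.txt and lookup_sed.tsv are hypothetical file names):

```shell
# Hypothetical demo input: one sample ID per line.
printf '%s\n' GTEX-1128S-2726-SM-5H12C GTEX-117XS-3026-SM-5N9CA > demo_ids.txt

# Keep a literal tab in a shell variable so sed never has to interpret "\t".
tab=$(printf '\t')

# & is the whole matched line; \1 is the GTEX-<participant> prefix.
sed "s/^\(GTEX-[^-]*\).*/&${tab}\1/" demo_ids.txt > lookup_sed.tsv
```

The double quotes let the shell expand ${tab} while leaving \( \) and \1 untouched for sed.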

CodePudding user response：

Using sed

$ sed -E 's/(.*)([^-]*-){3}.*/&\t\1/' input_file

Using awk

$ awk -F'-' '{s=$1FS$2;$0=$0"\t"s}1' OFS="-" input_file

Output

GTEX-1117F-3226-SM-5N9CT        GTEX-1117F
GTEX-111FC-3126-SM-5GZZ2        GTEX-111FC
GTEX-1128S-2726-SM-5H12C        GTEX-1128S
GTEX-117XS-3026-SM-5N9CA        GTEX-117XS
GTEX-1192X-3126-SM-5N9BY        GTEX-1192X
GTEX-11DXW-1126-SM-5H12Q        GTEX-11DXW

CodePudding user response：

$ awk -F'-' -v OFS='\t' '{print $0, $1 FS $2}' ids_bed.txt
GTEX-1117F-3226-SM-5N9CT        GTEX-1117F
GTEX-111FC-3126-SM-5GZZ2        GTEX-111FC
GTEX-1128S-2726-SM-5H12C        GTEX-1128S
GTEX-117XS-3026-SM-5N9CA        GTEX-117XS
GTEX-1192X-3126-SM-5N9BY        GTEX-1192X
GTEX-11DXW-1126-SM-5H12Q        GTEX-11DXW

CodePudding user response：

Another option with awk, using your pattern to match GTEX- and 1 or more alphanumeric chars from the start of the string.

If there is a match, print the whole line plus the match.

awk 'match($0, /^GTEX-[[:alnum:]]+/) {
  print $0, substr($0, RSTART, RLENGTH)
}' file

Output

GTEX-1117F-3226-SM-5N9CT GTEX-1117F
GTEX-111FC-3126-SM-5GZZ2 GTEX-111FC
GTEX-1128S-2726-SM-5H12C GTEX-1128S
GTEX-117XS-3026-SM-5N9CA GTEX-117XS
GTEX-1192X-3126-SM-5N9BY GTEX-1192X
GTEX-11DXW-1126-SM-5H12Q GTEX-11DXW
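
The output above is space-separated because print joins its arguments with awk's default OFS. If the lookup table must be tab-separated, a minimal sketch of the same match() approach with OFS set to a tab could look like this (demo_ids.txt and lookup_match.tsv are hypothetical file names):

```shell
# Hypothetical demo input: one sample ID per line.
printf '%s\n' GTEX-1192X-3126-SM-5N9BY GTEX-11DXW-1126-SM-5H12Q > demo_ids.txt

# OFS='\t' makes the comma in print emit a tab between the two columns.
# Lines that do not start with GTEX-<alphanumerics> are skipped.
awk -v OFS='\t' 'match($0, /^GTEX-[[:alnum:]]+/) {
  print $0, substr($0, RSTART, RLENGTH)
}' demo_ids.txt > lookup_match.tsv
```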