Using awk to only include column beginning with specific value?-CodePudding

I have a file with 7 columns with 192 million lines. I want to filter the file so it only has data beginning with chr1_ and chr7_ in the second column.

head file.txt.gz
gene_id variant_id  tss_distance    ma_samples  ma_count    maf pval_nominal    slope   slope_se
ENSG00000227232.5   chr1_13550_G_A_b38  -16003  16  16  0.0132231   0.329834    0.188778    0.193552
ENSG00000227232.5   chr1_14671_G_C_b38  -14882  12  12  0.00991736  0.618791    0.110828    0.222611
ENSG00000227232.5   chr2_14677_G_A_b38  -14876  60  60  0.0495868   0.378305    -0.090737   0.102905
ENSG00000227232.5   chr3_16841_G_T_b38  -12712  46  46  0.0380165   0.100419    -0.191008   0.116067
ENSG00000227232.5   chrX_16856_A_G_b38  -12697  10  10  0.00826446  0.708684    -0.0901965  0.241282
ENSG00000227232.5   chrX_17005_A_G_b38  -12548  18  18  0.014876    0.153674    -0.257458   0.180205
ENSG00000227232.5   chr4_17005_A_G_b38  -12548  18  18  0.014876    0.153674    -0.257458   0.180205
ENSG00000227232.5   chr7_17005_A_G_b38  -12548  18  18  0.014876    0.153674    -0.257458   0.180205

output:

head file.txt.gz
gene_id variant_id  tss_distance    ma_samples  ma_count    maf pval_nominal    slope   slope_se
ENSG00000227232.5   chr1_13550_G_A_b38  -16003  16  16  0.0132231   0.329834    0.188778    0.193552
ENSG00000227232.5   chr1_14671_G_C_b38  -14882  12  12  0.00991736  0.618791    0.110828    0.222611
ENSG00000227232.5   chr7_17005_A_G_b38  -12548  18  18  0.014876    0.153674    -0.257458   0.180205

The second column has data in the format chrnumber _number_letter_letter_b38. The number and letters can be different. E.g chr4_17005_A_G_b38 or ch7_17090_A_T_b38. I just want the second column to begin with chr1_ or chr7_. How would I do this using awk?

I have tired

gunzip -c file.txt.gz | awk '$2 ~ /^chr1/' > output.txt

However the output also contains chr19 and chr10. Everything with 1. I am also unsure how to include chr7.

CodePudding user response：

You may use:

gunzip -c file.txt.gz | awk '$2 ~ /^chr[17]_/' > output.txt

^chr[17]_ will match chr1_ or chr7_ right after start position. By adding _ we make sure that we don't match chr10 or chr75.

CodePudding user response：

For checking if text starts with another text you might use index function, let file.txt content be

gene_id variant_id  tss_distance    ma_samples  ma_count    maf pval_nominal    slope   slope_se
ENSG00000227232.5   chr1_13550_G_A_b38  -16003  16  16  0.0132231   0.329834    0.188778    0.193552
ENSG00000227232.5   chr1_14671_G_C_b38  -14882  12  12  0.00991736  0.618791    0.110828    0.222611
ENSG00000227232.5   chr2_14677_G_A_b38  -14876  60  60  0.0495868   0.378305    -0.090737   0.102905
ENSG00000227232.5   chr3_16841_G_T_b38  -12712  46  46  0.0380165   0.100419    -0.191008   0.116067
ENSG00000227232.5   chrX_16856_A_G_b38  -12697  10  10  0.00826446  0.708684    -0.0901965  0.241282
ENSG00000227232.5   chrX_17005_A_G_b38  -12548  18  18  0.014876    0.153674    -0.257458   0.180205
ENSG00000227232.5   chr4_17005_A_G_b38  -12548  18  18  0.014876    0.153674    -0.257458   0.180205
ENSG00000227232.5   chr7_17005_A_G_b38  -12548  18  18  0.014876    0.153674    -0.257458   0.180205

then

awk 'index($2,"chr1_")==1||index($2,"chr7_")==1' file.txt

gives output

ENSG00000227232.5   chr1_13550_G_A_b38  -16003  16  16  0.0132231   0.329834    0.188778    0.193552
ENSG00000227232.5   chr1_14671_G_C_b38  -14882  12  12  0.00991736  0.618791    0.110828    0.222611
ENSG00000227232.5   chr7_17005_A_G_b38  -12548  18  18  0.014876    0.153674    -0.257458   0.180205

Explanation: index function does return position of start of substring if found otherwise 0, therefore 1 indicate that it is at beginning. I check for all strings you have enumerated and join them using logical OR (||).

(tested in gawk 4.2.1)