I have a file with 7 columns and 192 million lines. I want to filter the file so that it only keeps rows whose second column begins with chr1_ or chr7_.
head file.txt.gz
gene_id variant_id tss_distance ma_samples ma_count maf pval_nominal slope slope_se
ENSG00000227232.5 chr1_13550_G_A_b38 -16003 16 16 0.0132231 0.329834 0.188778 0.193552
ENSG00000227232.5 chr1_14671_G_C_b38 -14882 12 12 0.00991736 0.618791 0.110828 0.222611
ENSG00000227232.5 chr2_14677_G_A_b38 -14876 60 60 0.0495868 0.378305 -0.090737 0.102905
ENSG00000227232.5 chr3_16841_G_T_b38 -12712 46 46 0.0380165 0.100419 -0.191008 0.116067
ENSG00000227232.5 chrX_16856_A_G_b38 -12697 10 10 0.00826446 0.708684 -0.0901965 0.241282
ENSG00000227232.5 chrX_17005_A_G_b38 -12548 18 18 0.014876 0.153674 -0.257458 0.180205
ENSG00000227232.5 chr4_17005_A_G_b38 -12548 18 18 0.014876 0.153674 -0.257458 0.180205
ENSG00000227232.5 chr7_17005_A_G_b38 -12548 18 18 0.014876 0.153674 -0.257458 0.180205
Desired output:
head file.txt.gz
gene_id variant_id tss_distance ma_samples ma_count maf pval_nominal slope slope_se
ENSG00000227232.5 chr1_13550_G_A_b38 -16003 16 16 0.0132231 0.329834 0.188778 0.193552
ENSG00000227232.5 chr1_14671_G_C_b38 -14882 12 12 0.00991736 0.618791 0.110828 0.222611
ENSG00000227232.5 chr7_17005_A_G_b38 -12548 18 18 0.014876 0.153674 -0.257458 0.180205
The second column has values in the format chr<number>_<number>_<letter>_<letter>_b38. The numbers and letters vary, e.g. chr4_17005_A_G_b38 or chr7_17090_A_T_b38. I only want rows where the second column begins with chr1_ or chr7_. How would I do this using awk?
I have tried
gunzip -c file.txt.gz | awk '$2 ~ /^chr1/' > output.txt
However, the output also contains chr19 and chr10, i.e. every chromosome whose name starts with 1. I am also unsure how to include chr7.
CodePudding user response:
You may use:
gunzip -c file.txt.gz | awk '$2 ~ /^chr[17]_/' > output.txt
^chr[17]_ will match chr1_ or chr7_ right at the start of the field. Adding the _ makes sure that we don't match chr10 or chr75.
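Note that the header line does not match this pattern, so the one-liner above drops it, while the desired output in the question keeps it. A minimal sketch of one way to keep the header, assuming it is always the first line of the file; sample.txt.gz and filtered.txt are hypothetical names used only for this illustration, and the rows are shortened to three columns:

```shell
# Build a tiny gzipped sample (hypothetical file name) to try the filter on.
printf '%s\n' \
  'gene_id variant_id tss_distance' \
  'ENSG00000227232.5 chr1_13550_G_A_b38 -16003' \
  'ENSG00000227232.5 chr10_14677_G_A_b38 -14876' \
  'ENSG00000227232.5 chr7_17005_A_G_b38 -12548' \
  'ENSG00000227232.5 chr19_17005_A_G_b38 -12548' \
  | gzip -c > sample.txt.gz

# NR == 1 keeps the first (header) line; the regex keeps only rows whose
# second field starts with chr1_ or chr7_.
gunzip -c sample.txt.gz | awk 'NR == 1 || $2 ~ /^chr[17]_/' > filtered.txt
cat filtered.txt
```

For the real 192-million-line file, the same awk program can be piped through gzip instead of being redirected to a plain text file if the result should stay compressed.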
CodePudding user response:
To check whether a string starts with a given prefix, you can use the index function. Let the content of file.txt be
gene_id variant_id tss_distance ma_samples ma_count maf pval_nominal slope slope_se
ENSG00000227232.5 chr1_13550_G_A_b38 -16003 16 16 0.0132231 0.329834 0.188778 0.193552
ENSG00000227232.5 chr1_14671_G_C_b38 -14882 12 12 0.00991736 0.618791 0.110828 0.222611
ENSG00000227232.5 chr2_14677_G_A_b38 -14876 60 60 0.0495868 0.378305 -0.090737 0.102905
ENSG00000227232.5 chr3_16841_G_T_b38 -12712 46 46 0.0380165 0.100419 -0.191008 0.116067
ENSG00000227232.5 chrX_16856_A_G_b38 -12697 10 10 0.00826446 0.708684 -0.0901965 0.241282
ENSG00000227232.5 chrX_17005_A_G_b38 -12548 18 18 0.014876 0.153674 -0.257458 0.180205
ENSG00000227232.5 chr4_17005_A_G_b38 -12548 18 18 0.014876 0.153674 -0.257458 0.180205
ENSG00000227232.5 chr7_17005_A_G_b38 -12548 18 18 0.014876 0.153674 -0.257458 0.180205
then
awk 'index($2,"chr1_")==1||index($2,"chr7_")==1' file.txt
gives output
ENSG00000227232.5 chr1_13550_G_A_b38 -16003 16 16 0.0132231 0.329834 0.188778 0.193552
ENSG00000227232.5 chr1_14671_G_C_b38 -14882 12 12 0.00991736 0.618791 0.110828 0.222611
ENSG00000227232.5 chr7_17005_A_G_b38 -12548 18 18 0.014876 0.153674 -0.257458 0.180205
Explanation: the index function returns the position at which the substring starts if it is found, and 0 otherwise, so a result of 1 means the substring sits at the very beginning. I check each of the prefixes you listed and join the checks with logical OR (||).
(tested in gawk 4.2.1)
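If the list of prefixes grows, chaining index calls with || gets unwieldy. The following is only a sketch of one possible generalization, not part of the tested answer above: the prefixes are passed in a comma-separated awk variable (the name prefixes is an assumption made for this example), and the first line is assumed to be a header that should be kept. sample.txt and filtered_index.txt are hypothetical file names.

```shell
# Small sample file (hypothetical name) with shortened rows.
printf '%s\n' \
  'gene_id variant_id tss_distance' \
  'ENSG00000227232.5 chr1_13550_G_A_b38 -16003' \
  'ENSG00000227232.5 chr10_14677_G_A_b38 -14876' \
  'ENSG00000227232.5 chr7_17005_A_G_b38 -12548' \
  'ENSG00000227232.5 chrX_16856_A_G_b38 -12697' \
  > sample.txt

awk -v prefixes='chr1_,chr7_' '
  BEGIN   { n = split(prefixes, p, ",") }   # p[1..n] holds the wanted prefixes
  NR == 1 { print; next }                   # keep the header line
  {
    # Print the row as soon as one prefix is found at position 1 of field 2.
    for (i = 1; i <= n; i++)
      if (index($2, p[i]) == 1) { print; next }
  }
' sample.txt > filtered_index.txt
cat filtered_index.txt
```

Because index does a plain substring search rather than a regex match, the prefixes need no escaping.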