I'm currently trying to use grep to extract an MD5 hash and a file name contained in a URL, using a regex. It's been brought to my attention that sed may be a better tool. Any tool I can use as part of a shell script is fine.
The data looks like this:
nR220-L3-G2-Content-List
Library "JoeC_SMATerV3_RNAseq_21ple" - Barcode "TCCAACGC-AAGTCCAA" (55,208,056 paired-sequences) https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ1-Sequences.txt.gz (2.6 GB) md5sum: 458d49cc8ac2e8437109be953ed8de68 https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ2-Sequences.txt.gz (2.8 GB) md5sum: 5619557a80ffeed323347709f6413d81
Library "JoeC_SMATerV3_RNAseq_21ple" - Barcode "CCGTGAAG-ATCCACTG" (50,913,164 paired-sequences) https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ1-Sequences.txt.gz (2.4 GB) md5sum: 78a919af9093a368f4e95ca5ee87b764 https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ2-Sequences.txt.gz (2.6 GB) md5sum: ad35fc8db0861206697e24f32743616e
I want the result to look like this:
458d49cc8ac2e8437109be953ed8de68 nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ1-Sequences.txt.gz
5619557a80ffeed323347709f6413d81 nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ2-Sequences.txt.gz
78a919af9093a368f4e95ca5ee87b764 nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ1-Sequences.txt.gz
ad35fc8db0861206697e24f32743616e nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ2-Sequences.txt.gz
I can get the MD5 with grep -Po '[a-f0-9]{32}', but I can't figure out how to extract the file name from the longer URL when it's part of a long string, and place it after the hash.
CodePudding user response:
This is how it could be done with GNU sed, assuming every data line holds exactly two URL/md5sum pairs:
sed -rne 's/.* https.*[/]([^/]+\.txt\.gz) .* md5sum: *([a-f0-9]{32}) https.*[/]([^/]+\.txt\.gz) .* md5sum: *([a-f0-9]{32})/\2 \1\n\4 \3/gp' test.txt
Result:
458d49cc8ac2e8437109be953ed8de68 nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ1-Sequences.txt.gz
5619557a80ffeed323347709f6413d81 nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ2-Sequences.txt.gz
78a919af9093a368f4e95ca5ee87b764 nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ1-Sequences.txt.gz
ad35fc8db0861206697e24f32743616e nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ2-Sequences.txt.gz
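If the number of URL/md5sum pairs per line might vary, a pipeline along these lines could be an alternative. It is only a sketch: it assumes every file name ends in .txt.gz, that each file name is followed by its own hash, and that no other 32-character lowercase hex string appears in the data. The variable names input and pairs are made up for the example, and grep -o is an extension found in GNU and BSD grep rather than in POSIX.

```shell
#!/bin/sh
# Sample data copied from the question. With a real file, drop the heredoc
# and let grep read the file directly.
input=$(cat <<'EOF'
nR220-L3-G2-Content-List
Library "JoeC_SMATerV3_RNAseq_21ple" - Barcode "TCCAACGC-AAGTCCAA" (55,208,056 paired-sequences) https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ1-Sequences.txt.gz (2.6 GB) md5sum: 458d49cc8ac2e8437109be953ed8de68 https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ2-Sequences.txt.gz (2.8 GB) md5sum: 5619557a80ffeed323347709f6413d81
Library "JoeC_SMATerV3_RNAseq_21ple" - Barcode "CCGTGAAG-ATCCACTG" (50,913,164 paired-sequences) https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ1-Sequences.txt.gz (2.4 GB) md5sum: 78a919af9093a368f4e95ca5ee87b764 https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ2-Sequences.txt.gz (2.6 GB) md5sum: ad35fc8db0861206697e24f32743616e
EOF
)

# Step 1: grep -o prints every match on its own line, so the stream
#         alternates between a file name and the hash that follows it.
# Step 2: paste - - joins each pair of lines into one tab-separated line.
# Step 3: awk swaps the two columns so the hash comes first.
pairs=$(printf '%s\n' "$input" \
  | grep -oE '[^/ ]+\.txt\.gz|[a-f0-9]{32}' \
  | paste - - \
  | awk '{ print $2, $1 }')

printf '%s\n' "$pairs"
```

Because each match is handled on its own, the header line is skipped automatically (nothing on it matches), and a line with one or three files would still be paired up correctly as long as the name/hash order holds.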
CodePudding user response:
Using sed
$ sed -En 's/[^(]*\(?[^(]*\/([^ ]*)[^:]*: ([^ ]*)/\2 \1\n/pg' input_file
458d49cc8ac2e8437109be953ed8de68 nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ1-Sequences.txt.gz
5619557a80ffeed323347709f6413d81 nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ2-Sequences.txt.gz
78a919af9093a368f4e95ca5ee87b764 nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ1-Sequences.txt.gz
ad35fc8db0861206697e24f32743616e nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ2-Sequences.txt.gz
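Since any tool usable from a shell script is fine, the same idea can also be sketched with awk by walking the whitespace-separated fields instead of writing one long regex. This assumes each hash is the field right after a literal "md5sum:" and that the URL it belongs to appears earlier on the same line. The names input, result, parts and name are made up for the example.

```shell
#!/bin/sh
# Sample data copied from the question. With a real file, pass the file
# name to awk instead of piping the heredoc in.
input=$(cat <<'EOF'
nR220-L3-G2-Content-List
Library "JoeC_SMATerV3_RNAseq_21ple" - Barcode "TCCAACGC-AAGTCCAA" (55,208,056 paired-sequences) https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ1-Sequences.txt.gz (2.6 GB) md5sum: 458d49cc8ac2e8437109be953ed8de68 https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ2-Sequences.txt.gz (2.8 GB) md5sum: 5619557a80ffeed323347709f6413d81
Library "JoeC_SMATerV3_RNAseq_21ple" - Barcode "CCGTGAAG-ATCCACTG" (50,913,164 paired-sequences) https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ1-Sequences.txt.gz (2.4 GB) md5sum: 78a919af9093a368f4e95ca5ee87b764 https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ2-Sequences.txt.gz (2.6 GB) md5sum: ad35fc8db0861206697e24f32743616e
EOF
)

result=$(printf '%s\n' "$input" | awk '{
  for (i = 1; i <= NF; i++) {
    if ($i ~ /^https?:\/\//) {
      # Remember the last path component of the most recent URL.
      n = split($i, parts, "/")
      name = parts[n]
    } else if ($i == "md5sum:" && i < NF) {
      # The field after "md5sum:" is the hash for the remembered file name.
      print $(i + 1), name
    }
  }
}')

printf '%s\n' "$result"
```

This version does not depend on the .txt.gz extension or on the "(2.6 GB)" size field, so it may hold up better if the listing format changes slightly.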