Extract md5 and last part of url

Time:08-02

I'm trying to use grep with a regex to extract an MD5 hash and a file name, both contained in a line alongside a URL. It's been brought to my attention that sed may be a better tool. Any tool I can use as part of a shell script is fine.

The data looks like this

nR220-L3-G2-Content-List
Library "JoeC_SMATerV3_RNAseq_21ple" - Barcode "TCCAACGC-AAGTCCAA" (55,208,056 paired-sequences) https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ1-Sequences.txt.gz  (2.6 GB) md5sum:  458d49cc8ac2e8437109be953ed8de68 https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ2-Sequences.txt.gz  (2.8 GB) md5sum:  5619557a80ffeed323347709f6413d81

Library "JoeC_SMATerV3_RNAseq_21ple" - Barcode "CCGTGAAG-ATCCACTG" (50,913,164 paired-sequences) https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ1-Sequences.txt.gz  (2.4 GB) md5sum:  78a919af9093a368f4e95ca5ee87b764 https://hts.igb.abc.edu/username22080132/nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ2-Sequences.txt.gz  (2.6 GB) md5sum:  ad35fc8db0861206697e24f32743616e

I want the result to look like this

458d49cc8ac2e8437109be953ed8de68  nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ1-Sequences.txt.gz
5619557a80ffeed323347709f6413d81  nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ2-Sequences.txt.gz
78a919af9093a368f4e95ca5ee87b764  nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ1-Sequences.txt.gz
ad35fc8db0861206697e24f32743616e  nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ2-Sequences.txt.gz

I can get the MD5 with grep -Po '[a-f0-9]{32}', but I can't figure out how to extract the file name from the longer URL when it's part of a long string, and how to place it after the hash.
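Building on the grep attempt above, one possible sketch is to let grep -o isolate each "URL (size) md5sum: hash" chunk on its own line and then let sed reorder the two pieces. It assumes grep and sed with -E support, and that every URL is followed by a size in parentheses and then "md5sum:". The function name extract_md5_pairs is made up for illustration.

```shell
#!/bin/sh
# Sketch: print "<md5>  <file name>" for every URL/md5sum pair in a file.
# Assumes each URL is followed by "(size) md5sum: <32 hex digits>".
# extract_md5_pairs is a hypothetical helper name.
extract_md5_pairs() {
    # Step 1: grep -o emits each "URL (size) md5sum: hash" chunk on its own line.
    # Step 2: sed keeps the last path component and the hash, hash first,
    #         separated by two spaces as in the desired output.
    grep -Eo 'https?://[^ ]+ +\([^)]*\) +md5sum: +[a-f0-9]{32}' "$1" |
        sed -E 's#^.*/([^/ ]+) +\(.*md5sum: +([a-f0-9]{32})$#\2  \1#'
}
```

It would be called as extract_md5_pairs test.txt, where test.txt stands in for whatever file holds the data. Using # as the sed delimiter avoids having to escape the slashes in the URL.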

CodePudding user response:

This is how it could be done with sed:

sed -rne 's/.* https.*[/]([^/]+\.txt\.gz) .* md5sum: *([a-f0-9]{32}) https.*[/]([^/]+\.txt\.gz) .* md5sum: *([a-f0-9]{32})/\2 \1\n\4 \3\n/gp' test.txt

Result:

458d49cc8ac2e8437109be953ed8de68 nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ1-Sequences.txt.gz
5619557a80ffeed323347709f6413d81 nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ2-Sequences.txt.gz

78a919af9093a368f4e95ca5ee87b764 nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ1-Sequences.txt.gz
ad35fc8db0861206697e24f32743616e nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ2-Sequences.txt.gz

CodePudding user response:

Using sed

$ sed -En 's/[^(]*\(?[^(]*\/([^ ]*)[^:]*:  ([^ ]*)/\2 \1\n/pg' input_file
458d49cc8ac2e8437109be953ed8de68 nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ1-Sequences.txt.gz
5619557a80ffeed323347709f6413d81 nR220-L3-G2-P010-TCCAACGC-AAGTCCAA-READ2-Sequences.txt.gz

78a919af9093a368f4e95ca5ee87b764 nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ1-Sequences.txt.gz
ad35fc8db0861206697e24f32743616e nR220-L3-G2-P011-CCGTGAAG-ATCCACTG-READ2-Sequences.txt.gz
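The sed answers print a single space between hash and file name and leave a blank line after each pair, because the replacement ends in \n and p then adds its own newline. If the exact two-space layout from the question is wanted (the layout md5sum -c reads), a field-based awk scan is one alternative. This is only a sketch: it assumes the hash is the whitespace-separated field right after "md5sum:" and that each hash belongs to the most recent URL before it. The function name md5_list is hypothetical.

```shell
#!/bin/sh
# Sketch: field-based alternative using awk instead of one large regex.
# Assumes the hash is the whitespace-separated field right after "md5sum:".
# md5_list is a hypothetical helper name.
md5_list() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^https?:\/\//) {
                # Remember the last path component of the most recent URL.
                n = split($i, parts, "/")
                name = parts[n]
            } else if ($i == "md5sum:" && i < NF && name != "") {
                # Two spaces between hash and name, as in the desired output.
                print $(i + 1) "  " name
                name = ""
            }
        }
    }' "$1"
}
```

Called as md5_list input_file, this prints one "hash  name" line per pair with no blank lines in between, and it does not depend on how many URLs appear on a line.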