I have a long list of URLs stored in a text file which I will go through and download. But before doing this I want to remove the duplicate URLs from the list. One thing to note is that some of the URLs look different but in fact lead to the same page. The elements that make a URL unique (aside from the domain and path) are the first two parameters in the query string. So, for example, my text file would look like this:
https://www.example.com/page1.html?id=12345&key=dnks93jd&user=399494&group=23
https://www.example.com/page1.html?id=15645&key=fkldf032&user=250643&group=12
https://www.example.com/page1.html?id=26327&key=xkd9c03n&user=399494&group=15
https://www.example.com/page1.html?id=12345&key=dnks93jd&user=454665&group=12
If a URL is considered unique up to its second query-string parameter (key), then lines 1 and 4 are duplicates. I would like to remove the duplicate lines completely, not even keeping one of them. In the example above, lines 2 and 3 would remain and lines 1 and 4 would be deleted.
How can I achieve this using basic command line tools?
CodePudding user response:
Using awk:
$ awk -F'[?&]' 'FNR == NR { url[$1,$2,$3]++ ; next } url[$1,$2,$3] == 1' urls.txt urls.txt
https://www.example.com/page1.html?id=15645&key=fkldf032&user=250643&group=12
https://www.example.com/page1.html?id=26327&key=xkd9c03n&user=399494&group=15
This reads the file twice: the first time to keep a count of how many times each combination of the parts you're interested in occurs, and the second time to print only the lines whose combination showed up exactly once.
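For anyone less familiar with the FNR == NR idiom, here is the same program written out with comments (just a sketch of the one-liner above; the file name urls.txt comes from the question):

awk -F'[?&]' '
    FNR == NR {            # true only while reading the first copy of the file
        url[$1,$2,$3]++    # count each domain/path + id + key combination
        next               # skip the printing rule during the first pass
    }
    url[$1,$2,$3] == 1     # second pass: print lines whose combination occurred exactly once
' urls.txt urls.txt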
CodePudding user response:
To shorten the code from the other answer:
awk -F\& 'FNR == NR { url[$1,$2]++ ; next } url[$1,$2] == 1' urls.txt urls.txt
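This works because with & as the only field separator, the first field already contains the scheme, host, path and the id parameter, and the second field is the key parameter. For the first sample line, the fields would be:

$1 = https://www.example.com/page1.html?id=12345
$2 = key=dnks93jd

so keying the array on ($1,$2) still identifies a URL up to its second query-string parameter.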