Text file:
https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/
Expected Output:
https://www.google.com/1/
https://www.bing.com
What I Tried
awk -F'/' '!a[$3]++' $file
Output:
https://www.google.com/1/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
I have already tried various commands and none of them works as expected. I just want to keep exactly one URL per domain from the list.
Please tell me how I can do this with a Bash script or with Python.
PS: I want to filter and save the full URLs from the list, not only the root domains.
CodePudding user response:
With awk and / as the field separator:
awk -F '/' '!seen[$3]++' file
seen[$3] is zero (false) the first time a given host appears as the third /-separated field, so !seen[$3]++ is true only for the first URL of each domain, and awk's default action prints that line.
If your file contains Windows line breaks (carriage returns), the trailing \r becomes part of the third field on URLs with no path (www.google.com\r is not the same key as www.google.com), which is why duplicates slip through. In that case I suggest:
dos2unix < file | awk -F '/' '!seen[$3]++'
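If dos2unix is not installed, a sketch of the same idea in plain awk, stripping the carriage return before deduplicating (an alternative I am suggesting, not part of the original answer):
# sub() removes a trailing \r from the record, which also re-splits the fields
awk -F '/' '{ sub(/\r$/, "") } !seen[$3]++' file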
Output:
https://www.google.com/1/
https://www.bing.com
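Since the question also asks for Python, here is a minimal sketch of the same first-one-wins logic; the filename urls.txt is an assumption, and urllib.parse.urlsplit is used to extract the host instead of splitting on /:

from urllib.parse import urlsplit

seen = set()
with open("urls.txt", encoding="utf-8") as fh:  # filename is an assumption
    for line in fh:
        url = line.strip()  # strip() also removes a trailing \r
        if not url:
            continue
        host = urlsplit(url).netloc  # e.g. www.google.com
        if host not in seen:  # keep only the first URL seen for each host
            seen.add(host)
            print(url)  # print the full URL, not just the domain

This keeps input order and prints the full URL of the first line seen for each host, matching the expected output above.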