How do I extract a list of root domains from a list of subdomains in Bash?

Time:11-04

I have a list that looks something like this:

mail.google.com
mail.google.co.uk
mail.google.org
my.mail.yahoo.co.nz
my.mail.google.gov
mail.aol.gov.uk

I need to use Bash to get the list to look like this:

google.com
google.co.uk
google.org
yahoo.co.nz
google.gov
aol.gov.uk

I tried following the top two answers from this. The first doesn't work in my case, since my input has no slashes. The second mostly works, but for something like mail.google.co.uk it returns co.uk.

CodePudding user response:

If you just want to strip everything up to and including mail. from the start of each line, you can use sed (lines that do not contain mail. are left unchanged):

sed 's/^.*mail\.//' file.txt

CodePudding user response:

It's always best to state your parsing rules explicitly. Here's one way to start.

Reading from right to left, there is an optional two-letter country code, then a two- or three-letter domain, then a string of "not-dots". Everything to the left of that is discarded.

With that in mind, let's try

[^.]+\.[A-Za-z]{2,3}(\.[A-Za-z]{2})?$

Tested on Regex101, this gives exactly the results you want for the examples provided. Note that it does not try to remove prefixes; it matches the part you want at the right-hand end of each line.
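To run this from Bash, one option is grep with -o (print only the matched text) and -E (extended regex). The sketch below assumes a grep that supports -o, such as GNU or BSD grep. The function name root_domains is just an illustrative helper, and the pattern is written with [^.]+ for the run of not-dots.

```bash
#!/usr/bin/env bash

# Hypothetical helper: read hostnames on stdin, print the matched tail of each.
root_domains() {
  # -o prints only the part of the line that matches;
  # -E enables extended syntax ({2,3}, ?, and grouping).
  grep -oE '[^.]+\.[A-Za-z]{2,3}(\.[A-Za-z]{2})?$'
}

# Usage with the sample list from the question.
printf '%s\n' \
  mail.google.com \
  mail.google.co.uk \
  mail.google.org \
  my.mail.yahoo.co.nz \
  my.mail.google.gov \
  mail.aol.gov.uk | root_domains
```

With a file, the same thing is root_domains < file.txt. Keep in mind this is a pattern-based guess, not a real public-suffix lookup: a name like a.bb.cc would be kept whole, because bb and cc look like a short domain plus a country code.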
