Need to remove subdomains from domains

Time:06-24

I am trying to get the last 2 fields, counted from the right, using the cut command.

I have a large database of about 110 million domains and subdomains.

For example:

yahoo.com
mail.yahoo.com
a.yahoo.com
a.yahoo.co.uk

In simple words, I am trying to strip the subdomains and keep only the domains.

echo a.yahoo.aa | cut -d '.' -f 2,3
yahoo.aa

but when I try

echo yahoo.aa | cut -d '.' -f 2,3
aa

it gives me only aa

Required output is

yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk

Edit: thanks to anubhava for the suggestion.

A TLD looks like

xxxx.xx
xxx.xx
xx.xx

i.e. a ccTLD always ends in 2 characters.

CodePudding user response:

large database for about 110 Million domains and subdomains.

Because of that size, I suggest using sed here. Let file.txt contain

yahoo.com
mail.yahoo.com
a.yahoo.com

then

sed 's/^.*\.\([^.]*\.[^.]*\)$/\1/' file.txt

output

yahoo.com
yahoo.com
yahoo.com

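Explanation: the regular expression spans the whole line (^ is the start, $ is the end). The leading ^.*\. consumes everything up to and including the dot in front of the last two labels. The single capturing group then holds zero or more (*) non-dot characters, followed by a literal dot (\.), followed by zero or more non-dot characters adjacent to the end of the line. The whole line is replaced with the content of that group. Lines with fewer than two dots, such as yahoo.com, do not match and are printed unchanged.

Disclaimer: this solution assumes there is always at least one dot in each line.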

(tested in GNU sed 4.2.2)
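
Because the expression above always keeps exactly two labels, a.yahoo.co.uk would come out as co.uk. Below is a minimal sketch of one way to extend it, assuming the rule from the question (a 2-character ccTLD preceded by a 2- or 3-character label is treated as a two-part suffix). The function name strip_subdomains is hypothetical, and the rule is only a heuristic: a name such as www.bit.ly would be kept whole. A list of real public suffixes would be more reliable than label lengths.

```shell
# Hypothetical helper: reads hostnames on stdin, one per line.
strip_subdomains() {
    # 1st expression: keep three labels when the line ends in
    #   <2-3 chars>.<2 chars>, e.g. yahoo.co.uk (assumed ccTLD heuristic).
    # 't' skips the 2nd expression if the 1st one matched.
    # 2nd expression: otherwise keep only the last two labels.
    sed -E \
        -e 's/^(.*\.)?([^.]+\.[^.]{2,3}\.[^.]{2})$/\2/' -e 't' \
        -e 's/^.*\.([^.]+\.[^.]+)$/\1/'
}

printf '%s\n' yahoo.com mail.yahoo.com a.yahoo.co.uk aus.co.au | strip_subdomains
# prints:
# yahoo.com
# yahoo.com
# yahoo.co.uk
# aus.co.au
```

With a real file this would be used as strip_subdomains < file.txt.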

CodePudding user response:

Try

echo a.yahoo.aa | awk -F'.' '{print $(NF-1)"."$NF}'
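
A slightly more defensive sketch of the same awk idea: it prints the last two fields in their original order and passes lines with fewer than two fields through unchanged. The function name last_two_awk is hypothetical. Like the one-liner, it ignores two-part suffixes such as co.uk.

```shell
# Hypothetical helper: reads hostnames on stdin, one per line.
last_two_awk() {
    # NF is the number of dot-separated fields on the current line.
    awk -F'.' 'NF >= 2 { print $(NF-1) "." $NF; next } { print }'
}

printf '%s\n' a.yahoo.aa yahoo.aa localhost | last_two_awk
# prints:
# yahoo.aa
# yahoo.aa
# localhost
```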

CodePudding user response:

You are selecting only fields 2 and 3. You need to select from field 2 up to the end:

 ... | cut -d '.' -f 2-
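
Note that cut numbers fields from the left, so a fixed field list cannot express "the last two fields" when the number of labels varies. One common workaround, sketched here under the assumption that the rev utility (from util-linux on Linux) is available, is to reverse each line, cut the first two fields, and reverse again. The function name last_two_labels is hypothetical.

```shell
# Hypothetical helper: reads hostnames on stdin, one per line.
last_two_labels() {
    # a.yahoo.aa -> aa.oohay.a -> aa.oohay -> yahoo.aa
    rev | cut -d '.' -f 1,2 | rev
}

printf '%s\n' a.yahoo.aa yahoo.aa | last_two_labels
# prints:
# yahoo.aa
# yahoo.aa
```

This keeps exactly two labels, so it does not handle suffixes such as co.uk.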

CodePudding user response:

A longer solution, but I think it does what you want:

Executable file domain.awk:

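#! /usr/bin/awk -f

BEGIN {
    FS="."    # split each line into dot-separated labels
}
{
    ret = $NF    # always keep the last label (the TLD)
    # Heuristic: a 2- or 3-character second-to-last label (co, com, ...)
    # is treated as part of a two-part suffix such as co.uk.
    if (NF >= 2 && (length($(NF - 1)) == 2 || length($(NF - 1)) == 3)) {
        ret = $(NF - 1) "." ret
        if (NF >= 3) {
            # keep one more label in front of the two-part suffix
            ret = $(NF - 2) "." ret
        }
    } else if (NF >= 2) {
        # ordinary case: keep only the last two labels
        ret = $(NF - 1) "." ret
    }
    print ret
}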

with domains.lst file:

yahoo.com
mail.yahoo.com
a.yahoo.com
a.yahoo.co.uk
aus.co.au

Used like this:

./domain.awk domains.lst

Output:

yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk
aus.co.au