I am trying to get the last 2 fields, counting from right to left, with the cut command
I have a large database of about 110 million domains and subdomains.
For example:
yahoo.com
mail.yahoo.com
a.yahoo.com
a.yahoo.co.uk
In simple words, I am trying to remove the subdomains from the domains:
echo a.yahoo.aa | cut -d '.' -f 2,3
yahoo.aa
but when I try
echo yahoo.aa | cut -d '.' -f 2,3
aa
it gives me only aa
The required output is
yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk
Edit: thanks to anubhava for the suggestion.
A TLD looks like
xxxx.xx
xxx.xx
xx.xx
i.e. a ccTLD always has 2 characters at the end.
CodePudding user response:
"large database for about 110 Million domains and subdomains."
Because of this, I suggest using sed. Let the content of file.txt be
yahoo.com
mail.yahoo.com
a.yahoo.com
then
sed 's/^.*\.\([^.]*\.[^.]*\)$/\1/' file.txt
output
yahoo.com
yahoo.com
yahoo.com
Explanation: The regular expression spans the whole line (^ for start, $ for end). After the leading .*\. (anything up to and including a dot), a single capturing group holds zero or more (*) non-dots, followed by a literal dot (\.), followed by zero or more non-dots adjacent to the end of the line. I replace the whole line with the content of that group. Lines with fewer than two dots do not match and are printed unchanged.
Disclaimer: this solution assumes there is always at least one dot in each line.
(tested in GNU sed 4.2.2)
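The command above keeps exactly two labels, so a.yahoo.co.uk would come out as co.uk. A minimal sketch of one way to extend it, assuming the rule from the question (a short second-level label followed by a 2-character ccTLD means three labels should be kept). The function name strip_subdomains and the 2-3 character length test are assumptions for illustration, not part of the answer above:

```shell
# strip_subdomains is a hypothetical helper name.
# Heuristic (assumption): a 2-3 character label followed by a 2-character
# final label is treated as a two-part suffix such as co.uk.
strip_subdomains() {
  sed -E \
    -e 's/^(.*\.)?([^.]+\.[^.]{2,3}\.[^.]{2})$/\2/' \
    -e 't' \
    -e 's/^(.*\.)?([^.]+\.[^.]+)$/\2/' \
    "$@"
}

# First expression: keep three labels when the line ends in xx.xx or xxx.xx.
# 't' skips the second expression if the first one already matched.
# Second expression: otherwise keep only the last two labels.
printf '%s\n' yahoo.com mail.yahoo.com a.yahoo.co.uk yahoo.co.uk | strip_subdomains
```

This is only a length-based guess: a name such as www.ab.de would keep all three labels. A list of real public suffixes would be needed for exact results.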
CodePudding user response:
Try
echo a.yahoo.aa | awk -F'.' '{print $(NF-1)"."$NF}'
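A sketch of the same awk idea applied to a whole file or to standard input, printing the second-to-last field before the last one. The function name last_two_labels is hypothetical, and the guard for lines without a dot is an added assumption:

```shell
# last_two_labels is a hypothetical helper name.
last_two_labels() {
  # With two or more dot-separated fields, print the last two in order;
  # lines without a dot are passed through unchanged.
  awk -F'.' 'NF >= 2 { print $(NF-1) "." $NF; next } { print }' "$@"
}

printf '%s\n' a.yahoo.aa yahoo.aa | last_two_labels
```

Like any fixed two-label approach, this reduces a.yahoo.co.uk to co.uk.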
CodePudding user response:
You are selecting only fields 2 and 3. You need to select from field 2 up to the end:
... | cut -d '.' -f 2-
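cut can only count fields from the left, so a common workaround for taking fields from the right is to reverse each line, cut, and reverse again. A minimal sketch, assuming the rev utility is installed; the function name last_two_fields is hypothetical:

```shell
# last_two_fields is a hypothetical helper name.
last_two_fields() {
  # Reverse each line, keep the first two dot-separated fields
  # (the last two of the original line), then reverse back.
  rev "$@" | cut -d '.' -f 1,2 | rev
}

printf '%s\n' a.yahoo.aa yahoo.aa | last_two_fields
```

Both input lines come out as yahoo.aa. It still keeps exactly two fields, so a.yahoo.co.uk becomes co.uk.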
CodePudding user response:
A longer solution, but I think it does what you want:
Executable file domain.awk:
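#! /usr/bin/awk -f
BEGIN {
    FS = "."    # split each line on dots
}
{
    ret = $NF   # always keep the last label (the TLD)
    if (NF >= 2 && (length($(NF - 1)) == 2 || length($(NF - 1)) == 3)) {
        # a 2 or 3 character second-to-last label is treated as part of
        # a two-level suffix such as co.uk, so keep one more label
        ret = $(NF - 1) "." ret
        if (NF >= 3) {
            ret = $(NF - 2) "." ret
        }
    } else if (NF >= 2) {
        # ordinary case: keep the last two labels
        ret = $(NF - 1) "." ret
    }
    print ret
}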
With the domains.lst file:
yahoo.com
mail.yahoo.com
a.yahoo.com
a.yahoo.co.uk
aus.co.au
Used like this:
./domain.awk domains.lst
Output:
yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk
aus.co.au