Reliable method of bringing all domain names to the same format


I'm dealing with a database of about a billion distinct websites. The problem is that it contains all kinds of different spellings:

example.com
subdomain.example.com
http://example.com
https://example.com
https://example.com/
https://sub.example.com
...

Also, some domains are outdated and have permanent redirects.

How do I clean this mess?

I'm using curl {domain} -s -L -I -o /dev/null -w '%{url_effective}' to get the effective URL, and the results are promising.
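The loop around it is roughly this (just a sketch; domains.txt is a placeholder for wherever the raw entries actually live):

    # read raw entries one per line and print "entry -> effective URL"
    while IFS= read -r domain; do
        effective=$(curl "$domain" -s -L -I -o /dev/null -w '%{url_effective}')
        printf '%s -> %s\n' "$domain" "$effective"
    done < domains.txt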

Some problems I've noticed:

  1. Some domains may have JS-based redirects that curl can't resolve. There are too few of these to bother about.
  2. When both http and https are available, it returns the http one. It would be nice to prioritize https (see the sketch after this list).
  3. Non-domain strings are also being resolved, e.g. curl notadomain -s -L -I -o /dev/null -w '%{url_effective}' outputs http://notadomain/. It would be better if curl threw errors in such cases.
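By prioritizing https in point 2, I mean roughly this kind of fallback (untested sketch; normalize_one is a made-up name):

    # try https first; fall back to http only if the https attempt fails at the
    # connection level (DNS failure, refused connection, TLS error -> non-zero exit)
    normalize_one() {
        bare=${1#http://}
        bare=${bare#https://}
        bare=${bare%%/*}
        if ! url=$(curl "https://$bare" -s -L -I -o /dev/null -w '%{url_effective}'); then
            url=$(curl "http://$bare" -s -L -I -o /dev/null -w '%{url_effective}')
        fi
        printf '%s\n' "$url"
    }

    normalize_one sub.example.com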

How do I solve the above problems? Especially the last one.

Are there any more drawbacks in this solution that I don't see right now?

An alternative idea I had is to resolve the server IP behind each domain.
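By that I mean something along these lines (again only a sketch; getent hosts could be swapped for dig +short A):

    # DNS-only check: keep entries whose hostname actually resolves
    while IFS= read -r domain; do
        host=${domain#http://}
        host=${host#https://}
        host=${host%%/*}
        ip=$(getent hosts "$host" | awk '{ print $1; exit }')
        printf '%s -> %s\n' "$host" "${ip:-NO-DNS}"
    done < domains.txt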

CodePudding user response:

Are there any more drawbacks in this solution that I don't see right now?

Can't answer that one very well without knowing the whats and whys of what you are doing.

curl does not need to throw errors, but it does give you enough info to make that determination on your own. These are the returned curl values:

content_type
http_code
header_size
request_size
filetime
ssl_verify_result
redirect_count
total_time
namelookup_time
connect_time
pretransfer_time
size_upload
size_download
speed_download
speed_upload
download_content_length
upload_content_length
starttransfer_time
redirect_time
redirect_url
primary_ip
certinfo
request_header
response_headers

These are typical values for an HTTP request that redirects to HTTPS:

content_type = text/html; charset=UTF-8
http_code = 200
header_size = 342
request_size = 98
filetime = -1
ssl_verify_result = 20
redirect_count = 1
total_time = 0.202927
namelookup_time = 0.000971
connect_time = 0.059346
pretransfer_time = 0.094196
size_upload = 0.0
size_download = 4753.0
speed_download = 23422.0
speed_upload = 0.0
download_content_length = -1.0
upload_content_length = 0.0
starttransfer_time = 0.199499
redirect_time = 0.060231
redirect_url = 
primary_ip = 99.999.999.999
certinfo = 
primary_port = 443
local_ip = 88.888.888.888
local_port = 55530

HTTP/1.1 301 Moved Permanently
Date: Fri, 21 Oct 2022 00:44:18 GMT
Server: Apache
Location: https://example.com/
Content-Length: 227
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 200 OK
Date: Fri, 21 Oct 2022 00:44:18 GMT
Server: Apache
Vary: User-Agent
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

Which is quite different from a URL with no DNS:

content_type = NULL
http_code = 0
header_size = 0
request_size = 0
filetime = -1
ssl_verify_result = 0
redirect_count = 0
total_time = 0.221879
namelookup_time = 0.0
connect_time = 0.0
pretransfer_time = 0.0
size_upload = 0.0
size_download = 0.0
speed_download = 0.0
speed_upload = 0.0
download_content_length = -1.0
upload_content_length = -1.0
starttransfer_time = 0.0
redirect_time = 0.0
redirect_url = 
primary_ip = 
certinfo = 
primary_port = 0
local_ip = 
local_port = 0
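If you are doing this from the command line rather than through a library binding, the same distinction is available via curl's -w write-out variables and its exit status (the -w names differ slightly from the list above, e.g. %{time_total}, %{num_redirects}, %{remote_ip}). A minimal sketch of the check:

    # exit code 6 means the host name could not be resolved,
    # and %{http_code} is 000 when no response was received at all
    domain="notadomain"
    out=$(curl "$domain" -s -L -I -o /dev/null -w '%{http_code} %{url_effective}')
    status=$?
    code=${out%% *}
    effective=${out#* }
    if [ "$status" -eq 6 ] || [ "$code" = "000" ]; then
        echo "$domain: no DNS, drop it"
    else
        echo "$domain -> $effective (HTTP $code)"
    fi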