How can I extract subdomains from a json file?

Time:10-30

I have a long list of JSON files. I want to extract the subdomain of harvard.edu, which is stored in the host field ("host": "ceonlineb2b.hms.harvard.edu"), using bash. I would be grateful for any help. Below is only a snippet of one JSON file.

{
  "data": {
    "total_items": 3,
    "offset": 0,
    "limit": 1,
    "items": [
      {
        "name": "ceonlineb2b.hms.harvard.edu",
        "alexa": null,
        "cert_summary": null,
        "dns_records": {
          "A": [
            "3.221.168.206",
            "54.174.253.3"
          ],
          "AAAA": null,
          "CAA": null,
          "CNAME": [
            "hms-moodleb2b-prod.cabem.com"
          ],
          "MX": null,
          "NS": null,
          "SOA": null,
          "TXT": null,
          "SPF": null,
          "updated_at": "2021-05-14T23:12:43.332816923Z"
        },
        "hosts_enrichment": [
          {
            "ip": "3.221.168.206",
            "as_num": 14618,
            "as_org": "amazon-aes",
            "isp": "amazon.com",
            "city_name": "ashburn",
            "country": "united states",
            "country_iso_code": "us",
            "location": {
              "lat": 39.0481,
              "lon": -77.4728
            }
          },
          {
            "ip": "54.174.253.3",
            "as_num": 14618,
            "as_org": "amazon-aes",
            "isp": "amazon.com",
            "city_name": "ashburn",
            "country": "united states",
            "country_iso_code": "us",
            "location": {
              "lat": 39.0481,
              "lon": -77.4728
            }
          }
        ],
        "http_extract": {
          "cookies": [
            {
              "domain": "",
              "expire": "0001-01-01T00:00:00Z",
              "http_only": true,
              "key": "MoodleSession",
              "max_age": 0,
              "path": "/",
              "security": true,
              "value": "tqhmqc4muk513sad1bmnl3kocj"
            }
          ],
          "description": "",
          "emails": null,
          "final_redirect_url": {
            "full_uri": "https://ceonlineb2b.hms.harvard.edu/login/index.php",
            "host": "ceonlineb2b.hms.harvard.edu",
            "path": "/login/index.php"
          },
          "extracted_at": "2020-10-04T20:55:26.043777194Z",
          "favicon_sha256": "",
          "http_headers": [
            {
              "name": "date",
              "value": "Sun, 04 Oct 2020 20:55:25 GMT"
            },
            {
              "name": "content-type",
              "value": "text/html; charset=utf-8"
            },
            {
              "name": "server",
              "value": "Apache/2.4.46 () OpenSSL/1.0.2k-fips"
            },
            {
              "name": "x-powered-by",
              "value": "PHP/7.2.24"
            },
            {
              "name": "content-language",
              "value": "en"
            },
            {
              "name": "content-script-type",
              "value": "text/javascript"
            },
            {
              "name": "content-style-type",
              "value": "text/css"
            },
            {
              "name": "x-ua-compatible",
              "value": "IE=edge"
            },
            {
              "name": "cache-control",
              "value": "private, pre-check=0, post-check=0, max-age=0, no-transform"
            },
            {
              "name": "pragma",
              "value": "no-cache"
            },
            {
              "name": "expires",
              "value": ""
            },
            {
              "name": "accept-ranges",
              "value": "none"
            },
            {
              "name": "set-cookie",
              "value": "MoodleSession=tqhmqc4muk513sad1bmnl3kocj; path=/; secure;HttpOnly;Secure;SameSite=None"
            }
          ],
          "http_status_code": 200,
          "links": [
            {
              "anchor": "Forgotten your username or password?",
              "url": "https://ceonlineb2b.hms.harvard.edu/login/forgot_password.php",
              "url_host": "ceonlineb2b.hms.harvard.edu"
            },
            {
              "anchor": "Privacy Statement",
              "url": "/local/staticpage/view.php?page=privacy-statement",
              "url_host": ""
            },
            {
              "anchor": "Terms of Service",
              "url": "/local/staticpage/view.php?page=terms-of-service",
              "url_host": ""
            },
            {
              "anchor": "Copyright Information",
              "url": "/local/staticpage/view.php?page=copyright-information",
              "url_host": ""
            }
          ],
          "meta_tags": [
            {
              "name": "keywords",
              "value": "moodle, HMS Postgraduate Courses: Log in to the site"
            },
            {
              "name": "format-detection",
              "value": "telephone=no"
            },
            {
              "name": "robots",
              "value": "noindex"
            },
            {
              "name": "viewport",
              "value": "width=device-width, initial-scale=1.0"
            }
          ],
          "robots_txt": "",
          "scripts": [
            "https://ceonlineb2b.hms.harvard.edu/theme/yui_combo.php?rollup/3.17.2/yui-moodlesimple-min.js",
            "https://ceonlineb2b.hms.harvard.edu/lib/javascript.php/1589465014/lib/javascript-static.js",
            "https://ceonlineb2b.hms.harvard.edu/lib/javascript.php/1589465014/lib/requirejs/require.min.js",
            "https://ceonlineb2b.hms.harvard.edu/theme/javascript.php/hms/1589465013/footer"
          ],
          "styles": [
            "https://ceonlineb2b.hms.harvard.edu/theme/yui_combo.php?rollup/3.17.2/yui-moodlesimple-min.css",
            "https://ceonlineb2b.hms.harvard.edu/theme/styles.php/hms/1589465013_1/all"
          ],
          "title": "HMS Postgraduate Courses: Log in to the site"
        },
        "is_CNAME": null,
        "is_MX": null,
        "is_NS": null,
        "is_PTR": null,
        "is_subdomain": true,
        "name_without_suffix": "ceonlineb2b.hms.harvard",
        "updated_at": "2021-05-16T10:25:01.59086376Z",
        "user_scan_at": null,
        "whois_parsed": null,
        "security_score": {
          "score": 100
        },
        "cve_list": null,
        "technologies": [
          {
            "name": "Moodle",
            "version": ""
          },
          {
            "name": "RequireJS",
            "version": ""
          }
        ],
        "trackers": null,
        "organizations": null
      }
    ]
  }
}

CodePudding user response:

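For JSON parsing in bash, I recommend checking out jq. It's lightweight and versatile.

We can use the -r flag to print string values without the surrounding JSON quotes.

--raw-output / -r:

If the filter's result is a string, it is written directly to standard output rather than being formatted as a JSON string with quotes.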

The structure of the JSON you provided has the subdomain at .data.items[].http_extract.final_redirect_url.host

{
  "data": {
    "items": [
      {
        "http_extract": {
          "final_redirect_url": {
            "full_uri": "https://ceonlineb2b.hms.harvard.edu/login/index.php",
            "host": "ceonlineb2b.hms.harvard.edu",
            "path": "/login/index.php"
          },
        ...

I've saved your json to a file, se.json

Example extracting full domain with jq

jq -r '.data.items[].http_extract.final_redirect_url.host' se.json

Output

ceonlineb2b.hms.harvard.edu
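Since the question mentions a long list of files, the same filter can be run over all of them at once, because jq accepts several input files. The following is a minimal sketch, assuming every file has the same layout as the snippet; list_hosts is a hypothetical helper name, and the `?` and `// empty` parts are additions that skip files or items where the field is missing or null.

```shell
#!/usr/bin/env bash
# list_hosts FILE...: print each distinct final-redirect host once.
# (Hypothetical helper; assumes the JSON layout shown in the question.)
list_hosts() {
  # []?      -> do not fail if .data.items is null or missing
  # // empty -> drop items whose host is null instead of printing "null"
  jq -r '.data.items[]?.http_extract.final_redirect_url.host // empty' "$@" | sort -u
}
```

It could then be called as list_hosts *.json from the directory that holds the files.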

To extract the subdomain, just perform a search/replace using sub().

sub(regex; tostring) sub(regex; tostring; flags)

Emit the string obtained by replacing the first match of regex in the input string with tostring, after interpolation. tostring should be a jq string, and may contain references to named captures. The named captures are, in effect, presented as a JSON object (as constructed by capture) to tostring, so a reference to a captured variable named "x" would take the form: "\(.x)".

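Extracting subdomain using jq (the dots are escaped and the pattern is anchored with $ so that it only matches a literal trailing .hms.harvard.edu)

jq -r '.data.items[].http_extract.final_redirect_url.host | sub("\\.hms\\.harvard\\.edu$";"")' se.json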

Output

ceonlineb2b
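If the hosts do not all end in .hms.harvard.edu, the suffix can also be removed in bash itself once jq has printed the host. This is a rough sketch using parameter expansion; strip_domain is a hypothetical helper name, and it assumes you want everything to the left of the given parent domain.

```shell
#!/usr/bin/env bash
# strip_domain HOST DOMAIN: print the labels of HOST that precede ".DOMAIN".
# Returns 1 if HOST is not a subdomain of DOMAIN. (Hypothetical helper.)
strip_domain() {
  local host=$1 domain=$2
  case $host in
    # Quoting "$domain" makes it match literally, so dots are not wildcards.
    *."$domain") printf '%s\n' "${host%."$domain"}" ;;
    *) return 1 ;;
  esac
}

strip_domain ceonlineb2b.hms.harvard.edu harvard.edu   # prints ceonlineb2b.hms
```

To combine it with the jq command above, pipe the jq output into a while read -r host loop and call strip_domain "$host" harvard.edu for each line.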