Iterating through all sitemap XML nodes to get website URLs

Time:12-06

If I have a sitemap XML file that contains URLs or references to other sitemaps, what is the best C# code to get all of the website's URLs?

Also, do I need to check the file's schema while iterating?

The sitemap looks like this:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" generatedBy="ANY">
<sitemap>
<loc>https://www.example.com/forum-categories-sitemap.xml</loc>
<lastmod>2022-11-04</lastmod>
</sitemap>
<sitemap>
<loc>https://www.example.com/forum-posts-sitemap.xml</loc>
<lastmod>2022-11-04</lastmod>
</sitemap>
<sitemap>
<loc>https://www.example.com/blog-posts-sitemap.xml</loc>
<lastmod>2022-11-04</lastmod>
</sitemap>
<sitemap>
<loc>https://www.example.com/blog-categories-sitemap.xml</loc>
<lastmod>2022-11-04</lastmod>
</sitemap>
<sitemap>
<loc>https://www.example.com/pages-sitemap.xml</loc>
<lastmod>2022-11-03</lastmod>
</sitemap>
<url>
<loc>https://www.example.com/post/somelink</loc>
<lastmod>2022-09-15</lastmod>
</url>
<url>
<loc>https://www.example.com/post/otherlink</loc>
<lastmod>2022-09-05</lastmod>
</url>
</sitemapindex>

CodePudding user response:

You have a couple of choices, depending on your use case:

  1. Use XmlDocument (from System.Xml).
XmlDocument doc = new XmlDocument();

doc.Load("<your site map URL>"); //or you can doc.LoadXml("xml sitemap string")
var urls = doc.GetElementsByTagName("loc");

foreach (XmlElement loc in urls)
{
    Console.WriteLine(loc.InnerText);
}
  2. Use XElement and LINQ to XML (from System.Xml.Linq). You must include the sitemap's XML namespace in the element name for the lookup to work.
XElement root = XElement.Load("<Your sitemap URL>"); //or you can XElement.Parse("xml sitemap string")

XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

var urls = from urlTag in root.Descendants(ns + "loc")
           select urlTag.Value;

foreach (var loc in urls)
{
    Console.WriteLine(loc);
}
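
Both snippets above return every <loc> value, so child sitemap addresses and page URLs end up in the same list. A minimal sketch of one way to keep them apart is to look up <loc> under <sitemap> and under <url> separately. The class name SitemapParser and the method SplitLocs are hypothetical names chosen for illustration; the namespace URI is the one from the question's sample.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SitemapParser
{
    // Namespace taken from the sample sitemap in the question.
    private static readonly XNamespace Ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Returns the <loc> values grouped by the element that contains them:
    // <sitemap> entries point to child sitemaps, <url> entries are page URLs.
    public static (List<string> ChildSitemaps, List<string> PageUrls) SplitLocs(string xml)
    {
        XElement root = XElement.Parse(xml);

        // Collects the trimmed, non-empty <loc> text found directly under each <parentName> element.
        List<string> LocsUnder(string parentName) =>
            root.Descendants(Ns + parentName)
                .Elements(Ns + "loc")
                .Select(e => e.Value.Trim())
                .Where(v => v.Length > 0)
                .ToList();

        return (LocsUnder("sitemap"), LocsUnder("url"));
    }
}
```

Called on the sample from the question, SplitLocs would be expected to place the five *-sitemap.xml addresses in ChildSitemaps and the two /post/ links in PageUrls. Because the lookup is driven by element names, the same code handles a file whose root is <urlset> as well as one whose root is <sitemapindex>.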

Hope this helps.

CodePudding user response:

I solved the problem by using HttpClient to load the sitemap instead of XElement.Load. If a <loc> node belongs to a <sitemap> entry, the code loads that child sitemap recursively; if it belongs to a <url> entry, the URL is added to the result. I was looking for an easier way, but this is the best I could do.

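using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using System.Xml;

public static class SitemapReader
{
    // Reuse one HttpClient for all requests instead of creating a new one per call.
    private static readonly HttpClient httpClient = CreateClient();

    private static HttpClient CreateClient()
    {
        // Redirects are not followed, so a redirected sitemap URL fails the status check below.
        var client = new HttpClient(new HttpClientHandler { AllowAutoRedirect = false });
        client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36");
        return client;
    }

    public static async Task<HttpResponseMessage> GetResponseMessage(string link)
    {
        return await httpClient.GetAsync(link);
    }

    public static async Task<List<string>> GetSitemapUrls(string link)
    {
        var res = new List<string>();
        using var responseMessage = await GetResponseMessage(link);
        responseMessage.EnsureSuccessStatusCode(); // throw early instead of parsing an error page as XML
        string responseContent = await responseMessage.Content.ReadAsStringAsync();

        var doc = new XmlDocument();
        doc.LoadXml(responseContent);

        foreach (XmlElement loc in doc.GetElementsByTagName("loc"))
        {
            string url = loc.InnerText.Trim();
            if (url.Length == 0) continue; // skip empty <loc> elements

            string parentName = loc.ParentNode?.LocalName ?? "";
            if (string.Equals(parentName, "sitemap", StringComparison.OrdinalIgnoreCase))
            {
                // <sitemap><loc> points to a child sitemap: load it recursively.
                res.AddRange(await GetSitemapUrls(url));
            }
            else if (string.Equals(parentName, "url", StringComparison.OrdinalIgnoreCase))
            {
                // <url><loc> is a page URL.
                res.Add(url);
            }
        }
        return res;
    }
}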
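
One thing the recursive approach does not guard against is two sitemaps that reference each other, which would recurse without end. The following is a rough sketch, not a definitive implementation, of the same idea with a set of already-visited sitemap addresses. It uses LINQ to XML rather than XmlDocument, and the download step is passed in as a delegate so the traversal can be tried without network access. SitemapCrawler, CollectUrlsAsync, CollectInto and fetchXml are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Xml.Linq;

public static class SitemapCrawler
{
    private static readonly XNamespace Ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // fetchXml is an assumed delegate that returns the XML text for a sitemap address,
    // for example: url => httpClient.GetStringAsync(url)
    public static Task<List<string>> CollectUrlsAsync(
        string sitemapUrl, Func<string, Task<string>> fetchXml)
    {
        return CollectInto(sitemapUrl, fetchXml, new HashSet<string>());
    }

    private static async Task<List<string>> CollectInto(
        string sitemapUrl, Func<string, Task<string>> fetchXml, HashSet<string> visited)
    {
        var result = new List<string>();

        // HashSet.Add returns false when the address was already processed,
        // which stops sitemaps that reference each other from looping forever.
        if (!visited.Add(sitemapUrl))
        {
            return result;
        }

        XElement root = XElement.Parse(await fetchXml(sitemapUrl));

        foreach (XElement loc in root.Descendants(Ns + "loc"))
        {
            string value = loc.Value.Trim();
            if (value.Length == 0)
            {
                continue;
            }

            string parentName = loc.Parent?.Name.LocalName ?? "";
            if (parentName == "sitemap")
            {
                // Child sitemap: descend into it and append whatever it yields.
                result.AddRange(await CollectInto(value, fetchXml, visited));
            }
            else if (parentName == "url")
            {
                // Page URL: keep it.
                result.Add(value);
            }
        }

        return result;
    }
}
```

With a shared HttpClient instance, a call could look like `await SitemapCrawler.CollectUrlsAsync("<your sitemap URL>", url => httpClient.GetStringAsync(url))`. Passing the fetch step in as a delegate is a design choice made for this sketch so the traversal can be exercised against in-memory strings.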
