If I have a sitemap XML file that contains URLs or other sitemaps, what is the best C# code to get all of the website's URLs?
Also, do I need to check the file's schema while iterating?
The sitemap looks like this:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" generatedBy="ANY">
<sitemap>
<loc>https://www.example.com/forum-categories-sitemap.xml</loc>
<lastmod>2022-11-04</lastmod>
</sitemap>
<sitemap>
<loc>https://www.example.com/forum-posts-sitemap.xml</loc>
<lastmod>2022-11-04</lastmod>
</sitemap>
<sitemap>
<loc>https://www.example.com/blog-posts-sitemap.xml</loc>
<lastmod>2022-11-04</lastmod>
</sitemap>
<sitemap>
<loc>https://www.example.com/blog-categories-sitemap.xml</loc>
<lastmod>2022-11-04</lastmod>
</sitemap>
<sitemap>
<loc>https://www.example.com/pages-sitemap.xml</loc>
<lastmod>2022-11-03</lastmod>
</sitemap>
<url>
<loc>https://www.example.com/post/somelink</loc>
<lastmod>2022-09-15</lastmod>
</url>
<url>
<loc>https://www.example.com/post/otherlink</loc>
<lastmod>2022-09-05</lastmod>
</url>
</sitemapindex>
CodePudding user response:
You have a couple of options, depending on your use case:
- Use XmlDocument.
XmlDocument doc = new XmlDocument();
doc.Load("<your site map URL>"); // or doc.LoadXml("xml sitemap string")
var urls = doc.GetElementsByTagName("loc");
foreach (XmlElement loc in urls)
{
    Console.WriteLine(loc.InnerText);
}
- Use XElement and LINQ to XML (you must include the sitemap XML namespace in the query for it to work).
XElement root = XElement.Load("<Your sitemap URL>"); // or XElement.Parse("xml sitemap string")
XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
// XNamespace + string yields the namespace-qualified XName for <loc>.
var urls = from urlTag in root.Descendants(ns + "loc")
           select urlTag.Value;
foreach (var loc in urls)
{
    Console.WriteLine(loc);
}
Hope this helps.
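Both options above return every <loc> value, so child sitemap links and page URLs end up in the same list. As a minimal sketch, assuming the XML is already in a string, the two kinds of entry can be told apart by the parent element of each <loc>; no separate schema validation is needed for that. The names SitemapParser and Split are hypothetical, chosen only for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SitemapParser
{
    private static readonly XNamespace Ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Returns the <loc> values of child sitemaps and of page URLs as two separate lists.
    public static (List<string> Sitemaps, List<string> Pages) Split(string xml)
    {
        XElement root = XElement.Parse(xml);

        // Collects the trimmed <loc> text found directly under every <parent> element.
        List<string> LocsUnder(string parent) =>
            root.Descendants(Ns + parent)
                .Elements(Ns + "loc")
                .Select(e => e.Value.Trim())
                .Where(v => v.Length > 0)
                .ToList();

        return (LocsUnder("sitemap"), LocsUnder("url"));
    }
}
```

Calling SitemapParser.Split(xml) on the sample from the question should put the five *-sitemap.xml links in Sitemaps and the two /post/ links in Pages.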
CodePudding user response:
I solved the problem by using HttpClient to load the sitemap instead of XElement.Load; if a node points to another sitemap, the code loads that child sitemap recursively. I was looking for an easier way, but this is the best I could do.
// Requires: using System; using System.Collections.Generic; using System.Net.Http;
//           using System.Threading.Tasks; using System.Xml;

// Reuse one HttpClient; creating a new one per request can exhaust sockets.
private static readonly HttpClient httpClient = CreateClient();

private static HttpClient CreateClient()
{
    var client = new HttpClient(new HttpClientHandler { AllowAutoRedirect = false });
    client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36");
    return client;
}

public static async Task<HttpResponseMessage> getresponseMessage(string link)
{
    return await httpClient.GetAsync(link);
}

public static async Task<List<string>> getsitemapurls(string link)
{
    List<string> res = new List<string>();
    using var responseMessage = await getresponseMessage(link);
    // With redirects disabled, a 3xx or error response has no sitemap XML to parse.
    if (!responseMessage.IsSuccessStatusCode)
        return res;
    string responseContent = await responseMessage.Content.ReadAsStringAsync();
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(responseContent);
    foreach (XmlElement loc in doc.GetElementsByTagName("loc"))
    {
        // LocalName ignores any namespace prefix on the parent element.
        string parentNodeName = loc.ParentNode!.LocalName;
        string target = loc.InnerText.Trim();
        if (target.Length == 0)
            continue;
        if (parentNodeName.Equals("sitemap", StringComparison.OrdinalIgnoreCase))
        {
            // A <sitemap> entry points to a child sitemap: load it recursively.
            res.AddRange(await getsitemapurls(target));
        }
        else if (parentNodeName.Equals("url", StringComparison.OrdinalIgnoreCase))
        {
            // A <url> entry is a page URL.
            res.Add(target);
        }
    }
    return res;
}
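One thing the recursive approach does not guard against is two sitemaps that reference each other, which would recurse forever. The following is only a sketch of one way to handle that, assuming a set of already-visited sitemap URLs and a queue instead of recursion. The names SitemapCrawler and CollectAsync are hypothetical, and the download step is passed in as a delegate so that the traversal can be tried without network access.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Xml.Linq;

public static class SitemapCrawler
{
    private static readonly XNamespace Ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Walks the start sitemap and all child sitemaps, returning every page URL found.
    // 'fetch' downloads a sitemap and returns its XML text (an assumption of this sketch).
    public static async Task<List<string>> CollectAsync(string start, Func<string, Task<string>> fetch)
    {
        var pages = new List<string>();
        var visited = new HashSet<string>();   // sitemap URLs already processed
        var pending = new Queue<string>();     // sitemap URLs still to process
        pending.Enqueue(start);

        while (pending.Count > 0)
        {
            string current = pending.Dequeue();
            if (!visited.Add(current))
                continue; // already processed, so a cycle cannot loop forever

            XElement root = XElement.Parse(await fetch(current));
            foreach (XElement loc in root.Descendants(Ns + "loc"))
            {
                string value = loc.Value.Trim();
                if (value.Length == 0)
                    continue;

                string parent = loc.Parent?.Name.LocalName ?? "";
                if (parent == "sitemap")
                    pending.Enqueue(value);   // child sitemap: process later
                else if (parent == "url")
                    pages.Add(value);         // page URL: collect
            }
        }
        return pages;
    }
}
```

With a shared HttpClient instance, the delegate could be something like url => httpClient.GetStringAsync(url); in a test it can be a lookup into an in-memory dictionary of XML strings.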