Why does a Wikipedia API query return HTML content instead of XML?

Time:12-01

I tried to use the Wikipedia API in my app to display information about whatever is typed into a TextBox.
When the Button is clicked, the API should return XML based on the text in the TextBox.
However, WebClient's DownloadString() method returns HTML content instead of XML. Why is this happening?
When I open the same URL in a WebBrowser, it loads and displays correctly.

Here's my code:

private void button1_Click_1(object sender, EventArgs e)
{
    var webclient = new WebClient();
    var pageSourceCode = webclient.DownloadString("http://id.wikipedia.org/w/api.php?Format=xml&action=query&prop=extracts&titles=" + textBox1.Text + "&redirects=true");

    var doc = new XmlDocument();
    doc.LoadXml(pageSourceCode);

    //This line causes an exception, because it's HTML
    var fnode = doc.GetElementsByTagName("extract")[0];

    try
    {
        string ss = fnode.InnerText;
        Regex regex = new Regex("\\<[^\\>]*\\>");
        string.Format("Before: {0}", ss);
        ss = regex.Replace(ss, string.Empty);
        string result = string.Format(ss);
        richTextBox1.Text = result;
    }
    catch (Exception)
    {
        richTextBox1.Text = "error";
    }
}

I cannot figure out why the content is HTML.

CodePudding user response:

The parameter names in the query are case-sensitive.

https://id.wikipedia.org/w/api.php?
  Format=xml&action=query&prop=extracts&titles=" + textBox1.Text + "&redirects=true"

Here, the format parameter is written with a capital letter (Format), so the API doesn't recognize it and behaves as if no format had been specified.
In that case, what you get back is an HTML page that describes the result of a successful query, notes that no format was specified, and includes a sample result in JSON format.
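
To make the point concrete, here is a minimal sketch of building the query with lowercase parameter names. The class and method names (WikiQuery, BuildExtractUrl) are hypothetical, made up for illustration. Escaping the title with Uri.EscapeDataString is an extra precaution that is not in the original code, so that titles containing spaces or characters like & don't break the query string.

```csharp
using System;

static class WikiQuery
{
    // Hypothetical helper, not part of any library.
    // Note the lowercase "format": the API ignores "Format".
    public static string BuildExtractUrl(string title)
    {
        return "https://id.wikipedia.org/w/api.php"
            + "?format=xml&action=query&prop=extracts"
            + "&titles=" + Uri.EscapeDataString(title) // assumption: escape user input
            + "&redirects=true";
    }
}
```

In the click handler this would be called as WikiQuery.BuildExtractUrl(textBox1.Text) and the result passed to DownloadString().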

The protocol should also be set to https.

Also, note that WebClient is disposable, so declare it with a using statement:

string pageSourceCode = string.Empty;

using (var client = new WebClient()) {
    pageSourceCode = client.DownloadString(
        "https://id.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&titles=" +
        textBox1.Text +
        "&redirects=true");
}

if (string.IsNullOrEmpty(pageSourceCode)) return;

// The rest

I'd suggest replacing WebClient with HttpClient as soon as possible, since the former has been marked as obsolete in more recent versions of .NET (.NET 6 and later).

With this class, your code could look like this:
(Note the async / await keywords. An HttpClient object is created once and reused many times.)

using System.Net.Http;

// Simplified but functional declaration 
private static readonly HttpClient client = new HttpClient();
private async void button1_Click(object sender, EventArgs e)
{
    string pageSourceCode = await client.GetStringAsync("[The query]");

    if (string.IsNullOrEmpty(pageSourceCode)) return;

    // The rest
}
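
As for "the rest", here is a rough sketch of the parsing step, assuming the response really is XML and contains an extract element, as the code in the question expects. The names ExtractParser and GetPlainExtract are hypothetical. It keeps the question's approach of stripping tags with a regex, but checks that the element exists instead of relying on a catch-all exception handler.

```csharp
using System.Text.RegularExpressions;
using System.Xml;

static class ExtractParser
{
    // Hypothetical helper: returns the text of the first <extract>
    // element with markup removed, or null if there is no such element.
    public static string GetPlainExtract(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml); // throws XmlException if the content is not well-formed XML

        XmlNodeList nodes = doc.GetElementsByTagName("extract");
        if (nodes.Count == 0) return null;

        // Same idea as the regex in the question: drop anything that looks like a tag.
        return Regex.Replace(nodes[0].InnerText, "<[^>]*>", string.Empty);
    }
}
```

In the handler you could then write richTextBox1.Text = ExtractParser.GetPlainExtract(pageSourceCode) ?? "error";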