Searching XML document for a list of values

Time:10-24

I have an XML file containing around 7000 nodes (one node per line, no nested nodes), and each node has around 15 to 20 attributes holding decimal values. The file size is around 3 to 4 MB. Each node's symbol attribute has a unique value.

The goal is to search the nodes by matching a 'symbol' attribute.

I have the method listed below, which takes a set of symbols as input (symbolList). It loads the XML file from disk into an XPathDocument, searches for each symbol in a foreach loop, and returns the results as a dictionary. The number of input symbols is not fixed; it can be 10, 100, and so on.

Questions:

(1) What is an efficient alternative that searches for all symbols in one shot, removing the loop that searches for one symbol at a time?

I am not happy with the efficiency of the code below. In the loop, XPathNavigator executes a search for one symbol at a time, retrieves the matching node, reads the attribute values, and adds them to the collection. I want to remove this per-symbol loop.

I thought about building a single XPath query that combines all symbols with 'or' conditions, but with 100 or so symbols to search, that becomes a very large query. Is there a better solution to minimize the number of scans?

(2) How can I benefit from XPath query "compilation" for this dynamic search?

I can compile an XPath query into an XPathExpression, but that only helps when the XPath stays the same across multiple scans, and I did not find a way to feed a search parameter value into a compiled query. Is there a way, or an example, of using an XSLT template (as a string) with parameters?

(3) Are there any other suggestions to reduce CPU cycles and make this code run faster? I am not saying this code is slow, but I want to make it as fast as possible.

Xml Document Sample:

<?xml version="1.0" encoding="utf-8"?>
<items>
  <item symbol="ABC" val1="46.21717" val2="152.39" val3="158.121" />
  <item symbol="CJKM" val1="51.21659" val2="49.8" val3="57.57" />
  <item symbol="FWML" val1="67.99509" val2="9.75" val3="9.84" />
  <item symbol="JSHR" val1="48.67459" val2="2.27" val3="2.9" />
  <item symbol="DIBG" val1="53.60444" val2="26.04" val3="28" />
  <item symbol="GHLH" val1="42.31754" val2="0.1016" val3="0.1192" />
  <item symbol="ICWE" val1="58.39788" val2="3.855" val3="3.99" />
  <item symbol="LPVN" val1="47.03581" val2="19.22" val3="20.15" />
  <item symbol="MCAT" val1="57.83422" val2="23.0969" val3="26.59" />
  <item symbol="ZYXI" val1="54.94584" val2="11.6784" val3="12.9" />
</items>

C# Code:

using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

namespace Library
{
    public class Info
    {
        public float Val1 { get; set; }
        public float Val2 { get; set; }
        public float Val3 { get; set; }
    }

    public class Technical
    {    
        public Dictionary<string, Info> SearchForSymbols(HashSet<string> symbolList)
        {
            Dictionary<string, Info> dictSearchResult = new Dictionary<string, Info>();

            if (symbolList.Count == 0)
            {
                return dictSearchResult;
            }

            FileInfo fileInfo = new FileInfo(Path.Combine(CommonObjects.Constant.SettingsFolderPath, CommonObjects.Constant.TechnicalFileName));

            XPathDocument document = new XPathDocument(fileInfo.FullName);
            XPathNavigator navigator = document.CreateNavigator();

            string symbol;
            float val1;
            float val2;
            float val3;

            XPathNavigator node;
            Info info;

            foreach (string item in symbolList)
            {
                XPathExpression expression = navigator.Compile($"//items/item[@symbol='{item}']");

                node = navigator.SelectSingleNode(expression);

                if (node == null)
                {
                    continue;
                }

                symbol = node.GetAttribute("symbol", "");

                if (string.IsNullOrEmpty(symbol))
                {
                    continue;
                }

                info = new Info();

                // Get Value 1
                if (float.TryParse(node.GetAttribute("val1", ""), out val1))
                {
                    info.Val1 = val1;
                }

                // Get Value 2
                if (float.TryParse(node.GetAttribute("val2", ""), out val2))
                {
                    info.Val2 = val2;
                }

                // Get Value 3
                if (float.TryParse(node.GetAttribute("val3", ""), out val3))
                {
                    info.Val3 = val3;
                }

                if (!dictSearchResult.ContainsKey(symbol))
                {
                    dictSearchResult.Add(symbol, info);
                }
            }

            return dictSearchResult;
        }
    }
}

CodePudding user response:

This is basically a join query, and your algorithm is a nested loop join, which is definitely not ideal. The fact that you are recompiling the XPath expression within the loop exacerbates the problem.

With 100 symbols to search, it would be best to put the symbols in some kind of lookup-structure (e.g. a HashSet), and do a single scan over the input testing each item to see if its key is present in the HashSet.
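As a rough illustration of the single-scan idea, here is a minimal sketch that streams the file once with XmlReader and does one HashSet lookup per item. XmlReader is a swapped-in technique (the question uses XPathDocument), and the names StreamingSearch and ParseOrZero are hypothetical; the Info class is repeated only so the snippet stands alone. It assumes the flat items/item layout shown in the question.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Xml;

public class Info
{
    public float Val1 { get; set; }
    public float Val2 { get; set; }
    public float Val3 { get; set; }
}

public static class StreamingSearch   // hypothetical name
{
    // One forward pass over the XML; each <item> costs one hash lookup.
    public static Dictionary<string, Info> Search(TextReader xml, HashSet<string> symbolList)
    {
        var found = new Dictionary<string, Info>();
        using (XmlReader reader = XmlReader.Create(xml))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.LocalName != "item")
                    continue;

                string symbol = reader.GetAttribute("symbol");
                if (symbol == null || !symbolList.Contains(symbol) || found.ContainsKey(symbol))
                    continue;

                found.Add(symbol, new Info
                {
                    Val1 = ParseOrZero(reader.GetAttribute("val1")),
                    Val2 = ParseOrZero(reader.GetAttribute("val2")),
                    Val3 = ParseOrZero(reader.GetAttribute("val3"))
                });

                // Symbols are unique in the file, so stop once every requested one is found.
                if (found.Count == symbolList.Count)
                    break;
            }
        }
        return found;
    }

    // Invariant culture keeps "46.21717" parsing the same on every machine.
    private static float ParseOrZero(string text) =>
        float.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out float value)
            ? value
            : 0f;
}
```

It could be called as StreamingSearch.Search(File.OpenText(path), symbolList). Because nothing is loaded into a document tree, the cost is one pass over the file regardless of how many symbols are requested; whether that is actually faster for a 3 to 4 MB file is worth measuring rather than assuming.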

Ideally you don't want to be splitting the logic over two different languages (C# and XPath). I don't know how high the call overhead is with the Microsoft technology but it's bound to be appreciable.

With XPath 3.1 (available in SaxonCS) you can do this all in a single XPath expression:

let $symbols := map{"AAA":1, "BBB":1, "CCC":1}
return //items/item[map:contains($symbols, @symbol)] 
   ! map{ "symbol": string(@symbol),
          "val1": number(@val1),
          "val2": number(@val2),
          "val3": number(@val3) }

and then there's a single call from C# to XPath, with some manipulation in the API layer to supply the input list of symbols and extract the output.

However, if you want to stick with your current technology there are still considerable improvements possible by doing a single scan of the data and testing each item against the dictionary of required symbols.
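A minimal sketch of what that might look like with the XPathDocument and XPathNavigator types already used in the question: the expression is compiled once because it no longer depends on the symbol, and each item is tested against the HashSet. The class name NavigatorSearch and the helper ParseOrZero are hypothetical, and Info is repeated only to keep the snippet self-contained.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Xml.XPath;

public class Info
{
    public float Val1 { get; set; }
    public float Val2 { get; set; }
    public float Val3 { get; set; }
}

public static class NavigatorSearch   // hypothetical name
{
    // Compiled once and reused: the query is the same for every search.
    private static readonly XPathExpression AllItems = XPathExpression.Compile("/items/item");

    public static Dictionary<string, Info> Search(XPathDocument document, HashSet<string> symbolList)
    {
        var found = new Dictionary<string, Info>();
        XPathNodeIterator items = document.CreateNavigator().Select(AllItems);

        // Single scan over all items; the "join" is a hash lookup, not a nested loop.
        while (items.MoveNext())
        {
            XPathNavigator node = items.Current;
            string symbol = node.GetAttribute("symbol", "");
            if (!symbolList.Contains(symbol) || found.ContainsKey(symbol))
                continue;

            found.Add(symbol, new Info
            {
                Val1 = ParseOrZero(node.GetAttribute("val1", "")),
                Val2 = ParseOrZero(node.GetAttribute("val2", "")),
                Val3 = ParseOrZero(node.GetAttribute("val3", ""))
            });
        }
        return found;
    }

    private static float ParseOrZero(string text) =>
        float.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out float value)
            ? value
            : 0f;
}
```

If the file changes rarely, the XPathDocument (or the finished dictionary) could also be cached between calls so the file is not re-read from disk on every search; that is a design choice beyond what the question specifies.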

CodePudding user response:

This may be a totally different approach to your problem, but have you tried loading your XML document into a DataSet? Something like this:

 Dictionary<string, Info> dictSearchResult = new Dictionary<string, Info>();
 DataSet ds = new DataSet();
 ds.ReadXml(filePath);
 DataTable dt = ds.Tables[0];

Your DataTable will have one row per item, with a column for each attribute (symbol, val1, val2, val3).

Then search for your symbols using LINQ, something like this:

 IEnumerable<DataRow> searchedSymbolsRows = from r in dt.AsEnumerable()
                       join s in symbolList 
                       on r.Field<string>("symbol") equals s
                       select r;

Then you can do whatever you want with the resulting DataRow collection, for example:

    foreach (DataRow item in searchedSymbolsRows)
    {
        string symbol = item[0].ToString();
        float val1;
        float val2;
        float val3;
        var symbolInfo = new Info
        {
            Val1 = float.TryParse(item[1].ToString(), NumberStyles.Any, CultureInfo.InvariantCulture, out val1) ? val1 : 0,
            Val2 = float.TryParse(item[2].ToString(), NumberStyles.Any, CultureInfo.InvariantCulture, out val2) ? val2 : 0,
            Val3 = float.TryParse(item[3].ToString(), NumberStyles.Any, CultureInfo.InvariantCulture, out val3) ? val3 : 0,
        };

        dictSearchResult.Add(symbol,symbolInfo);

        Console.WriteLine($"Symbol {symbol} added to dictionary");

    }

Note: I don't know whether this method performs better than your current implementation, but it is another option.

CodePudding user response:

My preference is to use LINQ to XML with a dictionary:

using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Text;
using System.Collections;
using System.Collections.Generic;

namespace ConsoleApp2
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";
        static void Main(string[] args)
        {
            XDocument doc = XDocument.Load(FILENAME);
            Dictionary<string, Info> dict = doc.Descendants("item")
                .Select(x => new Info() { symbol = (string)x.Attribute("symbol"), Val1 = (float)x.Attribute("val1"), Val2 = (float)x.Attribute("val2"), Val3 = (float)x.Attribute("val3") })
                .GroupBy(x => x.symbol, y => y)
                .ToDictionary(x => x.Key, y => y.FirstOrDefault());
        }
 
    }

    public class Info
    {
        public string symbol { get; set; }
        public float Val1 { get; set; }
        public float Val2 { get; set; }
        public float Val3 { get; set; }
    }
}
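The code above builds a dictionary of every item in the file. To return only the requested symbols, a filter against the input set could be added before projecting. The following is a sketch under that assumption; LinqSearch is a hypothetical name, and the nullable casts are an addition so that a missing attribute yields 0 instead of an exception.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public class Info
{
    public string symbol { get; set; }
    public float Val1 { get; set; }
    public float Val2 { get; set; }
    public float Val3 { get; set; }
}

public static class LinqSearch   // hypothetical name
{
    public static Dictionary<string, Info> Search(XDocument doc, HashSet<string> symbolList)
    {
        return doc.Descendants("item")
            // Keep only items whose symbol is in the requested set (one hash lookup each).
            .Where(x => symbolList.Contains((string)x.Attribute("symbol") ?? ""))
            .Select(x => new Info
            {
                symbol = (string)x.Attribute("symbol"),
                // (float?) returns null for a missing attribute instead of throwing.
                Val1 = (float?)x.Attribute("val1") ?? 0f,
                Val2 = (float?)x.Attribute("val2") ?? 0f,
                Val3 = (float?)x.Attribute("val3") ?? 0f
            })
            // Guard against duplicate symbols by keeping the first occurrence.
            .GroupBy(x => x.symbol)
            .ToDictionary(g => g.Key, g => g.First());
    }
}
```

For example, LinqSearch.Search(XDocument.Load(FILENAME), symbolList) would return at most one entry per requested symbol.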