Is it possible to find the common part of the strings in a string list?

Time:05-10

I am trying to find the common part of the strings in a list of strings. Take this sample data set:

private readonly List<string> Xpath = new List<string>()
{   
    "BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(1)>H2:nth-of-type(1)",
    "BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(2)>H2:nth-of-type(1)",
    "BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(3)>H2:nth-of-type(1)",
    "BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(4)>H2:nth-of-type(1)",
    "BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(5)>H2:nth-of-type(1)",
    "BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(6)>H2:nth-of-type(1)",
    "BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(7)>H2:nth-of-type(1)",
    "BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(8)>H2:nth-of-type(1)",
    "BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(9)>H2:nth-of-type(1)"
};

From this, I want to find the common ancestor path that all of these entries share. The data is an XPath list, and I should be able to determine the result programmatically.

Expected output:

BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV

To get this, I split each item by > and built a list of segments for each entry in the original data set.

Then I used this method to find the items common to all of the lists:

private IEnumerable<T> GetCommonItems<T>(IEnumerable<T>[] lists)
{
    HashSet<T> hs = new HashSet<T>(lists.First());
    for (int i = 1; i < lists.Length; i++)
    {
        hs.IntersectWith(lists[i]);
    }
    return hs;
}

This finds the common values, from which I rebuild the path. The problem is that a HashSet keeps only distinct values: if a segment such as DIV appears in several places in every original entry, this method still picks it up only once.

So I get something like this:

BODY>MAIN:nth-of-type(1)>DIV>SECTION

But I need this

BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV
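
The loss of repeated segments described above can be reproduced in isolation. The following is a minimal sketch, not the original code: the class name IntersectionDemo and the method JoinCommonSegments are made up for illustration, and the input is assumed to be non-empty. It applies the same set intersection as GetCommonItems and joins what is left.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class IntersectionDemo
{
    // Same idea as GetCommonItems: intersect the segments of all paths as sets.
    // Assumes paths contains at least one entry.
    public static string JoinCommonSegments(IEnumerable<string> paths)
    {
        var lists = paths.Select(p => p.Split('>')).ToArray();
        var common = new HashSet<string>(lists.First());
        foreach (var list in lists.Skip(1))
        {
            common.IntersectWith(list);
        }
        // A set stores each distinct segment once, so repeated DIVs collapse
        // into a single entry, and a set does not promise any ordering.
        return string.Join(">", common);
    }
}
```

For the two paths "BODY>DIV>SECTION>DIV>H2" and "BODY>DIV>SECTION>DIV>P", the result contains only three segments (BODY, DIV and SECTION), although the shared prefix has four. Any approach built on set intersection has this limitation, because a set records membership but not position or multiplicity.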

CodePudding user response:

Disclaimer: This is not the most performant solution but it works :)

  • Let's start with splitting the first path by the > character
  • Do the same with all the paths
char separator = '>';
IEnumerable<string> firstPathChunks = Xpath[0].Split(separator);
var chunks = Xpath.Select(path => path.Split(separator).ToList()).ToArray();
  • Iterate through the firstPathChunks
    • Iterate through the chunks
    • if the first element of a list matches the current chunk, remove it
    • once the first element has been removed from every list, append the matched chunk to sb
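void Process(StringBuilder sb)
{
    foreach (var pathChunk in firstPathChunks)
    {
        foreach (var chunk in chunks)
        {
            // Stop at the first path that is exhausted or differs at this position.
            if (chunk.Count == 0 || chunk[0] != pathChunk)
            {
                return;
            }
            chunk.RemoveAt(0);
        }
        // Every path matched this chunk, so it belongs to the common prefix.
        sb.Append(pathChunk + separator);
    }
}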

Sample usage

var sb = new StringBuilder();
Process(sb);
Console.WriteLine(sb.ToString());

Output

BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>
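
The output above ends with a trailing >, while the expected output in the question does not. One possible way to drop it is sketched below; the class and method names are invented for this example and are not part of the answer.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class PrefixFormatting
{
    // Removes any trailing separator characters from the end of the prefix.
    public static string TrimTrailingSeparator(string prefix, char separator = '>')
    {
        return prefix.TrimEnd(separator);
    }
}
```

With the sample usage above, this would be called as PrefixFormatting.TrimTrailingSeparator(sb.ToString()) before printing.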

CodePudding user response:

Splitting the string by the separator ">" is a good idea. But instead of then creating a list of unique items, you should create a list of all the items contained in the string, in order, which would result in

{
    "BODY",
    "MAIN:nth-of-type(1)",
    "DIV",
    "SECTION",
    "DIV",
    ...
}

for the first entry of your XPath list.

This way you create a List<List<string>> containing the elements of every entry of your XPath list. You can then compare the first elements of all the inner lists. If they are equal, save that element's value to your output and proceed with the second elements, and so on, until you reach a position where the elements are not equal across all inner lists.
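
The position-by-position comparison described above could look roughly like this. It is a sketch under a few assumptions rather than a definitive implementation: the names PathPrefix and CommonPrefix are invented here, arrays are used instead of List<List<string>> for brevity, and an empty input is assumed to yield an empty string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class PathPrefix
{
    // Returns the leading segments shared by every path, joined by the separator.
    public static string CommonPrefix(IReadOnlyList<string> paths, char separator = '>')
    {
        if (paths.Count == 0)
        {
            return string.Empty; // assumption: no paths means no common prefix
        }

        // One inner array of segments per path.
        string[][] segments = paths.Select(p => p.Split(separator)).ToArray();
        // The common prefix cannot be longer than the shortest path.
        int shortest = segments.Min(s => s.Length);

        var common = new List<string>();
        for (int i = 0; i < shortest; i++)
        {
            string candidate = segments[0][i];
            // Stop at the first position where any path differs.
            if (segments.Any(s => s[i] != candidate))
            {
                break;
            }
            common.Add(candidate);
        }
        return string.Join(separator.ToString(), common);
    }
}
```

Called as PathPrefix.CommonPrefix(Xpath) on the list from the question, this should stop at the SECTION:nth-of-type(...) segment, where the entries first differ, and return the path up to the third consecutive DIV without a trailing separator. Unlike the set-based approach, it keeps repeated segments because it compares by position and leaves the input lists unmodified.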
