Remove words if all of them are in a stop words list

I have an array of strings, each of which can contain one or more words. Removing a single word is easy, but I can't figure out how to remove a multi-word entry only when all of its words are in the stop words list. I would prefer to solve it with LINQ.

Imagine I have this array of strings:

then use 
then he
the image
and the
should be in
should be written

I want to get only

then use 
the image
should be written

So, lines whose words are all in the stop words list should be removed, while lines that mix stop words with other words should be kept.

My stop words array:

string[] stopWords = {"a", "an", "x", "y", "z", "this", "the", "me", "you", "our", "we", "I", "them", "then", "ours", "more", "will", "he", "she", "should", "be", "at", "on", "in", "has", "have", "and"};

Thank you,

CodePudding user response：

One way to solve this problem would be to do the following:

string[] stopWords = { "a", "an", "x", "y", "z", "this", "the", "me", "you", "our", "we", "I", "them", "ours", "more", "will", "he", "she", "should", "be", "at", "on", "in", "has", "have", "and" };

string input = """"
            then use 
            then he
            the image
            and the
            should be in
            should be written
            """";

var array = input.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

var filteredArray = array.Where(x => x.Split(' ', StringSplitOptions.RemoveEmptyEntries).Any(y => !stopWords.Contains(y))).ToList();
var result = string.Join(Environment.NewLine, filteredArray);

Console.WriteLine(result);

The first two statements just set up the data.

The third statement converts the string into an array of lines by splitting on the newline characters. (Using Environment.NewLine lets the code work on Linux as well as Windows.)

The fourth statement splits each line on spaces to get the individual words, then checks whether any word is missing from the stopWords list. If at least one word is not a stop word, the Where condition is satisfied and the whole line is included in filteredArray.

The fifth statement joins the remaining lines with newlines to form the final result string.

The result should look like this:

then use
then he
the image
should be written

Note that the stopWords list used here contains the word them but not then, so the second result line (then he) is not removed. With "then" added to the list, that line is filtered out as well.
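
For reference, here is a minimal sketch of the same filter with "then" included in the stop words, which should reproduce the output requested in the question. The HashSet, the case-insensitive comparer, and the helper name KeepLine are assumptions made for illustration, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stop words from the question, including "then". The case-insensitive
// HashSet is an assumption made here for quick, case-agnostic lookups.
var stopWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "a", "an", "x", "y", "z", "this", "the", "me", "you", "our", "we", "I",
    "them", "then", "ours", "more", "will", "he", "she", "should", "be",
    "at", "on", "in", "has", "have", "and"
};

string[] lines = { "then use ", "then he", "the image", "and the", "should be in", "should be written" };

// Hypothetical helper: keep a line unless every word in it is a stop word.
// RemoveEmptyEntries drops the empty token produced by a trailing space.
bool KeepLine(string line) =>
    !line.Split(' ', StringSplitOptions.RemoveEmptyEntries).All(stopWords.Contains);

var kept = lines.Where(KeepLine).ToList();

Console.WriteLine(string.Join(Environment.NewLine, kept));
```

With this data, only "then use", "the image" and "should be written" remain. Note that a blank line has no words, so All returns true for it and it is dropped too.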

CodePudding user response：

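Use the Intersect method as follows (WordsList is your input array; lines that are not made up entirely of stop words are collected in keptLines):

    var keptLines = new List<string>();
    foreach (string line in WordsList)
    {
        // Split on spaces, dropping empty entries caused by repeated or trailing spaces.
        List<string> splitData = line.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries).ToList();
        // Intersect returns distinct values, so compare against the distinct word count.
        bool allWordsAreStopWords = splitData.Intersect(stopWords).Count() == splitData.Distinct().Count();
        if (!allWordsAreStopWords) keptLines.Add(line);
    }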
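
One thing to keep in mind with this approach: Intersect yields distinct elements, so comparing its count with the raw word count misclassifies a line that repeats a stop word. The sketch below illustrates this and shows Except as an alternative set-based check. The shortened stop words list, the sample line, and the helper names are assumptions for illustration.

```csharp
using System;
using System.Linq;

string[] stopWords = { "the", "and", "in" };

// Hypothetical helpers for illustration.
// Counting check: compares the intersection size with the raw word count.
bool AllStopByCount(string[] words) =>
    words.Intersect(stopWords).Count() == words.Length;

// Set-difference check: no word is left once the stop words are removed.
bool AllStopByExcept(string[] words) =>
    !words.Except(stopWords).Any();

string[] repeated = "the the and".Split(' ', StringSplitOptions.RemoveEmptyEntries);

Console.WriteLine(AllStopByCount(repeated));  // False: Intersect yields 2 distinct words, the line has 3
Console.WriteLine(AllStopByExcept(repeated)); // True: nothing remains after removing stop words
```

The Except form avoids counting altogether, which is why it is unaffected by duplicates.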

CodePudding user response：

According to the initial problem description:

I have an array of strings, each of which can contain one or more words. Removing a single word is easy, but I can't figure out how to remove a multi-word entry only when ALL of its words are in the stop words list. I would prefer to solve it with LINQ.

The following code handles the emphasized requirement.

using System.Text.RegularExpressions;

string[] stopWords = { "a", "an", "x", "y", "z", "this", "the", "me", "you", "our", "we", "I", "them", "ours", "more", "will", "he", "she", "should", "be", "at", "on", "in", "has", "have", "and" };

string[] inputStrings = { "then use", "then he", "the image", "and the", "should be in", "should be written" };

var wordSeparatorPattern = new Regex(@"\s+");

var outputStrings = inputStrings.Where((words) => 
{
    return wordSeparatorPattern.Split(words).Any((word) =>
    {
        return !stopWords.Contains(word);
    });
});


foreach (var item in outputStrings)
{
    Console.WriteLine(item);
}
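
If the input lines can carry leading or trailing whitespace, note that Regex.Split with a whitespace pattern returns empty strings at the ends, and those empty strings are not stop words. The following is a small sketch of that effect, with string.Split and RemoveEmptyEntries as one possible alternative. The shortened stop words list and the padded sample line are assumptions for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string[] stopWords = { "and", "the" };
string padded = " and the ";

// Regex.Split keeps an empty string where a match touches the start or end of the input.
string[] regexWords = new Regex(@"\s+").Split(padded);

// An empty separator array makes string.Split treat white-space characters as delimiters.
string[] cleanWords = padded.Split(Array.Empty<char>(), StringSplitOptions.RemoveEmptyEntries);

Console.WriteLine(regexWords.Length); // 4: "", "and", "the", ""
Console.WriteLine(cleanWords.Length); // 2: "and", "the"

// The empty entries are not stop words, so the padded line would be kept by mistake.
Console.WriteLine(regexWords.Any(w => !stopWords.Contains(w))); // True
Console.WriteLine(cleanWords.Any(w => !stopWords.Contains(w))); // False
```

Trimming each line before calling Regex.Split would be another way to get the same effect.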