Removing conjunctions from a .txt file using words in an array

Time:02-12

I am trying to remove conjunctions and punctuation from a .txt file. The punctuation is removed successfully, but some conjunctions remain. Here is my code:

public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            string words = File.ReadAllText(@"C:\Users\...\Desktop\data_protection_law.txt").ToLower(new CultureInfo("en-US", false));

            string[] punctuation = { ".", "!", "?", "–", "-", "-", "/", "_", ",", ";", ":", "(", ")", "[", "]", "“", "”", "\"", "1", "2", "3", "4", "5", "6", "7", "8", "9" }; 
            string[] con_art = { "the", "a", "an", "for", "and", "or", "nor", "but", "yet", "so", "of", "to", "in", "are", "is", "on", "be", "by", "we", "he", "that", "he", "that", "because", "as", "it", "about", "were", "i", "our", "they", "with", "these", "there", "then", "them" };

            foreach (string s in punctuation)
            {
                words = words.Replace(s, "");
            }

            foreach (string s in con_art)
            {
                words = words.Replace(" " + s + " ", " ");
            }

            richTextBox1.Text = words;
        }
        
    }

I printed the words in a RichTextBox just to be sure. When I compared the output with the original text, I found that some conjunctions were deleted, but not all of them.

[Screenshot: output with the remaining conjunctions]

[Screenshot: original text file]

I'm going crazy. I've been trying to find the mistake myself for days, but I couldn't find it.

So where is my mistake in this code? Btw I'm just a beginner so don't be angry if I made a big mistake :)

CodePudding user response:

I think you'll need to change your search-and-replace approach completely; it would be easiest to use regular expressions here:

var rex = string.Join("|", con_art.Select(w => $@"\b{w}\b"));
words = Regex.Replace(words, rex, "", RegexOptions.IgnoreCase);

The first line of code converts your word list to a string like

\bthe\b|\ba\b|\ban\b|\bfor\b|\band\b|\bor\b|...

To a regular expression engine, \b means "word boundary": a position where a word character (a letter, digit, or underscore) sits next to a non-word character (a space, punctuation, a newline, etc.) or next to the start or end of the text. This effectively makes the search for the, a, an, for, and, etc. work as "whole word only", which is what you were trying to do with your spaces (and which isn't working out, because sometimes your words aren't surrounded by spaces).

The vertical bar | means "or". By supplying a list of "whole word 'the' OR whole word 'a' OR whole word 'an' ..." like this, you don't have to call Replace() over and over again in a loop.
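
Putting the pieces together, here is a minimal sketch of how this could be wrapped up. The class and method names (StopWordRemover, Remove) are made up for illustration and are not from the question. Compared with the two lines above, it escapes each word with Regex.Escape, groups the alternatives inside a single pair of \b anchors, and collapses the leftover double spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the names are assumptions.
public static class StopWordRemover
{
    public static string Remove(string text, IEnumerable<string> stopWords)
    {
        // Escape each word so regex metacharacters are treated literally.
        string alternation = string.Join("|", stopWords.Select(w => Regex.Escape(w)));

        // One pair of \b anchors around a non-capturing group of all words.
        string pattern = $@"\b(?:{alternation})\b";

        string stripped = Regex.Replace(text, pattern, "", RegexOptions.IgnoreCase);

        // Removing a word leaves its surrounding spaces behind; collapse runs of
        // spaces or tabs into a single space while keeping line breaks intact.
        return Regex.Replace(stripped, @"[ \t]{2,}", " ");
    }
}
```

You would call it as words = StopWordRemover.Remove(words, con_art);. One thing to watch for: \b treats an apostrophe as a non-word character, so with "i" in the list, a contraction such as "i'm" would lose its "i". If that matters for your text, you may need to handle contractions separately.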

CodePudding user response:

Because sometimes the word is not surrounded by spaces on both sides.

The ones you fail to replace are all at the beginning or end of a line, which means they have a newline next to them rather than a space.
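
A small sketch that reproduces this, using made-up sample text rather than your file. NaiveRemove mirrors the Replace call from the question. RemoveByTokens is one possible workaround (an assumption on my part, not the only fix): it splits on any whitespace and filters the tokens, at the cost of losing the original line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class; names and sample text are illustrative only.
public static class BoundaryDemo
{
    // The question's approach: matches the word only when a space sits on both sides.
    public static string NaiveRemove(string text, string word)
    {
        return text.Replace(" " + word + " ", " ");
    }

    // Alternative sketch: split on any whitespace, then drop the listed words.
    // Everything is joined back with single spaces, so line breaks are not preserved.
    public static string RemoveByTokens(string text, IEnumerable<string> words)
    {
        var stop = new HashSet<string>(words, StringComparer.OrdinalIgnoreCase);
        var kept = text
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(token => !stop.Contains(token));
        return string.Join(" ", kept);
    }
}
```

With the input "the cat and the dog\nand a bird", NaiveRemove(text, "and") removes the first "and" but leaves the one that follows the newline, while RemoveByTokens(text, new[] { "and", "the", "a" }) returns "cat dog bird".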
