Remove URL from a given string in C#

I tried doing this:

using System;
using System.Collections.Generic;
using System.Text;

namespace UrlsDetector
{
    class UrlDetector
    {
        public static string RemoveUrl(string input)
        {
            var words = input;
            while(words.Contains("https://"))
            {
                string urlToRemove = words.Substring("https://", @" ");
                words = words.Replace("https://" + urlToRemove , @"");
            }
        }
        
    }

    class Program
    {
        static void Main()
        {
            Console.WriteLine(UrlDetector.RemoveUrl(
                "I saw a cat and a horse on https://www.youtube.com/"));

        }
    }
}

But it doesn't work.

What I want to achieve is to remove the entire "https://www.youtube.com/" and display "I saw a cat and a horse on".

I also want to display a message like "the sentence you input doesn't have url" if the sentence doesn't have any URL. As you can see, I didn't put in any code to do that; I just need to fix this code first, but if you want to help me with that too, I would gladly appreciate it.

Thanks for the responses.

CodePudding user response：

If you are looking for a non-regex way to do this, here you go. The method below assumes that a URL begins with "http://" or "https://", which means it will not work with URLs that begin with something like ftp:// or file://, although the code can easily be modified to support those. It also assumes the URL continues until it reaches either the end of the string or a whitespace character (a space, a tab, or a new line). Again, this can easily be modified if your requirements are different.

Also, if the string contains no URL, the result currently has an empty ExtractedUrl and an empty TextWithoutUrl. You can easily modify this too!

using System;

public class Program
{
    public static void Main()
    {
        string str = "I saw a cat and a horse on https://www.youtube.com/";

        UrlExtraction extraction = RemoveUrl(str);
        Console.WriteLine("Original Text: " + extraction.OriginalText);
        Console.WriteLine();
        Console.WriteLine("Url: " + extraction.ExtractedUrl);
        Console.WriteLine("Text: " + extraction.TextWithoutUrl);
    }

    private static UrlExtraction RemoveUrl(string str)
    {       
        if (String.IsNullOrWhiteSpace(str))
        {
            return new UrlExtraction("", "", "");
        }

        int startIndex = str.IndexOf("https://", 
                StringComparison.InvariantCultureIgnoreCase);

        if (startIndex == -1)
        {
            startIndex = str.IndexOf("http://", 
                StringComparison.InvariantCultureIgnoreCase);
        }

        if (startIndex == -1)
        {
            return new UrlExtraction(str, "", "");
        }

        int endIndex = startIndex;
        while (endIndex < str.Length && !IsWhiteSpace(str[endIndex])) 
        {           
            endIndex++;
        }

        return new UrlExtraction(str, str.Substring(startIndex, endIndex - startIndex), 
            str.Remove(startIndex, endIndex - startIndex));
    }

    private static bool IsWhiteSpace(char c)
    {
        return 
            c == '\n' || 
            c == '\r' || 
            c == ' ' || 
            c == '\t';
    }

    private class UrlExtraction
    {
        public string ExtractedUrl {get; set;}
        public string TextWithoutUrl {get; set;}
        public string OriginalText {get; set;}

        public UrlExtraction(string originalText, string extractedUrl, 
            string textWithoutUrl)
        {
            OriginalText = originalText;
            ExtractedUrl = extractedUrl;
            TextWithoutUrl = textWithoutUrl;
        }
    }
}
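
The modifications mentioned above (more schemes, and returning the text unchanged when there is no URL) could be sketched roughly as follows. This is only an illustration, not part of the original answer: the class name UrlRemover, the method RemoveFirstUrl, the Schemes list and the out parameter are all assumptions made for the example.

```csharp
using System;

public static class UrlRemover
{
    // Hypothetical list of schemes to recognise; extend it as needed.
    private static readonly string[] Schemes = { "https://", "http://", "ftp://", "file://" };

    // Returns the text with the first URL removed, or the input unchanged
    // when no URL is found. `found` reports whether a URL was present.
    public static string RemoveFirstUrl(string text, out bool found)
    {
        found = false;
        if (string.IsNullOrWhiteSpace(text)) return text ?? "";

        // Find the earliest occurrence of any known scheme.
        int start = -1;
        foreach (string scheme in Schemes)
        {
            int i = text.IndexOf(scheme, StringComparison.OrdinalIgnoreCase);
            if (i != -1 && (start == -1 || i < start)) start = i;
        }
        if (start == -1) return text;

        // The URL runs until the next whitespace character or the end of the string.
        int end = start;
        while (end < text.Length && !char.IsWhiteSpace(text[end])) end++;

        found = true;
        return text.Remove(start, end - start);
    }
}
```

The caller can check found to decide whether to print a "no URL" message. Like the code above, this removes only the first URL and leaves the surrounding spaces as they were.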

CodePudding user response：

A simplified version of what you're doing. Instead of using Substring or IndexOf, I split the input into a list of strings and remove the items that start with a URL scheme. I iterate over the list in reverse because removing an item while looping forward would skip the element that follows it.

    public static string RemoveUrl(string input)
    {
        // Needs System.Linq (ToList) and System.Collections.Generic (List<T>).
        List<string> words = input.Split(' ').ToList();
        for (int i = words.Count - 1; i >= 0; i--) 
        {
            if (words[i].StartsWith("https://")) words.RemoveAt(i);
        }
        return string.Join(" ", words);
    }

This method's advantage is that it avoids the Substring and Replace methods, which create a new string each time they are called. In a loop, this excessive string manipulation can put pressure on the garbage collector and bloat the managed heap. A Split and Join has a lower performance cost in comparison, especially when used in a loop like this with a lot of data.
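
As a variation on the same split-and-join idea, the reverse loop can be replaced by a LINQ filter. The sketch below is only one possible way to write it; the class name SplitJoinExample and the method name RemoveUrlWords are made up for illustration, and matching "http://" as well as "https://" is an added assumption.

```csharp
using System;
using System.Linq;

public static class SplitJoinExample
{
    public static string RemoveUrlWords(string input)
    {
        // Keep every word that does not start with an http or https scheme.
        var kept = input.Split(' ')
            .Where(w => !w.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
                     && !w.StartsWith("https://", StringComparison.OrdinalIgnoreCase));

        // Join the remaining words back together with single spaces.
        return string.Join(" ", kept);
    }
}
```

Because the filter builds a new sequence instead of removing items in place, the iteration order no longer matters.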

CodePudding user response：

A better way is to use Split together with a StringBuilder, which is optimized for this kind of repeated appending. The code would look like this.

Pseudocode:

    var words = "I saw a cat and a horse on https://www.youtube.com/".Split(" ").ToList();
    var sb = new StringBuilder();
    foreach(var word in words){
        if(!word.StartsWith("https://")) sb.Append(word + " ");
    }
    return sb.ToString();
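
A runnable version of the pseudocode above might look like the sketch below. The class name StringBuilderExample is hypothetical, and writing the separator before each kept word (rather than after it) is a small change made here so that the result has no trailing space.

```csharp
using System;
using System.Text;

public static class StringBuilderExample
{
    public static string RemoveUrl(string input)
    {
        var sb = new StringBuilder();
        foreach (string word in input.Split(' '))
        {
            // Skip any word that starts with the URL scheme.
            if (word.StartsWith("https://", StringComparison.Ordinal)) continue;

            // Put a single space between kept words, but not before the first one.
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(word);
        }
        return sb.ToString();
    }
}
```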

CodePudding user response：

Basic string manipulation quickly becomes unwieldy for this; regular expressions make it very easy. Search for a piece of text that matches "http(s)?:\/\/\S*[^\s\.]":

http: the literal text http
(s)?: the optional (?) letter s
:\/\/: the characters ://
\S*: any number (*) of non-whitespace characters (\S)
[^\s\.]: any single character that is not (^) in the list ([ ]) of whitespace characters (\s) or dot (\.). This keeps the dot at the end of a sentence out of the matched URL.

using System;
using System.Text.RegularExpressions;

namespace UrlsDetector
{
  internal class Program
  {

    static void Main(string[] args)
    {
      Console.WriteLine(UrlDetector.RemoveUrl(
          "I saw a cat and a horse on https://www.youtube.com/ and also on http://www.example.com."));
      Console.ReadLine();
    }
  }

  class UrlDetector
  {
    public static string RemoveUrl(string input)
    {

      // A plain verbatim string is enough; no interpolation ($) is needed.
      var regex = new Regex(@"http(s)?:\/\/\S*[^\s.]");
      return regex.Replace(input, "");
    }
  }
}

With regular expressions you can also use Regex.Match(...) or Regex.IsMatch(...) to detect whether the text contains any URLs.
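
For the "no URL" message asked about in the question, detection and removal could be combined along these lines. This is a minimal sketch rather than a definitive implementation: the class name UrlReport and the method name Describe are invented for the example, and trimming the trailing whitespace after removal is an added assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlReport
{
    // Same pattern as above, written without the unnecessary escapes.
    private static readonly Regex UrlPattern = new Regex(@"https?://\S*[^\s.]");

    public static string Describe(string input)
    {
        // IsMatch tells us whether the text contains at least one URL.
        if (!UrlPattern.IsMatch(input))
            return "The sentence you input doesn't have a URL.";

        // Remove every URL, then drop the whitespace left at the end.
        return UrlPattern.Replace(input, "").TrimEnd();
    }
}
```

Calling Console.WriteLine(UrlReport.Describe(text)) then prints either the cleaned sentence or the message.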