Convert string to number array c#

Time:12-10

I have the following example string:

0 0 1 2.33 4
2.1 2 11 2

There are many ways to convert it to an array, but I need the fastest one, because files can contain 1 billion elements.

The string can contain any number of spaces between the numbers.

This is what I'm trying:

static void Main()
{
    string str = "\n\n\n 1 2 3   \r  2322.2 3 4 \n  0 0 ";

    byte[] byteArray = Encoding.ASCII.GetBytes(str);
    MemoryStream stream = new MemoryStream(byteArray);

    var values = ReadNumbers(stream);
}

public static IEnumerable<object> ReadNumbers(Stream st)
{
    // Assumption: the invariant culture is intended, so '.' is the decimal separator.
    var culture = CultureInfo.InvariantCulture;
    var buffer = new StringBuilder();
    using (var sr = new StreamReader(st))
    {
        while (!sr.EndOfStream)
        {
            char digit = (char)sr.Read();
            if (!char.IsDigit(digit) && digit != '.')
            {
                // A separator ends the current number, if one was collected.
                if (buffer.Length == 0) continue;
                double ret = double.Parse(buffer.ToString(), culture);
                buffer.Clear();
                yield return ret;
            }
            else
            {
                buffer.Append(digit);
            }
        }

        // Flush the last number when the input does not end with a separator.
        if (buffer.Length != 0)
        {
            double ret = double.Parse(buffer.ToString(), culture);
            buffer.Clear();
            yield return ret;
        }
    }
}

CodePudding user response:

There are a few things you can do to improve the performance of your code. First, you can use the Split method to split the string into an array of strings, where each element of the array is one number from the string. This is simpler than reading the string one character at a time and checking whether each character is a digit, and it may be faster, although it allocates a new string for every token.

Next, you can use double.TryParse to parse each element of the array into a double, rather than using double.Parse and catching any potential exceptions. TryParse reports invalid input through its return value, so a token that is not a valid double does not incur the cost of a thrown exception.

Here is an example of how you could implement this:

public static IEnumerable<double> ReadNumbers(string str)
{
    string[] parts = str.Split(new[] {' ', '\n', '\r', '\t'}, StringSplitOptions.RemoveEmptyEntries);
    foreach (string part in parts)
    {
        if (double.TryParse(part, NumberStyles.Any, CultureInfo.InvariantCulture, out double value))
        {
            yield return value;
        }
    }
}

CodePudding user response:

I'd suggest the simplest solution first, and hunt for nanoseconds only if there really is a problem with that code.

var doubles = myInput.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Select(x => double.Parse(x, whateverCulture));

Do that for every line in your file, not for the entire file at once, as reading such a huge file in one go may exhaust your memory.

This is pretty easy to understand. Afterwards, run a benchmark with your data and see whether parsing really affects performance. Chances are, however, that the actual bottleneck is reading that huge file, which is essentially an I/O matter.
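As a rough sketch of the line-by-line idea, the following reads one line at a time from a TextReader and parses the tokens on it, so only the current line is held in memory. The names LineParser, ReadNumbers and ReadNumbersFromFile are hypothetical, and the invariant culture is an assumption that stands in for whateverCulture.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class LineParser
{
    // Separators inside a line; ReadLine already strips the line terminator.
    private static readonly char[] Separators = { ' ', '\t' };

    // Lazily yields every number from the reader, one line at a time.
    public static IEnumerable<double> ReadNumbers(TextReader reader)
    {
        for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
        {
            foreach (string token in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            {
                // Assumption: invariant culture, i.e. '.' as the decimal separator.
                yield return double.Parse(token, CultureInfo.InvariantCulture);
            }
        }
    }

    // Hypothetical convenience wrapper; the caller supplies the path.
    public static IEnumerable<double> ReadNumbersFromFile(string path)
    {
        using (var reader = new StreamReader(path))
        {
            foreach (double value in ReadNumbers(reader))
            {
                yield return value;
            }
        }
    }
}
```

Because the method is an iterator, nothing is read until the caller enumerates the result, for example with foreach (double d in LineParser.ReadNumbersFromFile(path)).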

CodePudding user response:

You can improve your solution by avoiding the creation of many objects on the heap. In particular, buffer.ToString() is called for every number and creates a new string each time. You can use a ReadOnlySpan<char> struct to slice the string and at the same time avoid heap allocations. A span provides pointers into the original string without making copies of it or parts of it when slicing.

Also do not return the doubles as object, as this will box them. I.e., it will store them on the heap. See: Boxing and Unboxing (C# Programming Guide). If you prefer your solution over mine, use an IEnumerable<double> as return type of your method.

The use of a ReadOnlySpan, however, has the disadvantage that it cannot be used in iterator methods. The reason is that a ReadOnlySpan must be allocated on the stack, but an iterator method wraps its state in a class. If you try, you will get the Compiler Error CS4013:

Instance of type cannot be used inside a nested function, query expression, iterator block or async method

Therefore, we must either store the numbers in a collection or consume them in-place. Since I don't know what you want to do with the numbers, I use the former approach:

public static List<double> ReadNumbers(string input)
{
    ReadOnlySpan<char> inputSpan = input.AsSpan();
    int start = 0;
    bool isNumber = false;
    var list = new List<double>(); // Improve by passing the expected maximum length.
    int i;
    for (i = 0; i < inputSpan.Length; i++) {
        char c = inputSpan[i];
        bool isDigit = Char.IsDigit(c);
        if (isDigit && !isNumber) {
            start = i;
            isNumber = true;
        } else if (isNumber && !isDigit && c != '.') {
            isNumber = false;
            if (Double.TryParse(inputSpan[start..i], CultureInfo.InvariantCulture, out double d)) {
                list.Add(d);
            }
        }
    }
    if (isNumber) {
        if (Double.TryParse(inputSpan[start..i], CultureInfo.InvariantCulture, out double d)) {
            list.Add(d);
        }
    }
    return list;
}

inputSpan[start..i] creates a slice as ReadOnlySpan<char>.

Test

string str = "\n\n\n 1 2 3   \r  2322.2 3 4 \n  0 0 ";
foreach (double d in ReadNumbers(str)) {
    Console.WriteLine(d);
}
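
The other option mentioned above, consuming the numbers in place instead of collecting them, could look roughly like the sketch below. It is a variation, not the method shown above: it splits on whitespace rather than on digit characters, and the names InPlaceParser, ForEachNumber and onNumber are hypothetical. The span stays a local of a regular (non-iterator) method, so CS4013 does not apply.

```csharp
using System;
using System.Globalization;

public static class InPlaceParser
{
    // Calls onNumber for each number found in input; nothing is stored.
    public static void ForEachNumber(string input, Action<double> onNumber)
    {
        ReadOnlySpan<char> span = input.AsSpan();
        int i = 0;
        while (i < span.Length)
        {
            // Skip the whitespace in front of the next token.
            while (i < span.Length && char.IsWhiteSpace(span[i])) i++;
            int start = i;

            // Advance to the end of the token.
            while (i < span.Length && !char.IsWhiteSpace(span[i])) i++;

            // Parse the slice directly; no substring is allocated.
            // Tokens that are not valid numbers are silently skipped.
            if (i > start &&
                double.TryParse(span.Slice(start, i - start), NumberStyles.Float,
                                CultureInfo.InvariantCulture, out double value))
            {
                onNumber(value);
            }
        }
    }
}
```

For example, double sum = 0; InPlaceParser.ForEachNumber(str, d => sum += d); adds up all numbers without building a list.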

But whenever you are asking for speed, you must run benchmarks to compare the different approaches. Very often what seems a superior solution may fail in the benchmark.
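
A very rough way to get first numbers is a Stopwatch around each candidate, as in the sketch below. The names ParseTiming, BuildInput, ParseWithSplit and Measure are hypothetical and the input is synthetic. A single timed run is noisy, so treat the result as an indication only; a dedicated benchmark harness such as BenchmarkDotNet gives more trustworthy numbers.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;
using System.Text;

public static class ParseTiming
{
    // Builds synthetic input such as "0.5 1.5 2.5 " with count numbers.
    public static string BuildInput(int count)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < count; i++)
        {
            sb.Append(i.ToString(CultureInfo.InvariantCulture)).Append(".5 ");
        }
        return sb.ToString();
    }

    // Split-based candidate; returns how many values were parsed.
    public static int ParseWithSplit(string input)
    {
        int parsed = 0;
        foreach (string token in input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (double.TryParse(token, NumberStyles.Float, CultureInfo.InvariantCulture, out _))
            {
                parsed++;
            }
        }
        return parsed;
    }

    // Runs the action once as a warm-up (so JIT compilation is not timed),
    // then returns the elapsed time of a second run.
    public static TimeSpan Measure(Action action)
    {
        action();
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed;
    }
}
```

Each approach can then be timed on the same input, for example ParseTiming.Measure(() => ParseTiming.ParseWithSplit(input)), and the elapsed times compared.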

See also: All About Span: Exploring a New .NET Mainstay
