Processing large strings containing a lot of numbers in C#

Time:01-12

We need to process a lot of strings, which contain many numbers (formatted as strings) and are separated by an arbitrary character.

Ex: numbers separated by a space:

var s = "1234.555 43434.43 434.436 85656.253 564.5656 <etc.>";

Now we are using String.Split to first create an array of strings (containing each number) and then parse these using double.Parse() to a numeric type.

var stringArr = s.Split(separator, StringSplitOptions.RemoveEmptyEntries);

// Some iteration/loop in which at some point this routine is called
var d = double.Parse(stringArr[i]);

But we notice this takes up a lot of resources (memory allocation and CPU), so we would like to create at least a less memory-consuming solution.

The first step is to enumerate the number occurrences in the string and then parse the substrings to a number.

For enumerating, we already found a nice solution on SO: What are the alternatives to Split a string in c# that don't use String.Split()

However, this routine then uses Substring to read a partial string from the entire string, so I would expect this to result (again) in an extra memory allocation just to be able to read that partial string...

If so (regarding memory allocation), is there a way to only parse the found part directly from the original string (without allocating a new string) to a numeric value...

I don't mind using unsafe code, if necessary... It will only be marked unsafe during the brief period the numbers are extracted.

CodePudding user response:

"separated by an arbitrary character"

This forces you to either handle only integers, or to restrict which characters may be used as separators and/or as the decimal separator. For example:

1 234.567

might be interpreted as a single number, two numbers, or three numbers, depending on which characters you allow as separators. To make things more complicated, the thousands separator and the decimal separator both depend on culture.

If you want to minimize memory usage, double.Parse has an overload that takes a ReadOnlySpan<char> as input (available since .NET Core 2.1). So just iterate over your string: once you encounter a number character, save the index and advance to the next non-number character, then use ReadOnlySpan<char>.Slice (on s.AsSpan()) to create a span over the number's characters and parse it. Continue until you have processed all characters.
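A minimal sketch of that loop, assuming that digits, '.', '+', '-', 'e' and 'E' are the only characters that belong to a number and that everything else is a separator. The names SpanNumberScanner, IsNumberChar and ParseAll are made up for illustration. The result list is still allocated, but no intermediate strings are created.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class SpanNumberScanner
{
    // Assumption: only these characters can be part of a number.
    private static bool IsNumberChar(char c) =>
        (c >= '0' && c <= '9') || c == '.' || c == '+' || c == '-' || c == 'e' || c == 'E';

    public static List<double> ParseAll(string text)
    {
        var values = new List<double>();
        ReadOnlySpan<char> span = text.AsSpan(); // a view over the string, no copy
        int i = 0;
        while (i < span.Length)
        {
            // Skip anything that is not part of a number (the separators).
            while (i < span.Length && !IsNumberChar(span[i])) i++;
            int start = i;
            // Advance to the end of the current run of number characters.
            while (i < span.Length && IsNumberChar(span[i])) i++;
            // Parse the slice directly; runs that are not valid numbers are skipped.
            if (i > start &&
                double.TryParse(span.Slice(start, i - start), NumberStyles.Float,
                                CultureInfo.InvariantCulture, out double value))
            {
                values.Add(value);
            }
        }
        return values;
    }
}
```

Calling SpanNumberScanner.ParseAll(s) on the string from the question should yield the five values in order. A run such as "1-2" consists only of number characters but is not a valid number, so it is silently dropped; you may want to report that instead.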

But you will still need to define which characters count as part of a number and which do not. Note that you might want to parse with CultureInfo.InvariantCulture to get consistent behavior regardless of region.
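To illustrate why the format provider matters, here is a small sketch that parses the same text with the invariant culture and with a hand-built format that uses ',' as the decimal separator and '.' as the group separator. CultureParsingDemo and its members are hypothetical names. The format is built by hand rather than loaded from a named culture so that the example does not depend on which cultures are installed.

```csharp
using System;
using System.Globalization;

public static class CultureParsingDemo
{
    // Hypothetical format: ',' is the decimal separator, '.' the group separator.
    public static readonly NumberFormatInfo CommaDecimal = new NumberFormatInfo
    {
        NumberDecimalSeparator = ",",
        NumberGroupSeparator = "."
    };

    private const NumberStyles Styles = NumberStyles.Float | NumberStyles.AllowThousands;

    // '.' is always the decimal separator here.
    public static double ParseInvariant(ReadOnlySpan<char> text) =>
        double.Parse(text, Styles, CultureInfo.InvariantCulture);

    // '.' is treated as a group separator here.
    public static double ParseCommaDecimal(ReadOnlySpan<char> text) =>
        double.Parse(text, Styles, CommaDecimal);
}
```

With the text "1.234", ParseInvariant returns 1.234 while ParseCommaDecimal returns 1234, which is the kind of silent difference you get when relying on the machine's current culture.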

CodePudding user response:

I fiddled something.

This is based on the assumption that the "arbitrary" split char is known.

I also added a variant that serves as the entry point to a Dataflow pipeline, because ReadOnlySpan has some quirks with async and enumeration :D

It is a naive approach to give an idea. You may want to add some sanity checks etc.

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public class Program
{
    public static async Task Main()
    {
        var s = "1234.555 43434.43 434.436 85656.253 564.5656";
        //foreach( var d in Parse(s, ' ') ) Console.WriteLine(d);
        var ab = new ActionBlock<double>(d => Console.WriteLine(d));
        ParseAndEnqueue(s, ' ', ab);
        await ab.Completion;
    }

    // Parses every token into a list. The list is still allocated, but no substrings are.
    public static IEnumerable<double> Parse(ReadOnlySpan<char> input, char splitChar)
    {
        var result = new List<double>();
        while( !input.IsEmpty )
        {
            int nextIndex = input.IndexOf(splitChar);
            // Take everything up to the separator, or the remainder if there is none.
            var token = nextIndex < 0 ? input : input[..nextIndex];
            if(TryParseInvariant(token, out double myVal)) result.Add(myVal);
            if( nextIndex < 0 ) break;
            input = input[(nextIndex + 1)..];
        }

        return result;
    }

    // Same loop, but posts each value to a Dataflow block instead of collecting it.
    public static void ParseAndEnqueue(ReadOnlySpan<char> input, char splitChar, ActionBlock<double> ab)
    {
        while( !input.IsEmpty )
        {
            int nextIndex = input.IndexOf(splitChar);
            var token = nextIndex < 0 ? input : input[..nextIndex];
            if(TryParseInvariant(token, out double myVal)) ab.Post(myVal);
            if( nextIndex < 0 ) break;
            input = input[(nextIndex + 1)..];
        }
        ab.Complete();
    }

    // Invariant culture, so '.' is the decimal separator regardless of region settings.
    private static bool TryParseInvariant(ReadOnlySpan<char> token, out double value) =>
        double.TryParse(token, NumberStyles.Float, CultureInfo.InvariantCulture, out value);
}
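One possible way around the enumeration quirk, sketched under the same assumption of a single known split char: a ref struct that follows the foreach pattern (GetEnumerator, MoveNext, Current), so the values can be consumed lazily without building a list first. DoubleTokenizer is a made-up name, not an existing framework type. Because it is a ref struct, it still cannot be stored in a field or kept alive across an await.

```csharp
using System;
using System.Globalization;

public ref struct DoubleTokenizer
{
    private ReadOnlySpan<char> _rest; // the part of the input not yet consumed
    private readonly char _splitChar;
    private double _current;

    public DoubleTokenizer(ReadOnlySpan<char> input, char splitChar)
    {
        _rest = input;
        _splitChar = splitChar;
        _current = 0;
    }

    public double Current => _current;

    // foreach only needs a GetEnumerator method; no interface is required.
    public DoubleTokenizer GetEnumerator() => this;

    public bool MoveNext()
    {
        while (!_rest.IsEmpty)
        {
            int idx = _rest.IndexOf(_splitChar);
            ReadOnlySpan<char> token = idx < 0 ? _rest : _rest.Slice(0, idx);
            _rest = idx < 0 ? ReadOnlySpan<char>.Empty : _rest.Slice(idx + 1);
            // Empty or malformed tokens are skipped.
            if (double.TryParse(token, NumberStyles.Float,
                                CultureInfo.InvariantCulture, out double value))
            {
                _current = value;
                return true;
            }
        }
        return false;
    }
}
```

Usage would look like: foreach (double d in new DoubleTokenizer(s, ' ')) Console.WriteLine(d);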

=> Fiddle
