IndexOf char within an ReadOnlySpan<byte> of UTF8 bytes

Time:02-04

I'm looking for an efficient, allocation-free (!) implementation of

public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, char @char)
{
   // Should return the index of the first byte of @char within utf8Bytes
   // (not the character index of @char within the string)
}

I've not found a way to iterate through the span char by char yet. Utf8Parser does not have an overload supporting single characters. And System.Text.Encoding seems to work mostly on the entire span, and does allocate internally while doing so.

Is there any builtin functionality I haven't spotted yet? If not, can anyone think of a reasonable custom implementation?

CodePudding user response:

You can avoid heap allocations with stackalloc. A first approximation could look like this:

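static (int Found, int Processed) IndexOf(ReadOnlySpan<byte> utf8Bytes, char @char)
{
    // Worst case: every byte decodes to a separate char.
    Span<char> chars = stackalloc char[utf8Bytes.Length];
    // proc is the number of chars actually written to the buffer.
    var proc = Encoding.UTF8.GetChars(utf8Bytes, chars);
    // Search only the decoded part, not the unused tail of the buffer.
    var indexOf = chars.Slice(0, proc).IndexOf(@char);
    if (indexOf > 0)
    {
        // Re-encode the chars before the match; their byte count is the byte index.
        Span<byte> bytes = stackalloc byte[indexOf * 4];
        var result = Encoding.UTF8.GetBytes(chars.Slice(0, indexOf), bytes);
        return (result, proc);
    }
    // indexOf is 0 (match at the first byte) or -1 (not found).
    return (indexOf, proc);
}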

A few notes:

  1. Large incoming spans can overflow the stack, because the stackalloc size grows with the input
  2. Decoding the whole span is wasteful when the character occurs early
  3. The span can contain partial code points at its start and end, so the Processed value must be handled accordingly
  4. The first two points can be mitigated by processing the incoming span in smaller slices (for example, decoding 4 bytes into a 4-char span at a time).
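
As a rough sketch of one way to address the first note, the stack buffer can be capped at a fixed size, with larger inputs falling back to a rented array. This swaps in ArrayPool<char> rather than the slicing suggested in note 4, so it bounds stack usage but still decodes the whole span. The names Utf8Search, IndexOfPooled and StackLimit are hypothetical, and the code assumes the input is valid UTF-8 and that the searched char is not a surrogate:

```csharp
using System;
using System.Buffers;
using System.Text;

public static class Utf8Search
{
    // Hypothetical threshold: inputs up to this size are decoded on the stack.
    const int StackLimit = 256;

    public static int IndexOfPooled(ReadOnlySpan<byte> utf8Bytes, char target)
    {
        char[] rented = null;
        // Small inputs use a fixed-size stack buffer; larger ones rent from the shared pool.
        Span<char> chars = utf8Bytes.Length <= StackLimit
            ? stackalloc char[StackLimit]
            : (rented = ArrayPool<char>.Shared.Rent(utf8Bytes.Length));
        try
        {
            int written = Encoding.UTF8.GetChars(utf8Bytes, chars);
            int charIndex = chars.Slice(0, written).IndexOf(target);
            if (charIndex < 0)
                return -1;
            // The byte index equals the encoded length of the chars before the match.
            // This only holds for valid UTF-8 input (an assumption of this sketch).
            return Encoding.UTF8.GetByteCount(chars.Slice(0, charIndex));
        }
        finally
        {
            if (rented != null)
                ArrayPool<char>.Shared.Return(rented);
        }
    }
}
```

The pool may still allocate the first time a buffer of a given size is rented, so this is only allocation-free in the steady state.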

I believe System.IO.Pipelines deals with the same issues (via System.Buffers), but 1) it may not be completely allocation-free, and 2) I have not investigated it enough to provide a complete working example.

CodePudding user response:

From .NET 5 onwards, there's a library method EncodingExtensions.GetChars to help you.

Specifically, you want the overload that gets the byte data from a ReadOnlySpan and writes to an IBufferWriter<char>, which you can then implement to receive your characters one by one and run whatever on them (your matching algorithm, for example). This solution is allocation-free of course, as long as you put your custom buffer writer in a static field and allocate it only once.

CodePudding user response:

Rather than trying to iterate through the utf8Bytes character by character, it may be easier to convert the character to a short stackalloc'ed utf8 byte sequence, and search for that:

public static class StringExtensions
{
    const int MaxBytes = 4;

    public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, char @char)
    {
        Rune rune;
        try
        {
            rune = new Rune(@char);
        }
        catch (ArgumentOutOfRangeException)
        {
            // Malformed unicode character, return -1 or throw?
            return -1;
        }
        return utf8Bytes.IndexOf(rune);
    }       
    
    public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, Rune @char)
    {
        Span<byte> charBytes = stackalloc byte[MaxBytes];
        var n = @char.EncodeToUtf8(charBytes);
        charBytes = charBytes.Slice(0, n);
        
        for (int i = 0, thisLength = 1; i <= utf8Bytes.Length - charBytes.Length; i += thisLength)
        {
            thisLength = Utf8ByteSequenceLength(utf8Bytes[i]);
            if (thisLength == charBytes.Length && charBytes.CommonPrefixLength(utf8Bytes.Slice(i)) == charBytes.Length)
                return i;
        }
        return -1;
    }       
    
    static int Utf8ByteSequenceLength(byte firstByte)
    {
        //https://en.wikipedia.org/wiki/UTF-8#Encoding
        if (     (firstByte & 0b11111000) == 0b11110000) // 11110xxx
            return 4;
        else if ((firstByte & 0b11110000) == 0b11100000) // 1110xxxx
            return 3;
        else if ((firstByte & 0b11100000) == 0b11000000) // 110xxxxx
            return 2;
        return 1; // Either a 1-byte sequence (matching 0xxxxxxx) or an invalid start byte.
    }
}

Notes:

  • Rune is a struct introduced in .NET Core 3.x that represents a Unicode scalar value. If you need to search your utf8Bytes for a Unicode codepoint that is not in the basic multilingual plane, you will need to use Rune.

    Rune has the added advantage that its method Rune.TryEncodeToUtf8() is lightweight and allocation-free.

  • If char @char is not a valid Unicode scalar value on its own (that is, it is a lone surrogate), the Rune constructor throws an ArgumentOutOfRangeException. The above code catches the exception and returns -1. You may wish to rethrow the exception instead.

  • As an alternative, Rune.DecodeFromUtf8(ReadOnlySpan<Byte>, Rune, Int32) can be used to iterate through a utf8 byte span Rune by Rune. You could use that to locate an incoming Rune by index. However, I suspect doing so would be less efficient than the method above.
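
For comparison, here is a minimal sketch of the Rune.DecodeFromUtf8 alternative mentioned in the last note. The names Utf8RuneSearch and IndexOfByDecoding are hypothetical. The char overload uses Rune.TryCreate in place of the try/catch, so a lone surrogate yields -1 without an exception. Invalid or truncated sequences in the input are skipped rather than matched:

```csharp
using System;
using System.Buffers;
using System.Text;

public static class Utf8RuneSearch
{
    // Walks the span one Rune at a time and returns the byte offset of the first match.
    public static int IndexOfByDecoding(ReadOnlySpan<byte> utf8Bytes, Rune target)
    {
        int offset = 0;
        while (offset < utf8Bytes.Length)
        {
            OperationStatus status = Rune.DecodeFromUtf8(
                utf8Bytes.Slice(offset), out Rune current, out int bytesConsumed);

            // Only a successfully decoded Rune counts as a match.
            if (status == OperationStatus.Done && current == target)
                return offset;

            // Defensive guard so the loop always makes progress.
            if (bytesConsumed <= 0)
                break;

            offset += bytesConsumed;
        }
        return -1;
    }

    // A lone surrogate cannot be turned into a Rune, so it is reported as not found.
    public static int IndexOfByDecoding(ReadOnlySpan<byte> utf8Bytes, char target)
        => Rune.TryCreate(target, out Rune rune) ? IndexOfByDecoding(utf8Bytes, rune) : -1;
}
```

Unlike the byte-sequence search above, this decodes every code point it passes, which is why it is likely to be slower on long inputs.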

