How can I retain spaces between words that don't contain numbers when splitting a multi-word string?

Is it possible to split a string by space but leave names together?

Example:

"1 23565 john smith 01/01/2021 another"

Expected:

string[] {"1", "23565", "john smith", "01/01/2021", "another"}

A name in this case is any word in the string that doesn't contain digits. "Name" words are always preceded and followed by "number" words.

CodePudding user response：

You can try regular expressions, e.g.

 using System.Text.RegularExpressions;

 ...

 string source = "1 23565 john smith 01/01/2021 another";

 string[] result = Regex.Split(source, @"(?<=\P{L})\s+|\s+(?=\P{L})");

 // Let's have a look:
 Console.WriteLine(string.Join(", ", result));

Outcome:

 1, 23565, john smith, 01/01/2021, another

Here I've used the pattern (?<=\P{L})\s+|\s+(?=\P{L}):

(?<=\P{L})\s+ - look behind (not a letter),
                then one or more whitespace characters
|             - or
\s+(?=\P{L})  - one or more whitespace characters,
                then look ahead (not a letter)

CodePudding user response：

This seems like an XY problem, so I'm going to focus on solving X instead of Y.

No and yes.

There may be some Regex to do this for you, but that includes extra complexity and overhead. And I don't know regex well enough to offer an example.

If you try just a regular string.Split, then you can't avoid splitting the name. However, if you know that the string will be formatted exactly the same way 100% of the time, you can concatenate the third and fourth elements of the split result (indices 2 and 3) back together. It is a manual process, though, and it will break if the string format ever changes.

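string value = "1 23565 john smith 01/01/2021 another";
// Split returns string[], so copy it into a List<string> (System.Collections.Generic) to allow RemoveAt
List<string> values = new List<string>(value.Split(' '));
values[2] += " " + values[3];   // "john" + " " + "smith"
values.RemoveAt(3);             // drop the now-merged "smith"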
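
If the position of the name is not fixed, the same split-then-rejoin idea can be generalized: split on spaces, then merge neighbouring tokens that contain no digits. This is only a rough sketch under the question's definition of a "name" (a word without digits); `TokenMerger` and `SplitKeepingNames` are made-up names for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TokenMerger
{
    // Hypothetical helper: splits on spaces, then glues together
    // neighbouring tokens that contain no digits ("name" words).
    public static List<string> SplitKeepingNames(string input)
    {
        var result = new List<string>();
        bool previousWasName = false;

        foreach (string token in input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            bool isName = !token.Any(char.IsDigit);

            if (isName && previousWasName)
                result[result.Count - 1] += " " + token;   // extend the current name
            else
                result.Add(token);                          // start a new entry

            previousWasName = isName;
        }

        return result;
    }
}
```

With the string from the question, `string.Join(", ", TokenMerger.SplitKeepingNames(value))` gives `1, 23565, john smith, 01/01/2021, another`. Unlike the index-based version, this does not depend on the name sitting at a fixed position.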

A better option might be to avoid this altogether by using JSON or XML.

These are both data interchange formats that let you keep different pieces of data separate while still transferring them together.

Examples:

// JSON
{
  "id": 1,
  "randomNumber": 23565,
  "name": "john smith",
  "hireDate": "01/01/2021",
  "description": "another"
}

// XML (a document needs a single root element; "record" is just a placeholder name)
<?xml version="1.0" encoding="UTF-8"?>
<record>
  <id>1</id>
  <randomNumber>23565</randomNumber>
  <name>john smith</name>
  <hireDate>01/01/2021</hireDate>
  <description>another</description>
</record>

As you might notice, XML is more verbose and a bit more complex. It's still a very popular format for data transport, but JSON's smaller size and easier-to-read syntax are leading people to move more projects over to JSON.

There are plenty of existing libraries to convert data into and out of both of these formats, so you don't have to write any of that yourself. .NET itself has some conversion for these formats built in (for example System.Text.Json and System.Xml), but that has limits and caveats that tend to make people use external libraries (commonly found on NuGet).
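
As a rough illustration of the built-in route, here is a minimal sketch using System.Text.Json (part of .NET Core 3.0 and later). The `PersonRecord` type and the `RecordJson` helper are hypothetical; their property names simply mirror the JSON example above, and the camel-case naming policy is my assumption to match its key style.

```csharp
using System;
using System.Text.Json;

// Hypothetical type: one property per field in the JSON example.
public class PersonRecord
{
    public int Id { get; set; }
    public int RandomNumber { get; set; }
    public string Name { get; set; }
    public string HireDate { get; set; }
    public string Description { get; set; }
}

public static class RecordJson
{
    // Emit and read keys as "id", "randomNumber", ... instead of "Id", "RandomNumber", ...
    private static readonly JsonSerializerOptions Options = new JsonSerializerOptions
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase
    };

    public static string ToJson(PersonRecord record) =>
        JsonSerializer.Serialize(record, Options);

    public static PersonRecord FromJson(string json) =>
        JsonSerializer.Deserialize<PersonRecord>(json, Options);
}
```

Calling `RecordJson.FromJson(RecordJson.ToJson(record))` should give back an object with the same field values, so the name arrives as a single "john smith" value and never has to be re-split.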

And with the generally fast internet speeds people have for even their home and cell use, the small amount of extra overhead needed for these formats generally doesn't outweigh the ease of using the data formats.

CodePudding user response：

We are asked to split the string on spaces other than spaces between names, where a name is any word in the string that does not contain a number.

One can split on matches of the following:

(?<=(?:^| )\S*\d\S*) | (?=\S*\d\S*(?: |\z))

The matches are shown below by the party hats.

1 23565 john smith 01/01/2021 another
 ^     ^          ^          ^

1 23565 john smith1 01/01/2021 another
 ^     ^    ^      ^          ^

1 23565 1john smith 01/01/2021 another
 ^     ^     ^     ^          ^

1 23565 jo1hn sm1th 01/01/2021 another
 ^     ^     ^     ^          ^

The regular expression has the following elements:

(?<=        # begin positive lookbehind
  (?:^| )   # match beginning of string or a space in a non-capture group
  \S*\d\S*  # match zero or more chars other than whitespace, followed by
            # a digit, followed by zero or more chars other than whitespace
)           # end positive lookbehind
[ ]         # match a space
|           # or
[ ]         # match a space
(?=         # begin positive lookahead
  \S*\d\S*  # match zero or more chars other than whitespace, followed by
            # a digit, followed by zero or more chars other than whitespace
  (?: |\z)  # match a space or end of string in a non-capture group
)           # end positive lookahead

I've put each of the spaces in a character class ([ ]) to make them visible.

This regular expression makes use of the fact that C#'s regex engine supports variable-length lookbehinds, a feature most regex engines lack.
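
To try the pattern in C#, it can be passed to Regex.Split. This is a minimal sketch; the class and method names (`DigitWordSplitter`, `Split`) are placeholders of mine and not part of the answer's original text.

```csharp
using System.Text.RegularExpressions;

public static class DigitWordSplitter
{
    // Splits on a space that follows a word containing a digit,
    // or that precedes a word containing a digit.
    // The two literal spaces in the pattern are the split points.
    private static readonly Regex Separator =
        new Regex(@"(?<=(?:^| )\S*\d\S*) | (?=\S*\d\S*(?: |\z))");

    public static string[] Split(string input) => Separator.Split(input);
}
```

`DigitWordSplitter.Split("1 23565 john smith 01/01/2021 another")` returns the five elements `1`, `23565`, `john smith`, `01/01/2021`, `another`. With `"1 23565 john smith1 01/01/2021 another"`, `john` and `smith1` come back as separate elements, because `smith1` contains a digit.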

Depending on requirements, one may want to replace \S*\d\S* with [a-z\d]*\d[a-z\d]*, possibly with the case-insensitive flag (i) set.