How can I split text with these custom rules?

Time:02-14

I want to split the text:

import re

text = " climb   -  95/ 85     0.18   low     -  4680"

split_text = re.split(" +", text)

print(split_text)

['', 'climb', '-', '95/', '85', '0.18', 'low', '-', '4680']

My problem is that " 95/ 85" should not be split.

How can I get as result:

# scanned_text = ['', 'climb', '-', ' 95/ 85', '0.18', 'low', '-', '4680']

CodePudding user response:

Simply add a second space before the + in the pattern, so that it only matches runs of two or more spaces. This will stop the 95/ 85 from being split. If you want \n at the end of the last item, add text += "\n".

import re

text = " climb   -  95/ 85     0.18   low     -  4680"

text = "a " + text

text += "\n"

split_text = re.split("  +", text)

if split_text[0] == "a":
  split_text[0] = ""
else:
  split_text[0] = split_text[0][2:]

print(split_text)

CodePudding user response:

There could be many rules that would apply to your single example but still be wrong for the pattern of data you have to process. So you're forcing us to guess what the rule for that 95/ 85 exception is.

Here's a wild guess: spaces following a forward slash are not to be treated as separators.

In which case, you could handle it using a negative lookbehind:

import re

text = " climb   -  95/ 85     0.18   low     -  4680"

split_text = re.split(r"(?<!\/) +", text)

print(split_text)

['', 'climb', '-', '95/ 85', '0.18', 'low', '-', '4680']
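
One caveat with this guess, shown as a small sketch: the lookbehind only protects the single space directly after the slash. The second input string below is made up for illustration. With two spaces after the slash, the second space is no longer preceded by "/" and is treated as a separator.

```python
import re

# One or more spaces, unless the run starts right after a "/".
pattern = r"(?<!/) +"

one_space = re.split(pattern, "95/ 85  0.18")
two_spaces = re.split(pattern, "95/  85  0.18")  # made-up input

print(one_space)   # ['95/ 85', '0.18']
print(two_spaces)  # ['95/ ', '85', '0.18']
```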

The exception rule could also be: the 4th and 5th values need to be combined.

In which case you could do this:

split_text = re.split(" +", text)

split_text[3:5] = [" ".join(split_text[3:5])]

print(split_text)

['', 'climb', '-', '95/ 85', '0.18', 'low', '-', '4680']

Obviously different rules that give the right output for this example will produce different results for other strings. That's why you need to be specific.
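
To make that concrete, here is a rough sketch that wraps both guesses in functions and runs them on a second line. The function names and the second line (with no space after the slash) are hypothetical, not taken from your data. Both rules agree on your example but disagree on the other line.

```python
import re

def split_lookbehind(text):
    # Rule 1: runs of spaces separate fields, except right after a "/".
    return re.split(r"(?<!/) +", text)

def split_merge_4th_5th(text):
    # Rule 2: split on every run of spaces, then join the 4th and 5th items.
    parts = re.split(r" +", text)
    parts[3:5] = [" ".join(parts[3:5])]
    return parts

original = " climb   -  95/ 85     0.18   low     -  4680"
other = " climb   -  100/100     0.18   low     -  4680"  # hypothetical line

print(split_lookbehind(original) == split_merge_4th_5th(original))  # True
print(split_lookbehind(other))
# ['', 'climb', '-', '100/100', '0.18', 'low', '-', '4680']
print(split_merge_4th_5th(other))
# ['', 'climb', '-', '100/100 0.18', 'low', '-', '4680']
```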

CodePudding user response:

You can split on runs of at least 2 whitespace characters:

import re

text = " climb   -  95/ 85     0.18   low     -  4680"

split_text = re.split(r"\s{2,}", text)

print(split_text)
# [' climb', '-', '95/ 85', '0.18', 'low', '-', '4680']
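
If the leading space on the first item is unwanted, one option (a sketch, not the only way) is to strip the string before splitting. Using a literal space with {2,} instead of \s is an assumption that the fields are only ever separated by spaces, never tabs or newlines.

```python
import re

text = " climb   -  95/ 85     0.18   low     -  4680"

# Drop leading/trailing whitespace, then split on runs of 2+ spaces.
fields = re.split(r" {2,}", text.strip())

print(fields)
# ['climb', '-', '95/ 85', '0.18', 'low', '-', '4680']
```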

It also works without regex, although splitting on exactly two spaces leaves stray leading spaces and empty strings:

text = " climb   -  95/ 85     0.18   low     -  4680"

split_text = text.split('  ')

print(split_text)
# [' climb', ' -', '95/ 85', '', ' 0.18', ' low', '', ' -', '4680']

With some more manipulation, you can also strip the extra spaces (the empty strings remain):

text = " climb   -  95/ 85     0.18   low     -  4680"

split_text = [x.strip() for x in text.split('  ')]

print(split_text)
# ['climb', '-', '95/ 85', '', '0.18', 'low', '', '-', '4680']
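
To drop the empty strings as well without regex, a possible sketch is to filter them out in the list comprehension. This assumes an empty field is never meaningful in your data.

```python
text = " climb   -  95/ 85     0.18   low     -  4680"

# Strip each piece and keep only the non-empty ones.
fields = [part.strip() for part in text.split("  ") if part.strip()]

print(fields)
# ['climb', '-', '95/ 85', '0.18', 'low', '-', '4680']
```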