Split text into list based on specific pattern python

Time:03-25

I want to split my text into a list based on a certain pattern. For example, my text is:

134. Lorem Ipsum is simply dummy text of the printing and typesetting industry 135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book 136. It has survived not only five centuries

I want to convert it into a list based on the unique number as below:

    [134. Lorem Ipsum is simply dummy text of the printing and typesetting industry, 
     135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
     136. It has survived not only five centuries]

I already tried using:

import re
xx = re.split(pattern="d{1,3}. ", string=file_read)
list = []

for xy in xx:
    xy = re.sub(pattern="\s+", repl=" ", string=xy)
    list.append(xy)

But the output is:

[134. Lorem Ipsum is simply dummy text of the printing and typesetting industry 135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s 136. It has survived not only five centuries]

CodePudding user response:

You can write:

import re

s = "134. Lorem Ipsum is simply dummy text of the printing and typesetting industry 135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book 136. It has survived not only five centuries"
# Split on spaces that are followed by digits, a period, spaces and a capital letter
rgx = r' +(?=\d+\. +[A-Z])'
re.split(rgx, s)
  #=> ['134. Lorem Ipsum is simply dummy text of the printing and typesetting industry',
  #    "135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book",
  #    '136. It has survived not only five centuries']


As seen, the string is split on matches of one or more spaces. The regular expression reads, "match one or more spaces immediately followed by one or more digits followed by a period, followed by one or more spaces, followed by a capital letter".
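The point of the lookahead is that it checks for the number without consuming it. A minimal sketch of the difference, using a shortened sample string that is made up for illustration (not the asker's text):

```python
import re

text = "134. First item 135. Second item 136. Third item"

# Splitting on the number itself consumes it: the numbers are lost and
# a leading empty string appears before the first match.
consumed = re.split(r"\d{1,3}\. ", text)

# Splitting on spaces that are only *followed by* a number keeps the
# numbers, because a lookahead matches without consuming characters.
kept = re.split(r" +(?=\d+\. +[A-Z])", text)

print(consumed)  # ['', 'First item ', 'Second item ', 'Third item']
print(kept)      # ['134. First item', '135. Second item', '136. Third item']
```

So even with the digit escaped correctly, a plain split on the number pattern would not give the list asked for in the question.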

CodePudding user response:

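Another approach is to match the pieces you want instead of splitting, for example with re.findall.

Note that to match a digit you have to escape the d, as in \d{1,3}. The dot should be escaped as well (\.), because an unescaped . matches any character.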

\b\d{1,3}\. .*?(?=\b\d{1,3}\. |$)

The pattern matches:

  • \b\d{1,3}\. A word boundary, match 1-3 digits, a dot and space
  • .*? Match as few characters as possible (lazy match)
  • (?= Positive lookahead to assert to the right
    • \b\d{1,3}\. |$ Match the number pattern to the right or the end of string
  • ) Close lookahead
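The escaping note is easy to check directly. The sketch below compares the unescaped pattern from the question with an escaped one on a shortened sample string (chosen here for illustration):

```python
import re

text = "134. Lorem Ipsum is simply dummy text 135. Lorem Ipsum has been the standard"

# Unescaped: a literal "d" (1-3 times), then any character, then a space.
unescaped = re.findall(r"d{1,3}. ", text)

# Escaped: 1-3 digits, then a literal dot, then a space.
escaped = re.findall(r"\d{1,3}\. ", text)

print(unescaped)  # []
print(escaped)    # ['134. ', '135. ']
```

Because the unescaped pattern finds nothing to split on here, re.split with it returns the whole string as a single list element, which is consistent with the output shown in the question.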


Example

import re

pattern = r"\b\d{1,3}\. .*?(?=\b\d{1,3}\. |$)"
s = "134. Lorem Ipsum is simply dummy text of the printing and typesetting industry 135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book 136. It has survived not only five centuries"

print(re.findall(pattern, s))

Output

[
'134. Lorem Ipsum is simply dummy text of the printing and typesetting industry ',
"135. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book ",
'136. It has survived not only five centuries'
]
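The first two items end with a trailing space because the lazy .*? runs right up to the next number. If that matters, one option is to normalise whitespace first and strip each match. This is only a sketch: the helper name split_numbered and the sample text are hypothetical, and it assumes the real input (file_read in the question) may contain line breaks, which . does not match by default.

```python
import re

NUMBERED_ITEM = re.compile(r"\b\d{1,3}\. .*?(?=\b\d{1,3}\. |$)")

def split_numbered(text):
    """Return the numbered items in text as a list of clean strings."""
    # Collapse runs of whitespace (including newlines) to single spaces,
    # so that "." can match across what used to be line breaks.
    normalized = re.sub(r"\s+", " ", text).strip()
    # Strip the trailing space that the lazy match leaves on each item.
    return [item.strip() for item in NUMBERED_ITEM.findall(normalized)]

sample = "134. First item\nwrapped over two lines 135. Second item\n136. Third item\n"
items = split_numbered(sample)
print(items)
# ['134. First item wrapped over two lines', '135. Second item', '136. Third item']
```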