Split text using strings in a list

Time:10-07

I'm trying to split this text at the words 'Three', 'Seven', 'Nine' and 'One':

Three Rings for the Elven-kings under the sky
Seven for the Dwarf-lords in their halls of stone
Nine for Mortal Men doomed to die
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie
One Ring to rule them all, One Ring to find them
One Ring to bring them all and in the darkness bind them
In the Land of Mordor where the Shadows lie

My intention is to first search the text for these words, save them in a list, and then use the elements of that list to split the text. I know it's unusual, but I would like to know whether it is possible to split the text this way instead of hard-coding the words in a regex.

I have managed to do the first part, so I have a list like this:

['Three', 'Seven', 'Nine', 'One']

But when I split the text using the words in the list, this is the output:

['', 'Three', ' Rings for the Elven-kings under the sky\n', 'Seven', ' for the Dwarf-lords in their halls of stone\n', 'Nine', ' for Mortal Men doomed to die\n', 'One', ' for the Dark Lord on his dark throne\nIn the Land of Mordor where the Shadows lie\n', 'One', ' Ring to rule them all, ', 'One', ' Ring to find them\n', 'One', ' Ring to bring them all and in the darkness bind them\nIn the Land of Mordor where the Shadows lie']

Is there a way to do the split using the words in the list and keep the separator? My desired output would be this:

['Three Rings for the Elven-kings under the sky',
'Seven for the Dwarf-lords in their halls of stone',
'Nine for Mortal Men doomed to die',
'One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie',
'One Ring to rule them all, One Ring to find them',
'One Ring to bring them all and in the darkness bind them
In the Land of Mordor where the Shadows lie']

Thank you!

CodePudding user response:

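You can split on a zero-width lookahead. It matches the position just before each word without consuming anything, so the word stays in the result (splitting on an empty match needs Python 3.7 or later).

re.split(r'(?=\b(?:Three|Seven|Nine|One)\b)', text)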

result:

['', 
 'Three Rings for the Elven-kings under the sky\n', 
 'Seven for the Dwarf-lords in their halls of stone\n', 
 'Nine for Mortal Men doomed to die\n', 
 'One for the Dark Lord on his dark throne\nIn the Land of Mordor where the Shadows lie\n', 
 'One Ring to rule them all, ', 
 'One Ring to find them\n', 
 'One Ring to bring them all and in the darkness bind them\nIn the Land of Mordor where the Shadows lie'
]
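
Since the question asks to split with the words held in a list, here is a minimal sketch of the same lookahead split with the pattern built from that list. The names words, alternation and parts are only illustrative, and filtering out the leading empty string is my own addition, not part of the answer above.

```python
import re

# The words found earlier (the list from the question)
words = ['Three', 'Seven', 'Nine', 'One']

text = (
    "Three Rings for the Elven-kings under the sky\n"
    "Seven for the Dwarf-lords in their halls of stone\n"
    "Nine for Mortal Men doomed to die\n"
    "One for the Dark Lord on his dark throne\n"
    "In the Land of Mordor where the Shadows lie\n"
    "One Ring to rule them all, One Ring to find them\n"
    "One Ring to bring them all and in the darkness bind them\n"
    "In the Land of Mordor where the Shadows lie"
)

# Escape each word in case it contains regex metacharacters, then join with |
alternation = '|'.join(re.escape(w) for w in words)

# Zero-width lookahead: split *before* each whole word, consuming nothing
pattern = rf'(?=\b(?:{alternation})\b)'

# Drop the empty string produced because the text starts with a separator
parts = [p for p in re.split(pattern, text) if p]

for p in parts:
    print(repr(p))
```

Note that this still splits at every 'One', including the second one on the "rule them all" line, so it gives seven pieces rather than the six in the desired output.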

Or you can keep the solution that returned your result and concatenate each separator with the string that follows it.
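
A rough sketch of that second option. I am assuming the output in the question came from re.split with a capturing group around the words (the question does not show the code), and the names pieces and parts are hypothetical.

```python
import re

words = ['Three', 'Seven', 'Nine', 'One']

text = (
    "Three Rings for the Elven-kings under the sky\n"
    "Seven for the Dwarf-lords in their halls of stone\n"
    "Nine for Mortal Men doomed to die\n"
    "One for the Dark Lord on his dark throne\n"
    "In the Land of Mordor where the Shadows lie\n"
    "One Ring to rule them all, One Ring to find them\n"
    "One Ring to bring them all and in the darkness bind them\n"
    "In the Land of Mordor where the Shadows lie"
)

alternation = '|'.join(re.escape(w) for w in words)

# A capturing group makes re.split keep the separators in the result:
# [before, sep1, after1, sep2, after2, ...]
pieces = re.split(rf'\b({alternation})\b', text)

# Separators sit at odd indexes, the text following each one at even indexes.
# pieces[0] (the text before the first separator) is empty here and is skipped.
parts = [sep + rest for sep, rest in zip(pieces[1::2], pieces[2::2])]

for p in parts:
    print(repr(p))
```

If the text could start with something other than one of the words, pieces[0] would hold that leading text and would need to be kept separately.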

CodePudding user response:

You may use this regex in re.findall (pass re.DOTALL so that . also matches line breaks):

.+?(?:\n(?=(?:Three|Seven|Nine|One) )|\Z)

RegEx Breakup:

  • .+?: Match 1 or more of any character (lazy)
  • (?:: Start non-capture group
    • \n(?=(?:Three|Seven|Nine|One) ): Match line break followed by one of the given words and then a space
    • |: OR
    • \Z: Match end of input
  • ): End non-capture group
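
Putting the pieces above together, a minimal sketch of the findall call might look like this. Building the alternation from a list and stripping the trailing line break from each match are my own additions; the names words, matches and parts are illustrative.

```python
import re

words = ['Three', 'Seven', 'Nine', 'One']

text = (
    "Three Rings for the Elven-kings under the sky\n"
    "Seven for the Dwarf-lords in their halls of stone\n"
    "Nine for Mortal Men doomed to die\n"
    "One for the Dark Lord on his dark throne\n"
    "In the Land of Mordor where the Shadows lie\n"
    "One Ring to rule them all, One Ring to find them\n"
    "One Ring to bring them all and in the darkness bind them\n"
    "In the Land of Mordor where the Shadows lie"
)

alternation = '|'.join(re.escape(w) for w in words)

# Lazily match up to a line break that is followed by one of the words
# and a space, or up to the end of the input
pattern = rf'.+?(?:\n(?=(?:{alternation}) )|\Z)'

# re.DOTALL lets . cross line breaks, which is needed for chunks that
# span two lines (e.g. "One for the Dark Lord ..." plus "In the Land ...")
matches = re.findall(pattern, text, flags=re.DOTALL)

# Each match except the last ends with the line break it consumed
parts = [m.rstrip('\n') for m in matches]

for p in parts:
    print(repr(p))
```

Because this only breaks at a line break followed by one of the words, the second 'One' in the middle of the "rule them all" line does not start a new chunk, which is what the desired output in the question asks for.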