Most efficient way to split substrings into individual words

Time:11-16

I have several strings that I want to loop through and tokenize, for example:

ImagesCarrier FreeCatalog #AvailabilitySize / PriceQty240-B-001MG/CF240-B-002/CF240-B-010/CF240-B-500/CFWith CarrierCatalog #AvailabilitySize / PriceQty240-B-002240-B-010Request a Quote

All of the string-splitting methods I have tried so far give me output that isn't quite right. For example:

import re
s = "ImagesCarrier FreeCatalog #AvailabilitySize / PriceQty240-B-001MG/CF240-B-002/CF240-B-010/CF240-B-500/CFWith CarrierCatalog #AvailabilitySize / PriceQty240-B-002240-B-010Request a Quote"

# print the original string
print("The original string is : " + str(s))

# use sub() with a lambda to add a space on each side of every run of letters
res = re.sub("[A-Za-z]+", lambda ele: " " + ele[0] + " ", s)

# print the result
print("The space added string : " + str(res))


The original string is : ImagesCarrier FreeCatalog #AvailabilitySize / PriceQty240-B-001MG/CF240-B-002/CF240-B-010/CF240-B-500/CFWith CarrierCatalog #AvailabilitySize / PriceQty240-B-002240-B-010Request a Quote

The space added string :  ImagesCarrier   FreeCatalog  # AvailabilitySize  /  PriceQty 240- B -001 MG / CF 240- B -002/ CF 240- B -010/ CF 240- B -500/ CFWith   CarrierCatalog  # AvailabilitySize  /  PriceQty 240- B -002240- B -010 Request   a   Quote 

But as you can see, some words are still fused together, such as PriceQty, AvailabilitySize, and ImagesCarrier FreeCatalog. Is there a better way to do this, or a way to specify a keyword list so that the string is scanned and split wherever a keyword matches? Ideally I would like to end up with something like this:

Images Carrier Free Catalog # Availability Size / Price Qty 240-B-001MG/CF 240-B-002/CF 240-B-010/CF 240-B-500/CF With Carrier Catalog # Availability Size / Price Qty 240-B-002 240-B-010 Request a Quote

CodePudding user response:

You can use lookarounds:

re.sub(r'(?<=\S)(?=[A-Z][a-z])|(?<=[A-Za-z])(?=\d)', ' ', s)
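
To make the pattern easier to read, here is a small runnable sketch that splits it into its two alternatives. Both are zero-width: the first matches the position before a capitalized word (an uppercase letter followed by a lowercase one) when the previous character is not whitespace, and the second matches the position between a letter and a digit. The names BOUNDARY and split_fused_words are made up for this illustration, and the sample is a shortened piece of the string from the question.

```python
import re

# The answer's pattern, written as two commented alternatives.
BOUNDARY = re.compile(
    r"(?<=\S)(?=[A-Z][a-z])"   # before a capitalized word that directly follows a non-space character
    r"|(?<=[A-Za-z])(?=\d)"    # between a letter and a digit
)

def split_fused_words(text):
    """Insert a space at every boundary matched by BOUNDARY (hypothetical helper)."""
    return BOUNDARY.sub(" ", text)

sample = "ImagesCarrier FreeCatalog #AvailabilitySize / PriceQty240-B-001MG/CF240-B-002/CF"
print(split_fused_words(sample))
# Images Carrier Free Catalog # Availability Size / Price Qty 240-B-001MG/CF 240-B-002/CF
```

Because nothing is consumed, no characters are lost and existing spaces are not doubled. All-caps runs such as MG and CF are left alone, since the first alternative requires a lowercase letter after the capital.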

Demo: https://replit.com/@blhsing/DownrightTwinTelephones
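
One case the pattern above does not cover is two catalog numbers that run together, such as 240-B-002240-B-010, because there is no letter or capitalized word at that boundary. The sketch below adds a third alternative for it. This is only a guess at the catalog format: it assumes every catalog number ends in a hyphen plus three digits and starts with three digits plus a hyphen, which holds for the sample in the question but may not hold for other data.

```python
import re

s = "PriceQty240-B-002240-B-010Request a Quote"

# Pattern from the answer, unchanged.
answer_pattern = r"(?<=\S)(?=[A-Z][a-z])|(?<=[A-Za-z])(?=\d)"
print(re.sub(answer_pattern, " ", s))
# Price Qty 240-B-002240-B-010 Request a Quote

# Assumed catalog format: "-ddd" ends one number and "ddd-" starts the next.
extended = answer_pattern + r"|(?<=-\d{3})(?=\d{3}-)"
print(re.sub(extended, " ", s))
# Price Qty 240-B-002 240-B-010 Request a Quote
```

The lookbehind has a fixed width of four characters, which Python's re module requires. If the real catalog numbers vary in length, this alternative would need to be adjusted.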
