I have several strings like the one below that I want to loop through and tokenize:
ImagesCarrier FreeCatalog #AvailabilitySize / PriceQty240-B-001MG/CF240-B-002/CF240-B-010/CF240-B-500/CFWith CarrierCatalog #AvailabilitySize / PriceQty240-B-002240-B-010Request a Quote
All of the string-splitting methods I have tried so far give me output that isn't quite right. For example:
import re
s = "ImagesCarrier FreeCatalog #AvailabilitySize / PriceQty240-B-001MG/CF240-B-002/CF240-B-010/CF240-B-500/CFWith CarrierCatalog #AvailabilitySize / PriceQty240-B-002240-B-010Request a Quote"
# print the original string
print("The original string is : " + str(s))
# use sub() with a lambda to add spaces around each run of letters
res = re.sub("[A-Za-z]+", lambda ele: " " + ele[0] + " ", s)
# print the result
print("The space added string : " + str(res))
The original string is : ImagesCarrier FreeCatalog #AvailabilitySize / PriceQty240-B-001MG/CF240-B-002/CF240-B-010/CF240-B-500/CFWith CarrierCatalog #AvailabilitySize / PriceQty240-B-002240-B-010Request a Quote
The space added string : ImagesCarrier FreeCatalog # AvailabilitySize / PriceQty 240- B -001 MG / CF 240- B -002/ CF 240- B -010/ CF 240- B -500/ CFWith CarrierCatalog # AvailabilitySize / PriceQty 240- B -002240- B -010 Request a Quote
But as you can see, some words are still not split, such as PriceQty, AvailabilitySize, and ImagesCarrier FreeCatalog. Is there a better way to do this, or a way to specify a keyword list that is checked against the string so that it is split wherever a keyword matches? Ideally I would like to end up with something like this:
Images Carrier Free Catalog # Availability Size / Price Qty 240-B-001MG/CF 240-B-002/CF 240-B-010/CF 240-B-500/CF With Carrier Catalog # Availability Size / Price Qty 240-B-002 240-B-010 Request a Quote
Answer:
You can use lookarounds to match the zero-width positions where a new token starts and put a space there:
re.sub(r'(?<=\S)(?=[A-Z][a-z])|(?<=[A-Za-z])(?=\d)', ' ', s)
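To make the pattern easier to read, here is a sketch of the same substitution written in verbose mode and run against the sample string. The first two alternatives are the ones from the answer. On their own they leave `240-B-002240-B-010` unsplit, because that boundary sits between two digits. The third alternative is an addition of mine and rests on an assumption: that every catalog number starts with three digits, a hyphen, one uppercase letter, and a hyphen. The names `SPLIT_POINTS`, `spaced`, and `tokens` are only illustrative.

```python
import re

s = ("ImagesCarrier FreeCatalog #AvailabilitySize / PriceQty240-B-001MG/CF"
     "240-B-002/CF240-B-010/CF240-B-500/CFWith CarrierCatalog "
     "#AvailabilitySize / PriceQty240-B-002240-B-010Request a Quote")

# Each alternative matches an empty position, so sub() only inserts a space
# and never consumes characters from the string.
SPLIT_POINTS = re.compile(r"""
      (?<=\S)(?=[A-Z][a-z])      # before a capitalised word glued to the previous token
    | (?<=[A-Za-z])(?=\d)        # between a letter and a following digit
    | (?<=\d)(?=\d{3}-[A-Z]-)    # ASSUMPTION: catalog numbers look like 240-B-...
""", re.VERBOSE)

spaced = SPLIT_POINTS.sub(" ", s)
print(spaced)

# Splitting on whitespace afterwards gives the individual tokens.
tokens = spaced.split()
print(tokens[:4])
```

If your catalog numbers follow a different shape, adjust or drop the third alternative; the other two do not depend on it.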