I'm cleaning some data and need to remove specific occurrences of a comma from strings. Any comma that appears between a surname and its initials needs removing, e.g.
Summers, B., Rosenberg, W., Giles, R., Harris, A., Modern Advances in Patrolling: The Evolution of Mr Pointy, (1997), The Watchers Council Journal 5(5), pp. 5-55
would need to appear as:
Summers B., Rosenberg W., Giles R., Harris A., Modern Advances in Patrolling: The Evolution of Mr Pointy, (1997), The Watchers Council Journal 5(5), pp. 5-55
The regex I have for finding the offending pattern is:
pattern = r',\s([A-Z]\.)+'
Is there a way to get the indexes of the beginning of a match within the string, in other words the location of the problematic commas?
If it's easier, I also have the regex for the match we want:
initials = r'([A-Z](\.)?)+'
alphaWord = r'[a-zA-Z][a-z]+'
name = rf'({alphaWord})(\b({alphaWord}))*'
citeName = rf'({name})\s({initials})\.'
CodePudding user response:
That's going to be difficult to do, since you have other comma-separated info that does not need the fix. Judging from your example, can you just replace based on the capital letter and '.' instead?
Search for <comma space capital_letter period comma>
Replace with <space capital_letter period comma>
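As a rough sketch of that search and replace in Python, assuming every author has exactly one initial (the function name `fix_initials` is only for illustration):

```python
import re

def fix_initials(citation: str) -> str:
    # Search: comma, whitespace, capital letter, period, comma.
    # Replace: a space plus the captured "X.," so only the first comma is dropped.
    return re.sub(r',\s([A-Z]\.,)', r' \1', citation)

st = ("Summers, B., Rosenberg, W., Giles, R., Harris, A., "
      "Modern Advances in Patrolling: The Evolution of Mr Pointy, (1997), "
      "The Watchers Council Journal 5(5), pp. 5-55")
print(fix_initials(st))
```

Names with several initials (for example "B.A.") would not be matched by this pattern, so it would need widening for that case.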
CodePudding user response:
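import re
st = "Summers, B., Rosenberg, W., Giles, R., Harris, A., Modern Advances in Patrolling: The Evolution of Mr Pointy, (1997), The Watchers Council Journal 5(5), pp. 5-55"
# Capture a capitalised word that is directly followed by a comma
p = re.compile(r"([A-Z][a-z]+),")
# Keep the captured word and drop the comma
print(re.sub(p, r'\1', st))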
Summers B., Rosenberg W., Giles R., Harris A., Modern Advances in Patrolling: The Evolution of Mr Pointy (1997), The Watchers Council Journal 5(5), pp. 5-55
for m in p.finditer(st):
    print(m.start(), m.end(), m.group())
0 8 Summers,
13 23 Rosenberg,
28 34 Giles,
39 46 Harris,
102 109 Pointy,
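Note that the pattern above also catches "Pointy," in the title, so that comma gets removed as well. One possible way around it, assuming an initial is always a single capital letter followed by a period, is to add a lookahead. The sketch below also collects the index of each offending comma (`comma_positions` and `fixed` are just illustrative names):

```python
import re

st = ("Summers, B., Rosenberg, W., Giles, R., Harris, A., "
      "Modern Advances in Patrolling: The Evolution of Mr Pointy, (1997), "
      "The Watchers Council Journal 5(5), pp. 5-55")

# A capitalised word and its comma, matched only when the comma is
# followed by whitespace and an initial such as "B."
p = re.compile(r"([A-Z][a-z]+),(?=\s[A-Z]\.)")

# The lookahead consumes nothing, so each match ends right after the comma
# and m.end() - 1 is the index of the comma itself.
comma_positions = [m.end() - 1 for m in p.finditer(st)]
print(comma_positions)  # [7, 22, 33, 45]

# Keep the surname, drop the comma
fixed = p.sub(r"\1", st)
print(fixed)
```

Surnames that do not fit `[A-Z][a-z]+` (hyphenated or apostrophised names, for instance) would need a broader pattern.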