I want to count and remove consecutive repeated words in a string or in df rows.
***input :***
str = "ettim ettim deneme karar verdim verdim buna buna buna"
***output :***
output = "ettim deneme karar verdim buna"
output2 = {"ettim": 2, "verdim": 2, "buna": 3}
How can I do this quickly, with regex or something else?
Thanks
CodePudding user response:
Try:
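import regex as re  # third-party module; the standard-library `re` also handles this pattern

s = 'ettim ettim deneme karar verdim verdim buna buna buna'
rgx = re.compile(r'(?<!\S)(\S+)(?:\s+\1)*(?!\S)', re.I)
# collapse every run of a repeated word to its first occurrence
output1 = re.sub(rgx, r'\1', s)
output2 = {}
for i in re.finditer(rgx, s):
    # group(0) is the whole run, group(1) the single word; they differ only for real repeats
    if i.group(1) != i.group(0):
        output2[i.group(1)] = len(re.split(r'\s+', i.group(0)))
print(output1)
print(output2)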
Prints:
ettim deneme karar verdim buna
{'ettim': 2, 'verdim': 2, 'buna': 3}
The core of the idea above is to use re.compile(r'(?<!\S)(\S+)(?:\s+\1)*(?!\S)', re.I) to match consecutive repeated words case-insensitively:
(?<!\S) - Negative lookbehind to assert the position is not preceded by a non-whitespace character;
(\S+) - 1st capture group to match 1+ non-whitespace characters;
(?:\s+\1)* - Match 0+ times a non-capture group holding 1+ whitespace characters and a backreference to what was matched previously in the 1st group;
(?!\S) - Negative lookahead to assert the position is not followed by a non-whitespace character.
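To see what the lookarounds and the re.I flag contribute, here is a minimal sketch that wraps the pattern in a helper. The name collapse_repeats is made up for illustration, and the sketch assumes the standard-library re module rather than regex, which should behave the same for this pattern:

```python
import re

# Same pattern as above, compiled with the standard-library `re` module.
RGX = re.compile(r'(?<!\S)(\S+)(?:\s+\1)*(?!\S)', re.I)

def collapse_repeats(text):
    """Replace each run of a repeated word with its first occurrence."""
    return RGX.sub(r'\1', text)

# Case-insensitive: "The" is treated as a repeat of "the".
# Whole words only: the "the" at the start of "theme" is left alone.
print(collapse_repeats('the The theme'))
```

Without the trailing (?!\S), the backreference could also swallow the first three letters of "theme".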
EDIT: I noticed that if the same word is repeated consecutively in more than one place in the text, you may end up overwriting your dictionary's values. To prevent that, I changed the keys a bit:
import regex as re

s = 'ettim ettim buna buna deneme karar verdim verdim buna buna buna'
rgx = re.compile(r'\b(\S+)(?:\s+\1)*\b', re.I)
output1 = re.sub(rgx, r'\1', s)
output2 = {}
c = 0  # running index of the repeated runs found so far
for i in re.finditer(rgx, s):
    if i.group(1) != i.group(0):
        c = c + 1
        # prefix the key with the run index so later runs don't overwrite earlier ones
        output2[str(c) + "-" + i.group(1)] = len(re.split(r'\s+', i.group(0)))
print(output1)
print(output2)
Prints:
ettim buna deneme karar verdim buna
{'1-ettim': 2, '2-buna': 2, '3-verdim': 2, '4-buna': 3}
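If you would rather not encode the position into the key, another option is to collect the runs as a list of (word, count) pairs, which keeps repeated runs apart and in order. A rough sketch under a few assumptions: count_runs is a hypothetical name, the standard re module is used, and the pattern is the lookaround one from the first snippet:

```python
import re

RGX = re.compile(r'(?<!\S)(\S+)(?:\s+\1)*(?!\S)', re.I)

def count_runs(text):
    """Return (word, run_length) for every run longer than one word."""
    runs = []
    for m in RGX.finditer(text):
        n = len(m.group(0).split())  # number of words in the whole run
        if n > 1:
            runs.append((m.group(1), n))
    return runs

s = 'ettim ettim buna buna deneme karar verdim verdim buna buna buna'
print(count_runs(s))
# [('ettim', 2), ('buna', 2), ('verdim', 2), ('buna', 3)]
```

A list of tuples can hold 'buna' twice, which a plain dict cannot.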
CodePudding user response:
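You can split the string into a list of words (let's name it "list_of_words") and then use groupby from itertools:
from itertools import groupby
list_of_words = "ettim ettim deneme karar verdim verdim buna buna buna".split()
result = [i[0] for i in groupby(list_of_words)]
print("result is: " + str(result))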
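The same groupby call can also give the counts, since each group holds the consecutive equal words. A minimal sketch, assuming an exact (case-sensitive) comparison of words; the variable names are only illustrative:

```python
from itertools import groupby

s = 'ettim ettim deneme karar verdim verdim buna buna buna'
words = s.split()

# groupby yields (key, iterator over the consecutive equal items)
runs = [(word, sum(1 for _ in group)) for word, group in groupby(words)]

deduped = ' '.join(word for word, _ in runs)
counts = {word: n for word, n in runs if n > 1}

print(deduped)  # ettim deneme karar verdim buna
print(counts)   # {'ettim': 2, 'verdim': 2, 'buna': 3}
```

If a word is repeated in two separate places, the dict keeps only the last run's count; keep the runs list instead if that matters.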
CodePudding user response:
You can split the string on spaces, then add the words to an OrderedDict().
from collections import OrderedDict

text = "ettim ettim deneme karar verdim verdim buna buna buna"
separated = text.split(" ")
od = OrderedDict()
for sep in separated:
    if sep in od:
        od[sep] += 1  # word seen before: bump its count
    else:
        od[sep] = 1   # first time this word appears
print(od)
distinctStr = ""
for word in od:
    distinctStr += word + " "
# remove last space character
distinctStr = distinctStr.rstrip(distinctStr[-1])
print(distinctStr)
Output:
OrderedDict([('ettim', 2), ('deneme', 1), ('karar', 1), ('verdim', 2), ('buna', 3)])
ettim deneme karar verdim buna
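One thing to keep in mind: counting this way tallies every occurrence of a word, not only consecutive repeats. For the sample string the two give the same result, but they can differ. A small sketch of the difference, using a made-up input that is not from the question:

```python
from itertools import groupby

text = 'buna ettim buna buna'  # hypothetical input for illustration
words = text.split()

# Order-preserving de-duplication: every later occurrence is dropped.
all_dedup = ' '.join(dict.fromkeys(words))

# Collapsing consecutive runs only: the separate 'buna' run survives.
run_dedup = ' '.join(word for word, _ in groupby(words))

print(all_dedup)  # buna ettim
print(run_dedup)  # buna ettim buna
```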
CodePudding user response:
Try:
Importing
from collections import Counter
Strings
text = "ettim ettim deneme karar verdim verdim buna buna buna"
strings = text.split()
Counter
c = dict(Counter(strings))
print(c)
Output
{'ettim': 2, 'deneme': 1, 'karar': 1, 'verdim': 2, 'buna': 3}