I want to count and remove consecutive repeated words in a string or in DataFrame rows.

***input :***    
str = "ettim ettim deneme karar verdim verdim buna buna buna"

***output :***

output = "ettim deneme karar verdim buna"

output2 = {"ettim": 2, "verdim": 2, "buna": 3}

How can I do this quickly, with regex or some other method?

Thanks

CodePudding user response：

Try:

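import regex as re  # third-party package; the standard-library re also handles this pattern

s = 'ettim ettim deneme karar verdim verdim buna buna buna'
# Capture a word, then any number of whitespace-separated repeats of it
rgx = re.compile(r'(?<!\S)(\S+)(?:\s+\1)*(?!\S)', re.I)

# Collapse every run of repeats to a single word
output1 = re.sub(rgx, r'\1', s)
output2 = {}

for i in re.finditer(rgx, s):
    # Only record words that were actually repeated
    if i.group(1) != i.group(0):
        output2[i.group(1)] = len(re.split(r'\s+', i.group(0)))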

print(output1)
print(output2)

Prints:

ettim deneme karar verdim buna
{'ettim': 2, 'verdim': 2, 'buna': 3}

The core of the idea above is to use re.compile(r'(?<!\S)(\S+)(?:\s+\1)*(?!\S)', re.I) to match consecutive repeated words case-insensitively.

(?<!\S) - Negative lookbehind asserting that the position is not preceded by a non-whitespace character;
(\S+) - 1st capture group matching 1+ non-whitespace characters;
(?:\s+\1)* - Non-capture group, matched 0+ times, holding 1+ whitespace characters and a backreference to what the 1st group matched;
(?!\S) - Negative lookahead asserting that the position is not followed by a non-whitespace character.
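
To see what each piece of the pattern contributes, here is a small sketch. It assumes the standard-library re module is enough for this pattern (rather than the third-party regex package), and the helper name collapse_repeats is made up for illustration.

```python
import re

# Same idea as the pattern above, compiled with the stdlib `re` (an assumption).
RGX = re.compile(r'(?<!\S)(\S+)(?:\s+\1)*(?!\S)', re.I)

def collapse_repeats(text):
    """Hypothetical helper: collapse each run of repeated words to its first word."""
    return RGX.sub(r'\1', text)

# re.I makes the backreference case-insensitive; the first spelling is kept.
print(collapse_repeats('buna Buna BUNA karar'))   # buna karar

# The (?!\S) lookahead stops "buna" from matching the start of "bunalar".
print(collapse_repeats('buna bunalar'))           # buna bunalar

# \s+ accepts any run of whitespace between the repeats.
print(collapse_repeats('ettim   ettim\tettim'))   # ettim
```

Without the lookarounds, a word that is merely a prefix of the next word could be treated as a repeat, which is why both boundaries are asserted.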

EDIT: I did notice that if the same consecutive words occur multiple times in the same text you may end up overwriting your dictionary's values. To stop that I edited the keys a bit:

import regex as re

s = 'ettim ettim buna buna deneme karar verdim verdim buna buna buna'
rgx = re.compile(r'\b(\S+)(?:\s+\1)*\b', re.I)

output1 = re.sub(rgx, r'\1', s)
output2 = {}
c = 0  # running index so that each run gets a unique key

for i in re.finditer(rgx, s):
    if i.group(1) != i.group(0):
        c = c + 1
        output2[str(c) + "-" + i.group(1)] = len(re.split(r'\s+', i.group(0)))


print(output1)
print(output2)

Prints:

ettim buna deneme karar verdim buna
{'1-ettim': 2, '2-buna': 2, '3-verdim': 2, '4-buna': 3}

CodePudding user response：

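You can split the string into a list of words (let's name it list_of_words) and then use itertools.groupby, which groups consecutive equal items:

from itertools import groupby

list_of_words = "ettim ettim deneme karar verdim verdim buna buna buna".split()
result = [i[0] for i in groupby(list_of_words)]

print("result is:" + str(result))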
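
The groupby idea can also produce the counts the question asks for. The following is a minimal sketch, not a definitive implementation: the function name dedupe_and_count is hypothetical, and unlike the regex answer it compares words case-sensitively. It returns a list of (word, count) pairs instead of a dict so that a word repeated in two separate runs is not overwritten.

```python
from itertools import groupby

def dedupe_and_count(text):
    """Hypothetical helper: return (deduplicated string, [(word, run_length), ...])."""
    words = text.split()
    # groupby yields one group per run of consecutive equal words
    runs = [(word, len(list(group))) for word, group in groupby(words)]
    deduped = " ".join(word for word, _ in runs)
    # keep only the words that actually repeated
    repeated = [(word, n) for word, n in runs if n > 1]
    return deduped, repeated

deduped, repeated = dedupe_and_count(
    "ettim ettim deneme karar verdim verdim buna buna buna"
)
print(deduped)   # ettim deneme karar verdim buna
print(repeated)  # [('ettim', 2), ('verdim', 2), ('buna', 3)]
```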

CodePudding user response：

You can split the string on spaces, then add the words to an OrderedDict().

from collections import OrderedDict

str = "ettim ettim deneme karar verdim verdim buna buna buna"

separated = str.split(" ")

od = OrderedDict()

for sep in separated:
    if sep in od:
        od[sep] += 1
    else:
        od[sep] = 1

print(od)        

distinctStr = ""
for word in od:
    distinctStr += word + " "

# remove last space character
distinctStr = distinctStr.rstrip(distinctStr[-1])
    
print(distinctStr)

Output:

OrderedDict([('ettim', 2), ('deneme', 1), ('karar', 1), ('verdim', 2), ('buna', 3)])
ettim deneme karar verdim buna
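
One thing to be aware of: a dictionary tally counts every occurrence of a word, not only consecutive ones, so the result differs when a word comes back later in the text. The sketch below illustrates this with the longer sample string used earlier in the thread; the variable names are illustrative only.

```python
from collections import OrderedDict
from itertools import groupby

text = "ettim ettim buna buna deneme karar verdim verdim buna buna buna"
words = text.split()

# Dictionary tally: one entry per distinct word, counting all occurrences
totals = OrderedDict()
for w in words:
    totals[w] = totals.get(w, 0) + 1

# Run-based tally: one entry per run of consecutive equal words
runs = [(w, len(list(g))) for w, g in groupby(words)]

print(" ".join(totals))              # ettim buna deneme karar verdim
print(" ".join(w for w, _ in runs))  # ettim buna deneme karar verdim buna
print(totals["buna"])                # 5
```

If only consecutive repeats should be removed, the run-based result is the one that matches the expected output in the question.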

CodePudding user response：

Try this:

Importing

from collections import Counter

Strings

str = "ettim ettim deneme karar verdim verdim buna buna buna"
strings = str.split()

Counter

c = dict(Counter(strings))
print(c)

Output

{'ettim': 2, 'deneme': 1, 'karar': 1, 'verdim': 2, 'buna': 3}