Search of sentences ending with specific marks and frequency histogram

Time:04-02

I tried to make a frequency histogram of sentences ending with exclamation marks, question marks, and dots (I simply counted the number of these characters in the text). The text is read from a file. My code so far looks like this:

import matplotlib.pyplot as plt
 
text_file = 'text.txt'
 
marks = '?!.'
lcount = dict([(l, 0) for l in marks])
 
 
for l in open(text_file, encoding='utf8').read():
    try:
        lcount[l.upper()] += 1
    except KeyError:
        pass
norm = sum(lcount.values())
 
fig = plt.figure()
ax = fig.add_subplot(111)
x = range(3)
ax.bar(x, [lcount[l]/norm * 100 for l in marks], width=0.8,
       color='g', alpha=0.5, align='center')
ax.set_xticks(x)
ax.set_xticklabels(marks)
ax.tick_params(axis='x', direction='out')
ax.set_xlim(-0.5, 2.5)
ax.yaxis.grid(True)
ax.set_ylabel('Sentences with different ending frequency, %')
plt.show()

But I can't handle sentences that end with an ellipsis ("..."): my code counts it as three characters, so three sentences. Moreover, this counts symbols, not actual sentences. How can I count sentences rather than marks, including sentences that end with an ellipsis? Example of the file: Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this.

CodePudding user response:

You could try splitting the text into sentences using a regex. The re.split() function works fine here. Sample code:

import re
string = "Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this."
# A run of one or more end marks (so "..." is consumed as one terminator)
# followed by optional whitespace ends a sentence.
print(re.split(r'[.!?]+\s*', string))

Output:

['Wanna play', "Let's go", 'It will be definitely good', 'My friend also think so', "However, I don't know", "I don't like this", '']
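Note that re.split() discards the terminators, so you can't tell which mark ended each sentence. A minimal alternative sketch: use re.findall() with the ellipsis listed first in the alternation, so "..." counts as a single ending rather than three dots:

```python
import re

text = ("Wanna play? Let's go! It will be definitely good! "
        "My friend also think so. However, I don't know... I don't like this.")

# '\.\.\.' comes first in the alternation so an ellipsis is matched
# as one sentence ending instead of three separate dots.
endings = re.findall(r'\.\.\.|[.!?]', text)
counts = {mark: endings.count(mark) for mark in ('?', '!', '...', '.')}
print(counts)  # {'?': 1, '!': 2, '...': 1, '.': 2}
```

`len(endings)` then gives the total number of sentences, which can serve as the normalisation factor for the percentage histogram.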

CodePudding user response:

Judging by your example, and following a common way of handling words in NLP (with regard to lemmatization etc.), you can first split the sentences on spaces:

marks = ['?', '!', '...', '.'] # Ordering ellipsis first gives priority to checking that condition
mark_count = {}
sentences = 0

with open('text.txt', encoding='utf8') as file:
    for example in file:
        words = example.split(" ") # This gives an array of all words together with your symbols

        # Finds markings in word
        for word in words:
            for mark in marks:
                if mark in word:
                    # A word like "know..." matches '...' before '.' because of the
                    # ordering above; we count the sentence and break, as we know
                    # that's the end of the sentence.
                    sentences += 1

                    if mark_count.get(mark, False):
                        mark_count[mark] += 1
                    else:
                        mark_count[mark] = 1

                    break

Edit:

To guarantee that labels and values stay correctly paired, build both lists from the same iteration:

columns = []
values = []
for key in mark_count.keys():
    columns.append(key)
    values.append(mark_count[key])

plt.bar(columns, values, color='green')
plt.show()
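Putting the pieces together with the percentage axis from the question, a minimal end-to-end sketch (the text is inlined here for clarity; in the real script it would come from reading text.txt):

```python
import re
import matplotlib.pyplot as plt

text = ("Wanna play? Let's go! It will be definitely good! "
        "My friend also think so. However, I don't know... I don't like this.")

marks = ['?', '!', '...', '.']  # ellipsis before '.' so it wins the alternation
endings = re.findall(r'\.\.\.|[.!?]', text)
counts = {m: endings.count(m) for m in marks}
total = sum(counts.values()) or 1  # guard against empty text

fig, ax = plt.subplots()
ax.bar(marks, [counts[m] / total * 100 for m in marks],
       color='g', alpha=0.5)
ax.yaxis.grid(True)
ax.set_ylabel('Sentences with different ending frequency, %')
plt.show()
```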