I am trying to make a frequency histogram of the sentences in a text that end with an exclamation mark, a question mark, or a full stop (for now I just count how many times each of these characters occurs). The text is read from a file. My code looks like this:
import matplotlib.pyplot as plt
text_file = 'text.txt'
marks = '?!.'
lcount = dict([(l, 0) for l in marks])
# Count every occurrence of each mark (characters, not sentences)
for l in open(text_file, encoding='utf8').read():
    try:
        lcount[l.upper()] += 1
    except KeyError:
        pass
norm = sum(lcount.values())
fig = plt.figure()
ax = fig.add_subplot(111)
x = range(3)
ax.bar(x, [lcount[l]/norm * 100 for l in marks], width=0.8,
       color='g', alpha=0.5, align='center')
ax.set_xticks(x)
ax.set_xticklabels(marks)
ax.tick_params(axis='x', direction='out')
ax.set_xlim(-0.5, 2.5)
ax.yaxis.grid(True)
ax.set_ylabel('Sentences with different ending frequency, %')
plt.show()
But I can't count the sentences that end with an ellipsis (that is, ...): my code counts it as three characters, and therefore as three sentences. Moreover, this is a count of symbols, not of actual sentences. How can I improve it so that it counts sentences rather than marks, and also counts the sentences ending with an ellipsis? Example of the file: Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this.
CodePudding user response:
You could try splitting the text into sentences using a regex. The re.split() function works fine here:
Sample code:
import re
string = "Wanna play? Let's go! It will be definitely good! My friend also think so. However, I don't know... I don't like this."
# One or more '.', '!' or '?' followed by optional whitespace ends a sentence
print(re.split(r'\.+\s*|!+\s*|\?+\s*', string))
Output:
['Wanna play', "Let's go", 'It will be definitely good', 'My friend also think so', "However, I don't know", "I don't like this", '']
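Note that re.split() drops the punctuation, so it returns the sentences but not how each one ended. A minimal sketch of counting by ending mark instead, assuming that any run of end marks closes exactly one sentence and that a run starting with two or more dots is an ellipsis (`count_endings` is a hypothetical helper name, not part of the answer above):

```python
import re
from collections import Counter

def count_endings(text):
    """Count sentences by the mark that ends them."""
    counts = Counter()
    # Each match is one maximal run of end marks, e.g. "?", "!" or "...".
    for run in re.findall(r'[.!?]+', text):
        if run.startswith('..'):
            counts['...'] += 1      # two or more dots: treat as an ellipsis
        else:
            counts[run[0]] += 1     # classify by the first mark in the run
    return counts

text = ("Wanna play? Let's go! It will be definitely good! "
        "My friend also think so. However, I don't know... I don't like this.")
counts = count_endings(text)
print(dict(counts))          # {'?': 1, '!': 2, '.': 2, '...': 1}
print(sum(counts.values()))  # 6
```

This still miscounts dots that do not end a sentence, such as abbreviations or decimal numbers, so it is only a starting point.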
CodePudding user response:
By the look of your example, and following a common way of handling words in NLP (with regard to lemmatization etc.), you can first split the sentences on spaces:
marks = ['?', '!', '...', '.']  # '...' comes before '.' so the ellipsis is checked first
mark_count = {}
sentences = 0
with open('text.txt', encoding='utf8') as file:  # file name taken from the question
    for example in file:
        words = example.split(" ")  # list of all words together with their symbols
        # Look for an end-of-sentence mark in each word
        for word in words:
            for mark in marks:
                if mark in word:
                    # Count the end of a sentence and break as soon as a mark is found:
                    # for a word like "know..." the ellipsis matches first, so the
                    # single dot is never checked and the sentence is counted once.
                    sentences += 1
                    if mark_count.get(mark, False):
                        mark_count[mark] += 1
                    else:
                        mark_count[mark] = 1
                    break
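One caveat: `mark in word` also matches a mark in the middle of a token, for example the dot in "3.5". A possible variant that only looks at the end of each token is sketched here, assuming sentences always end on a token boundary (`ending_mark` is a hypothetical helper name):

```python
def ending_mark(word, marks=('...', '?', '!', '.')):
    """Return the end-of-sentence mark that `word` ends with, or None."""
    # '...' is listed before '.' so "know..." is reported as an ellipsis.
    for mark in marks:
        if word.endswith(mark):
            return mark
    return None

line = "However, I don't know... I don't like this."
found = [ending_mark(w) for w in line.split()]
print([m for m in found if m])  # ['...', '.']
```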
Edit:
To guarantee that each label is paired with its own count when plotting:
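import matplotlib.pyplot as plt
columns = []
values = []
# Build labels and heights in the same order so they stay aligned
for key in mark_count.keys():
    columns.append(key)
    values.append(mark_count[key])
plt.bar(columns, values, color='green')
plt.show()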
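The plot in the question shows percentages rather than raw counts. A small sketch of that conversion, assuming a `mark_count` dictionary like the one built above (`to_percentages` is a hypothetical helper name, and the counts shown are those of the example text):

```python
def to_percentages(mark_count):
    """Convert raw counts to percentages of all counted sentences."""
    total = sum(mark_count.values())
    if total == 0:
        # Avoid dividing by zero when no sentence ending was found.
        return {mark: 0.0 for mark in mark_count}
    return {mark: 100 * n / total for mark, n in mark_count.items()}

mark_count = {'?': 1, '!': 2, '.': 2, '...': 1}  # counts for the example text
percent = to_percentages(mark_count)
labels, values = zip(*percent.items())  # keeps each label paired with its value
print(labels)                           # ('?', '!', '.', '...')
print([round(v, 1) for v in values])    # [16.7, 33.3, 33.3, 16.7]
```

The resulting `labels` and `values` can then be passed to plt.bar() in place of the raw counts.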