Splitting sentences from a .txt file to .csv using NLTK

Time:12-16

I have a corpus of newspaper articles in a .txt file, and I'm trying to split it into sentences and write them to a .csv file so that I can annotate each sentence.

I was told to use NLTK for this purpose, and I found the following code for sentence splitting:

import nltk

from nltk.tokenize import sent_tokenize

sent_tokenize("Here is my first sentence. And that's a second one.")

However, I'm wondering:

  1. How does one use a .txt file as input for the tokenizer (so that I don't have to copy and paste everything)?
  2. How does one output a .csv file instead of just printing the sentences in the terminal?

CodePudding user response:

Reading a .txt file & tokenizing its sentences

Assuming the .txt file is in the directory you run your Python script from (a relative path such as "myfile.txt" is resolved against the current working directory), you can read the file and tokenize its sentences using NLTK as shown below:

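from nltk import sent_tokenize

# sent_tokenize needs the Punkt model, downloaded once with nltk.download("punkt")
# (recent NLTK releases ask for "punkt_tab" instead).

# Read the whole file into one string; adjust the encoding to match your file.
with open("myfile.txt", encoding="utf-8") as file:
    textFile = file.read()

# Split the text into a list of sentence strings.
tokenTextList = sent_tokenize(textFile)
print(tokenTextList)
# Output for a file containing the two example sentences:
# ['Here is my first sentence.', "And that's a second one."]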
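One thing that may come up with newspaper text: if the .txt file is hard-wrapped, the tokenizer keeps the newline characters inside each sentence, and those end up as multi-line cells in the .csv. A minimal sketch of a cleanup step, assuming you want every sentence on a single line (normalize_whitespace is a hypothetical helper name, not part of NLTK):

```python
def normalize_whitespace(sentences):
    """Collapse runs of whitespace (including newlines) inside each sentence to single spaces."""
    return [" ".join(sentence.split()) for sentence in sentences]


# Made-up sample standing in for the output of sent_tokenize on wrapped text.
wrapped = ["Here is my first\nsentence.", "And that's   a second one."]
print(normalize_whitespace(wrapped))
# ['Here is my first sentence.', "And that's a second one."]
```

You could apply it right after tokenizing, e.g. tokenTextList = normalize_whitespace(tokenTextList).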

Writing a list of sentence tokens to .csv file

There are several ways to write a .csv file. Pick whichever is most convenient (e.g. if you already have pandas loaded, use the pandas option).

To write a .csv file using the pandas module:

import pandas as pd

df = pd.DataFrame(tokenTextList)
df.to_csv("myCSVfile.csv", index=False, header=False)

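To write a .csv file using the numpy module (note that np.savetxt does not quote fields, so a sentence containing a comma will be read back as several columns):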

import numpy as np

np.savetxt("myCSVfile.csv", tokenTextList, delimiter=",", fmt="%s")
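To illustrate why quoting matters for sentences that contain commas, here is a small sketch using only the standard library. The sample sentences are made up, and the "naive" string simply stands in for any writer that emits raw text without quoting:

```python
import csv
import io

sentences = ["First, a sentence with a comma.", 'She said "hi" and left.']

# Naive output: one raw sentence per line, with no quoting at all.
naive = "\n".join(sentences) + "\n"

# csv.writer output: fields containing commas or quotes are quoted automatically.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows([[s] for s in sentences])
quoted = buffer.getvalue()

# Parse both versions back the way a spreadsheet or annotation tool would.
naive_rows = list(csv.reader(io.StringIO(naive)))
quoted_rows = list(csv.reader(io.StringIO(quoted)))

print(len(naive_rows[0]))   # 2 -> the first sentence was split at its comma
print(len(quoted_rows[0]))  # 1 -> the sentence survives as a single field
```

For free text such as newspaper sentences, a writer that quotes fields (the pandas or csv options here) is probably the safer choice.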

To write a .csv file using the csv module:

import csv

with open('myCSVfile.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, lineterminator='\n')
    # writer.writerows([tokenTextList])  # Alternative: all sentences in a single row
    writer.writerows([[token] for token in tokenTextList])  # One sentence per row (pandas-style output)
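Putting the two steps together, the whole task could be sketched as a single function. This is only one possible arrangement: the function name sentences_to_csv, the "sentence" and "label" column headers, and the empty label column (left blank for annotation) are assumptions for illustration and do not come from NLTK or the question.

```python
import csv
from pathlib import Path


def sentences_to_csv(txt_path, csv_path, tokenizer=None):
    """Split the text in txt_path into sentences and write one per row to csv_path."""
    if tokenizer is None:
        # Imported lazily so another splitter can be passed in instead of NLTK.
        from nltk.tokenize import sent_tokenize
        tokenizer = sent_tokenize

    text = Path(txt_path).read_text(encoding="utf-8")
    sentences = tokenizer(text)

    with open(csv_path, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["sentence", "label"])  # hypothetical header row
        # One sentence per row, with an empty cell to fill in while annotating.
        writer.writerows([sentence, ""] for sentence in sentences)
    return len(sentences)
```

A call such as sentences_to_csv("myfile.txt", "myCSVfile.csv") would then use NLTK's sent_tokenize by default and return the number of sentences written.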