Python regex.split: export each split word as a .txt file, named after the word, to a specified folder
Hello community! I'm learning Python and I'm trying to perform several actions on a text:
- Split the text with NLTK's regex.split
- Make regex.split skip empty results ('') and solo '-', but keep hyphenated words like 'word-hyphen'
- Export each split word as a .txt file, named after the word, to a specified folder (skipping empty results and solo '-' so that no empty files are created)
Step 1 is done :
# coding: utf-8
import nltk
import regex

s = "This sentence is in first place. This second sentence isn't in first place."
regex.split(r"[\s\.\,]", s)
['This', 'sentence', 'is', 'in', 'first', 'place', '', 'This', 'second', 'sentence', "isn't", 'in', 'first', 'place', '']
Steps 2 and 3 are what I'm trying to do:
2. Skip empty results ('') and solo '-', but keep hyphenated words like 'word-hyphen'
Here is my attempt at step 2:
# coding: utf-8
import nltk
import regex

s = "This sentence is in first place and contain a word-hyphen — Hello I am the second sentence and I'm in second place."
regex.split(r"[\s\.;!?…»,«\,]", s)
['This', 'sentence', 'is', 'in', 'first', 'place', 'and', 'contain', 'a', 'word-hyphen', '-', 'Hello', 'I', 'am', 'the', 'second', 'sentence', 'and', "I'm", 'in', 'second', 'place', '']
3. Export each split word as a .txt file, named after the word, to a specified folder
Does anyone know how to do something like that?
Thanks for your help!
CodePudding user response:
You're not using nltk's regular expression engine. Maybe you want RegexpTokenizer?
Because you're not using variables and rely on this automatic printing, I'm guessing you're working in the command line or IDLE. You'll have to use variables for step 3, and at some point you'll have to use .py files too. Let's begin now; and if I'm wrong, sorry.
Since step 2 asks you not to have empty results, it indicates that you have a problem in step 1. Let's try with RegexpTokenizer then:
from nltk.tokenize import RegexpTokenizer

s = "This sentence is in first place. This second sentence isn't in first place."
tokenizer = RegexpTokenizer(r"[\s\.\,]", gaps=True)
split = tokenizer.tokenize(s)
print(split)
Output:
['This', 'sentence', 'is', 'in', 'first', 'place', 'This', 'second', 'sentence', "isn't", 'in', 'first', 'place']
No empty results here, we're good.
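For comparison, the standard library reaches the same result if you filter the empty strings out of a plain re.split by hand; a minimal sketch:

```python
import re

s = "This sentence is in first place. This second sentence isn't in first place."
# re.split leaves '' entries where two separators are adjacent (e.g. ". "),
# so a list comprehension drops them afterwards.
tokens = [t for t in re.split(r"[\s.,]", s) if t]
print(tokens)
```

This prints the same word list as the RegexpTokenizer call above; gaps=True simply does that filtering for you.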
For step 2, I don't understand your regex: just take the one from step 1, r"[\s\.\,]", and add the em dash: r"[\s\.\,—]":
from nltk.tokenize import RegexpTokenizer

s = "This sentence is in first place and contain a word-hyphen — Hello I am the second sentence and I'm in second place."
tokenizer = RegexpTokenizer(r"[\s\.\,—]", gaps=True)
split = tokenizer.tokenize(s)
print(split)
Output:
['This', 'sentence', 'is', 'in', 'first', 'place', 'and', 'contain', 'a', 'word-hyphen', 'Hello', 'I', 'am', 'the', 'second', 'sentence', 'and', "I'm", 'in', 'second', 'place']
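Note that the character class above only removes the em dash '—'; a plain hyphen-minus '-' standing alone (as in the output shown in the question) would still survive. If that case can occur in your texts, a small post-filter, sketched here on a hand-made token list, drops empty strings and bare dashes while keeping hyphenated words:

```python
# Example token list containing the problem cases: '' and bare dashes.
tokens = ['This', '', 'word-hyphen', '-', '—', "I'm", 'place']
# Keep only non-empty tokens that aren't a lone dash of either kind.
words = [t for t in tokens if t and t not in ('-', '—')]
print(words)
```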
For step 3, the simplest way should be this:
import os.path

path_to_files = 'C:\\Users\\username\\Desktop\\Split txt export'
for word in split:
    filename = word + '.txt'
    fullpath = os.path.join(path_to_files, filename)
    with open(fullpath, 'w') as f:
        f.write(word)