Python Regex.split text: export each split as .txt with each split word as filename to a specified folder

Python regex.split a text and export each split as a .txt file, with each split word as its filename, to a specified folder path

Hello community! I'm learning Python and I'm trying to perform several actions on a text:

  1. Split the text with NLTK / regex.split

  2. regex.split without empty results such as '' or a lone '-' (but keeping 'word-hyphen')
  3. Export each split as a .txt file, with each split word as the filename, to a specified folder; the split from step 2 must not produce '' or a lone '-' (except 'word-hyphen'), so that no empty files are created

Step 1 is done:

    # coding: utf-8
    import nltk
    s = "This sentence is in first place. This second sentence isn't in first place."
    import regex
    regex.split("[\s\.\,]", s)
    ['This', 'sentence', 'is', 'in', 'first', 'place', '', 'This', 'second', 'sentence', "isn't", 'in', 'first', 'place', '']

Steps 2 and 3 are what I'm trying to do:

2. Do not keep empty results such as '' or a lone '-' (except 'word-hyphen')

What I have done for step 2:

    # coding: utf-8
    import nltk
    s = "This sentence is in first place and contain a word-hyphen — Hello I am the second sentence and I'm in second place."
    import regex
    regex.split("[\s\.;!?…»,«\,]", s)
    ['This', 'sentence', 'is', 'in', 'first', 'place', 'and', 'contain', 'a', 'word-hyphen', '-', 'Hello', 'I', 'am', 'the', 'second', 'sentence', 'and', "I'm", 'in', 'second', 'place', '']

3. Export each split as a .txt file, with each split word as the filename, to a specified folder


Does someone know how we can do something like that?


Thanks for your help

CodePudding user response:

You're not using nltk's regular expression engine. Maybe you want RegexpTokenizer?

Because you're not using variables and rely on this "automatic printing", I'm guessing you're working in the command line or IDLE. You'll have to use variables for step 3, and at some point you'll have to use .py files too. Let's begin now; and if I'm wrong about that, sorry.

Since you want no empty results in step 2, that points to a problem in step 1. Let's try RegexpTokenizer then:

from nltk.tokenize import RegexpTokenizer
s = "This sentence is in first place. This second sentence isn't in first place."
tokenizer = RegexpTokenizer("[\s\.\,]", gaps=True)
split=tokenizer.tokenize(s)
print(split)

Output:

['This', 'sentence', 'is', 'in', 'first', 'place', 'This', 'second', 'sentence', "isn't", 'in', 'first', 'place']

No empty results here, we're good.
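
If you'd rather stay with the regex module, filtering the empty strings out of regex.split's result gives the same list; a minimal sketch:

import regex
s = "This sentence is in first place. This second sentence isn't in first place."
split = [w for w in regex.split(r"[\s\.\,]", s) if w]  # drop the '' entries
print(split)

This prints the same list as above.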

For step 2, I don't understand your regex: just take the one from step 1, "[\s\.\,]", and add the em dash: "[\s\.\,—]":

from nltk.tokenize import RegexpTokenizer
s = "This sentence is in first place and contain a word-hyphen — Hello I am the second sentence and I'm in second place."
tokenizer = RegexpTokenizer("[\s\.\,—]", gaps=True)
split=tokenizer.tokenize(s)
print(split)

Output:

['This', 'sentence', 'is', 'in', 'first', 'place', 'and', 'contain', 'a', 'word-hyphen', 'Hello', 'I', 'am', 'the', 'second', 'sentence', 'and', "I'm", 'in', 'second', 'place']
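
Your step 2 output also had a stand-alone '-' token. The em dash in the pattern handles that here, but if your real texts contain other lone hyphens that you want to drop while keeping 'word-hyphen', a small post-filter on the list should work; a sketch:

split = [w for w in split if w.strip('-—')]  # drops '', '-' and '—' but keeps 'word-hyphen'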

For step 3, the simplest way should be something like this:

import os.path
path_to_files = 'C:\\Users\\username\\Desktop\\Split txt export'

for word in split:
    filename = word + '.txt'
    fullpath = os.path.join(path_to_files, filename)
    with open(fullpath, 'w') as f:
        f.write(word)
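
Two possible refinements, depending on what you need: os.makedirs can create the target folder if it doesn't exist yet, and since the same word can appear several times in a text, you may want to number duplicates instead of silently overwriting the same file. A sketch (the seen counter is just an illustration):

import os

os.makedirs(path_to_files, exist_ok=True)  # create the folder if it doesn't exist

seen = {}
for word in split:
    seen[word] = seen.get(word, 0) + 1
    # first occurrence -> 'place.txt', second -> 'place_2.txt', and so on
    filename = word + '.txt' if seen[word] == 1 else f"{word}_{seen[word]}.txt"
    with open(os.path.join(path_to_files, filename), 'w') as f:
        f.write(word)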