I am trying to build a single data frame from a data set of 1000 .txt files. I loop through the files and extract the title, author, language, etc. from each one.
import re
import pandas as pd
from glob import glob

files = glob('dataset/*.txt')
files.sort()
files
for n in files:
    with open(n, 'r') as text_file:
        text = text_file.read()
    # These can be reused for each book
    title = re.compile(r'Title: (.*)\n')
    author = re.compile(r'Author: (.*)\n')
    release_date = re.compile(r'Release Date: (.*)\s')
    language = re.compile(r'Language: (.*)\n')
    book_title = title.search(text).group(1)
    book_author = author.search(text).group(1)
    book_language = language.search(text).group(1)
    book_release = release_date.search(text).group(1).split(' [')[0]
    books = pd.DataFrame({"Title": [book_title], "Author": [book_author],
                          "Release_Date": [book_release], "Language": [book_language]})
books
This displays only a single row, but when I use print inside the loop it displays all the data, as separate data frames.
How do I display all these frames as one single data frame?
CodePudding user response:
Something like this may work. Be aware, however, that this program will raise an error if any of the book details are missing from any of the books, because search() returns None when a pattern does not match and None has no group() method.
Code:
import re
import pandas as pd

# One list per column; each book appends one value to each list
book_title, book_author, book_release, book_language = [], [], [], []
# These can be reused for each book
title = re.compile(r'Title: (.*)\n')
author = re.compile(r'Author: (.*)\n')
release_date = re.compile(r'Release Date: (.*)\s')
language = re.compile(r'Language: (.*)\n')
# `files` is the sorted list of paths built in the question
for n in files:
    with open(n, 'r') as text_file:
        text = text_file.read()
    book_title.append(title.search(text).group(1))
    book_author.append(author.search(text).group(1))
    book_language.append(language.search(text).group(1))
    book_release.append(release_date.search(text).group(1).split(' [')[0])
# Build the data frame once, after the loop, from the completed lists
books = pd.DataFrame({"Title": book_title, "Author": book_author,
                      "Release_Date": book_release, "Language": book_language})
Note:
To handle books that are missing some of these details, you could use this type of technique inside the loop:
if author.search(text) is not None:
    book_author.append(author.search(text).group(1))
else:
    book_author.append('-')
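
The same check can be applied to every field at once. The following is only a sketch under a few assumptions: the helper name extract_details, the PATTERNS dictionary, and the two sample texts are made up for illustration and are not part of the original code. It builds one dict per book and falls back to '-' whenever a pattern does not match.

```python
import re

# Patterns from the answer, keyed by the column name they fill
PATTERNS = {
    "Title": re.compile(r'Title: (.*)\n'),
    "Author": re.compile(r'Author: (.*)\n'),
    "Release_Date": re.compile(r'Release Date: (.*)\s'),
    "Language": re.compile(r'Language: (.*)\n'),
}


def extract_details(text, missing='-'):
    """Return a dict of book details, using `missing` for absent fields."""
    row = {}
    for column, pattern in PATTERNS.items():
        match = pattern.search(text)  # None when the field is not present
        row[column] = match.group(1) if match is not None else missing
    # Drop a trailing bracketed note from the release date, as the answer does
    row["Release_Date"] = row["Release_Date"].split(' [')[0]
    return row


# Hypothetical sample texts standing in for the contents of two files;
# the second one has no Author or Release Date line
samples = [
    "Title: Book One\nAuthor: A. Writer\n"
    "Release Date: May 1, 2000 [EBook #1]\nLanguage: English\n",
    "Title: Book Two\nLanguage: French\n",
]
rows = [extract_details(text) for text in samples]
print(rows)
```

In the real program the texts would come from reading each path in files, and passing the resulting list of dicts to pd.DataFrame(rows) should give one row per book, since pandas accepts a list of dicts.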