Data file processing - Python

I have been assigned a task and cannot figure out how to proceed. It seemed relatively easy, but I must be doing something wrong, and I would appreciate some guidance. The code I have so far is:

infile = open("dataset.txt", "r")
outfile = open("emails.txt", "w")

def main():
  lines = infile.splitlines()
  for line in lines:
    if line.find("@", ".")!= -1:
      print(line)

infile.close()
outfile.close()

dataset.txt contains the 20 Newsgroups data set, which is the data set I am working with. The task is this: Find all valid sender e-mails (the email IDs that contain the symbols “@” and “.”) in the given data set and write them to a new file, “emails.txt”. Make sure you also display the number of characters in each email beside the email address. Your file must not contain duplicate emails. At the end of your file, display the total number of emails in the file and their average length.

CodePudding user response：

Consider using Python's re (Regular Expressions) library to identify valid email addresses in the dataset file.

import re

EMAIL_PATTERN = r'([a-zA-Z0-9_\-\.]+@[a-zA-Z0-9_\-\.]+\.[a-zA-Z]{2,5})'

def main():
  with open('dataset.txt', 'r') as in_file:
    lines = in_file.read().splitlines()
    with open('emails.txt', 'a') as out_file:
        for line in lines:
            match = re.search(EMAIL_PATTERN, line)
            if match is not None:
                out_file.write(match.group(1) + '\n')

main()

CodePudding user response：

You defined a function but never called it.
The str.find method doesn't accept two substrings. Its signature is str.find(sub[, start[, end]]), so the second argument is a start index, not another string to search for. Take a look at the documentation.
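
To make the second point concrete, here is a small sketch of how str.find behaves. The sample line is made up for illustration and is not taken from the data set:

```python
line = "From: user@example.com"  # hypothetical sample line

# The second argument of str.find is a start index, not another substring,
# so passing "." there raises a TypeError.
try:
    line.find("@", ".")
    error = None
except TypeError as exc:
    error = exc

at_index = line.find("@")   # index of the first "@" in the line
missing = line.find("#")    # -1 signals that the substring was not found
```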

As you've already done, you should iterate through the lines in the dataset.txt file and then extract the email addresses. Depending on your data, there are several ways to do that.

It's totally fine to use str methods as you did; I just decided to use a simple, naive regex. This regex only checks whether a string contains @ and ., in that order. There are definitely more rigorous patterns if you search.

Finally, add the email addresses to a set so that duplicates are removed:

import re

pattern = re.compile(r"\b(?=\S*@\S*\.\S*)\S*\b")

emails = set()
for line in open("dataset.txt").readlines():
    if m := pattern.search(line):
        emails.add(m.group())

with open("emails.txt", "w") as f:
    for email in emails:
        print(len(email), email, file=f)
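
The task also asks for the total number of emails and their average length at the end of the file. A minimal sketch of that step, assuming the addresses have already been collected into a set as in the code above. The function name write_report and the exact output layout are illustrative choices, not something the assignment specifies:

```python
import io

def write_report(emails, out):
    """Write each unique email with its length, then the count and average."""
    unique = sorted(set(emails))  # sorted only to make the output order stable
    for email in unique:
        out.write(f"{email} {len(email)}\n")
    total = len(unique)
    # Guard against an empty set to avoid dividing by zero.
    average = sum(len(e) for e in unique) / total if total else 0.0
    out.write(f"Total emails: {total}\n")
    out.write(f"Average length: {average:.2f}\n")
    return total, average

# Demonstration with an in-memory file and two made-up addresses.
buffer = io.StringIO()
total, average = write_report({"a@b.com", "carol@example.org"}, buffer)
```

To write to disk instead, pass an open file object, for example the f from with open("emails.txt", "w") as f.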

If you use the .find() method, you should check for "@" and "." in separate calls, not together. First check the return value of .find("@"), then check whether there is a dot after it with .find(".", N + 1), where N is the value returned by the previous find.
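
A minimal sketch of that two-step check, assuming you only need to know whether a line contains "@" followed somewhere later by ".". The helper name has_at_then_dot and the sample strings are hypothetical:

```python
def has_at_then_dot(line):
    """Return True if line contains '@' followed later by '.'."""
    at_index = line.find("@")  # -1 when there is no "@"
    if at_index == -1:
        return False
    # Search for "." only in the part of the string after the "@".
    return line.find(".", at_index + 1) != -1

print(has_at_then_dot("From: user@example.com"))  # True
print(has_at_then_dot("no.at.sign here"))         # False
print(has_at_then_dot("dot.before@sign"))         # False
```

Note that this tests the whole line rather than a single word, so you would still need to split the line (for example with str.split) to pull out the address itself.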