How would I extract & organize data from a txt file using python?

Time:12-10

Situation: I have a flat file of data with various elements in it and I need to extract specific portions. I am a beginner in Python and wrote it out using Regular Expressions and other functions. Here is a sample of the data from the txt file I receive:


**ACCESSORID = FS01234**   TYPE       = USER      SIZE         =     1024  BYTES
**NAME       =    JOHN SMITH**                                                                                                                                                              FACILITY   = TSO                                                                                                                    
 DEPT ACID  = D12RGRD  DEPARTMENT = TRAINING                                                                       
 DIV ACID   = NR       DIVISION   = NRE                                                                               
 CREATED    = 01/17/05  00:00  LAST MOD   = 11/16/21  10:42                                                                          
 **PROFILES   = VPSNRE     P11NR00A**                                                                                                     
 LAST USED  = 12/02/21 09:03 CPU(SYSB) FAC(SUPRSESS) COUNT(06051)                                                                    
 **XA SSN     = 123456789**                                     OWNER(JB112)                                                          
 XA TSOACCT = 123456789                                    OWNER(JB112 )                                                          
 XA TSOAUTH = JCL                                           OWNER(JB112  )                                                          
 XA TSOAUTH = RECOVER                                       OWNER(JB112 )                                                          
 XA TSOPROC = NR005PROC                                      OWNER(JB112 )                                                          
 -----------  SEGMENT TSO                                                                                                            
 TRBA       = NON-DISPLAY FIELD                                                                                                 
 TSOCOMMAND =                                                                                                                        
 TSODEFPRFG =                                                                                                                        
 TSOLACCT   = 111111111                                                                                                            
 TSOLPROC   = NR9923PROC                                                                                                               
 TSOLSIZE   = 0004096                                                                                                                
 TSOOPT     = MAIL,NONOTICES,NOOIDCARD                                                                                               
 TSOUDATA   = 0000                                                                                                                   
 TSOUNIT    = SYSDD                                                                                                                  
 TUPT       = NON-DISPLAY FIELD   
----------- SEGMENT USER
**EMAIL ADDR = [email protected]**                                                                                               

The portions I need to extract are bolded. I know I need to provide what I have done so far and without posting my entire script, here is what I am doing to extract the ACCESSORID = FS01234 and NAME = JOHN SMITH portion.

def RemoveSpace():
    # Collapse every run of whitespace (including newlines) into a single space.
    with open("PROJECTFILE.txt", "r") as f:
        data1 = f.read()
    word = data1.split()
    s = ' '.join(word)
    with open("RemoveSpace.txt", "w") as f1:
        f1.write(s)
    print("Data Written Successfully")

RemoveSpace()


import re

f = open(r"C:\Users\user\Desktop\HR\PROJECTFILE\RemoveSpace.txt", "r").read()

TSS = []

# Split the text into one chunk per ACCESSORID record.
contents = re.split(r"ACCESSORID =", f)
contents.pop(0)

for item in contents:
    TSS_DICT = {}

    emplid = re.search(r"FS.*", item)

    if emplid is not None:
        s_emplid = re.search(r"FS\w*", emplid.group())
    else:
        s_emplid = None

    if s_emplid is not None:
        s_emplid = s_emplid.group()
    else:
        s_emplid = None

    TSS_DICT["EMPLOYEE ID"] = s_emplid

    name = re.search(r"NAME =.*", item)

    if name is not None:
        emp_name = re.search(r"[^NAME = ][^,]*", name.group())
    else:
        emp_name = None

    if emp_name is not None:
        emp_name = emp_name.group()
    else:
        emp_name = None

    TSS_DICT["EMPLOYEE NAME"] = emp_name

Question: I am having some difficulty getting John Smith. It keeps bringing in everything after John Smith, down to the very last line with the email address. My end goal is a CSV file with each bolded item as its own column. More directly, how would experts approach this data cleanup to simplify the process? If needed I can post the full code, but I didn't want to muddle this up any more than needed.

CodePudding user response:

For practising your Regex, I recommend using a website like RegExr. Here, you can paste the text that you want to match and you can play around with different matching expressions to get the result that you intend.

Assuming that you want to use this code for multiple files of the same organisation and that the data is formatted the same way in each, you can simplify your code a lot.

Let's say we wanted to extract NAME = JOHN SMITH from the text file. We could write the following Python code to do this:

import re
pattern = "NAME = \\w+ \\w+"
name = re.findall(pattern, text_to_search)[0][7:]
print(name)

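pattern is our Regex search expression, and text_to_search is the contents of your text file, read into your Python script as a string. The pattern expects single spaces around the equals sign, as in the output of your RemoveSpace step. re.findall() returns a list of all matches, and [0] takes the first one. String slicing ([7:]) then removes the leading NAME = part, which is seven characters long including the spaces.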

The above code would output the following:

JOHN SMITH
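
One likely reason the original search ran on to the email address: once the file's whitespace has been collapsed onto a single line, `.` has no newline to stop at, so `NAME =.*` matches to the end of the string. Below is a minimal sketch of that effect, next to a bounded pattern that uses a capture group instead of slicing. The `collapsed` string is a shortened stand-in for the real data, and bounding the match at `FACILITY =` assumes that label always follows the name.

```python
import re

# Shortened stand-in for the file after all whitespace is collapsed to single spaces.
collapsed = (
    "ACCESSORID = FS01234 TYPE = USER NAME = JOHN SMITH "
    "FACILITY = TSO DEPT ACID = D12RGRD"
)

# Greedy ".*" only stops at a newline, and there are none left in the text.
greedy = re.search(r"NAME =.*", collapsed).group()

# Bounded alternative: capture lazily up to the next label (assumed to be FACILITY).
bounded = re.search(r"NAME = (.+?) FACILITY =", collapsed).group(1)

print(greedy)   # runs to the end of the string
print(bounded)  # only the name
```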

You should be able to apply the same principles to the other bold sections of your text file.
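
As a rough sketch of how the same idea could extend to every bolded field, the example below keeps one pattern per field and runs them over each ACCESSORID block. It works on the raw file rather than the whitespace-collapsed one, so line ends still bound each value. The names `FIELD_PATTERNS` and `parse_records` are made up for illustration, the sample is trimmed, and the email address is a placeholder, not a value from the original file.

```python
import re

# Hypothetical per-field patterns; each captures the value in group 1.
FIELD_PATTERNS = {
    "ACCESSORID": r"ACCESSORID[ \t]*=[ \t]*(\S+)",
    # The name ends at a run of two or more spaces, or at the end of the line.
    "NAME": r"\bNAME[ \t]*=[ \t]*(.*?)(?=[ \t]{2,}|$)",
    "PROFILES": r"PROFILES[ \t]*=[ \t]*(.*)",
    "XA SSN": r"XA SSN[ \t]*=[ \t]*(\S+)",
    "EMAIL ADDR": r"EMAIL ADDR[ \t]*=[ \t]*(\S+)",
}


def parse_records(text):
    """Return one dict per ACCESSORID block; missing fields become ''."""
    records = []
    # Zero-width split keeps the ACCESSORID label in each chunk (Python 3.7+).
    for chunk in re.split(r"(?=ACCESSORID[ \t]*=)", text):
        if not chunk.startswith("ACCESSORID"):
            continue  # skip anything before the first record
        record = {}
        for field, pattern in FIELD_PATTERNS.items():
            match = re.search(pattern, chunk, flags=re.MULTILINE)
            # Collapse runs of spaces inside a value, e.g. between two profiles.
            record[field] = " ".join(match.group(1).split()) if match else ""
        records.append(record)
    return records


sample = """\
ACCESSORID = FS01234   TYPE       = USER      SIZE         =     1024  BYTES
NAME       =    JOHN SMITH                          FACILITY   = TSO
 DEPT ACID  = D12RGRD  DEPARTMENT = TRAINING
 PROFILES   = VPSNRE     P11NR00A
 XA SSN     = 123456789                                     OWNER(JB112)
 XA TSOACCT = 123456789                                    OWNER(JB112 )
----------- SEGMENT USER
EMAIL ADDR = jsmith@example.com
"""

records = parse_records(sample)
print(records[0])
```

Keeping the patterns in a dictionary means a new column is one extra entry, and a field that is missing from a record yields an empty string instead of raising an error.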

In terms of writing your extracted data out to a CSV file, it is probably worth reading a good tutorial on this, for example "Reading and Writing CSV Files in Python". There are a few different ways of storing your information before writing, such as lists or dictionaries. Either way, you can write the CSV file with Python's built-in csv module or manually.
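
For the dictionary route, a minimal sketch using the standard library's `csv.DictWriter` might look like this. The `records` list and the `write_csv` helper are hypothetical, standing in for whatever your extraction step produces, and the example writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io

# Hypothetical records, shaped like the output of a per-field extraction step.
records = [
    {"ACCESSORID": "FS01234", "NAME": "JOHN SMITH", "PROFILES": "VPSNRE P11NR00A"},
]


def write_csv(rows, handle):
    """Write a list of dicts as CSV, using the first row's keys as the header."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Demonstrated on an in-memory buffer; for a real file, pass a handle from
# open("output.csv", "w", newline="") instead.
buffer = io.StringIO()
write_csv(records, buffer)
print(buffer.getvalue())
```

Each dictionary becomes one row and each key becomes one column, which matches the goal of having every bolded item in its own column.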
