Text data manipulation in Python-CodePudding

I have a txt file with some data I want to clean and export as csv but the format is too messed up . The lines in the txt file are in this format

[email protected]:specialcode | Status - 2022-11-25

[email protected]:anothercode | Status - 2023-08-15

[email protected]:codeworcd | Status - 2036-06-19

and so one

I want to convert the lines to

[email protected] , specialcode , Status , 2022-11-25

[email protected], anothercode , Status , 2023-08-15 

[email protected], codeworcd, Status, 2036-06-19

So that i can save the file as csv.

How can I approach such a complex situation? I can loop over the lines and split it with split(‘:’) but each character is different. So it appears more challenging.

Thanks

CodePudding user response：

With your shown sample please try the following:

import csv, re

with open("file.txt") as fi, open("output.csv", "w") as fo:
    writer = csv.writer(fo)
    for line in fi:
        l = re.split(r':| [|-] ', line.rstrip())
        writer.writerow(l)

Result:

[email protected],specialcode,Status,2022-11-25
[email protected],anothercode,Status,2023-08-15
[email protected],codeworcd,Status,2036-06-19

It assumes the input filename is "file.txt" and the output filename is "output.csv".
The delimiter is defined as :| [|-] . It splits the line on a colon or a sequence of a whitespace, pipe or hyphen, and a whitespace. The important thing is the pipe character and hyphen are surrounded by whitespaces as shown in your sample.

CodePudding user response：

here is one way to do it without regex

assuming that you have a text file in filesystem, that you are or can read using read_csv

# read in the text file, and name the columns, assuming there is only one '|' in file

df=pd.read_csv(r'csv.csv', sep='|', header=None, names=['col1','col2'])

# split col1 on colon
df[['email','code']]=df['col1'].str.split(':', expand=True)

# split only one occurrence on hyphen
df[['status','date']]=df['col2'].str.split('-', 1,expand=True)

# drop the read-in columns
df=df.drop(columns=['col1','col2'])

# write to csv
df.to_csv(r'csvout.csv')

    email                 code          status  date
0   [email protected]    specialcode   Status  2022-11-25
1   [email protected] anothercode   Status  2023-08-15
2   [email protected]    codeworcd     Status  2036-06-19