Text data manipulation in Python

Time:10-12

I have a txt file with some data that I want to clean and export as CSV, but the format is messy. The lines in the txt file look like this:

[email protected]:specialcode | Status - 2022-11-25

[email protected]:anothercode | Status - 2023-08-15

[email protected]:codeworcd | Status - 2036-06-19

and so on.

I want to convert the lines to

[email protected] , specialcode , Status , 2022-11-25

[email protected], anothercode , Status , 2023-08-15 

[email protected], codeworcd, Status, 2036-06-19

so that I can save the file as CSV.

How can I approach this? I can loop over the lines and split each one with split(':'), but every delimiter is a different character, so it seems more challenging.

Thanks

CodePudding user response:

With the sample you have shown, please try the following:

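import csv, re

# newline="" lets the csv module control line endings (avoids blank rows on Windows)
with open("file.txt") as fi, open("output.csv", "w", newline="") as fo:
    writer = csv.writer(fo)
    for line in fi:
        # split on ":" or on a pipe/hyphen that has a space on each side
        fields = re.split(r':| [|-] ', line.rstrip())
        writer.writerow(fields)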

Result:

[email protected],specialcode,Status,2022-11-25
[email protected],anothercode,Status,2023-08-15
[email protected],codeworcd,Status,2036-06-19
  • It assumes the input filename is "file.txt" and the output filename is "output.csv".
  • The delimiter pattern is :| [|-] . It splits the line on a colon, or on a pipe or hyphen that has a space on each side. The pipe and hyphen must be surrounded by spaces, as in your sample; this is also why the hyphens inside the date are left alone.
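
To see what the pattern does before running it on the whole file, it can be tried on a single line. This is only a small sketch: the helper name split_line and the sample address are made up for illustration, and the pattern is the same one used above.

```python
import re

# Same delimiter as in the answer: a colon, or a pipe/hyphen with a space on each side.
DELIMITER = re.compile(r':| [|-] ')

def split_line(line):
    """Split one 'email:code | Status - date' line into its four fields (hypothetical helper)."""
    return DELIMITER.split(line.rstrip())

# Made-up sample line in the format shown in the question.
sample = "user@example.com:specialcode | Status - 2022-11-25\n"
fields = split_line(sample)
print(fields)
```

The hyphens in the date have no spaces around them, so the date stays in one piece.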

CodePudding user response:

Here is one way to do it without regex, assuming you have a text file on disk that you can read with read_csv.

import pandas as pd

# read in the text file and name the columns, assuming there is only one '|' per line
df = pd.read_csv(r'csv.csv', sep='|', header=None, names=['col1', 'col2'])

# split col1 on the colon
df[['email', 'code']] = df['col1'].str.split(':', expand=True)

# split on the first hyphen only (n must be passed by keyword in pandas 2.0+)
df[['status', 'date']] = df['col2'].str.split('-', n=1, expand=True)

# drop the read-in columns
df = df.drop(columns=['col1', 'col2'])

# strip the spaces left around the '|' and '-' delimiters
df = df.apply(lambda col: col.str.strip())

# write to csv without the index column
df.to_csv(r'csvout.csv', index=False)


   email             code         status  date
0  [email protected]  specialcode  Status  2022-11-25
1  [email protected]  anothercode  Status  2023-08-15
2  [email protected]  codeworcd    Status  2036-06-19
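
If pandas is not available, the same "split once on each delimiter" idea can be sketched with plain string methods. This is an alternative and is not taken from either answer above: the function name parse_line and the sample address are assumptions for illustration, and it expects exactly one '|' per line, as the pandas version does.

```python
def parse_line(line):
    """Return [email, code, status, date] using only str methods (hypothetical helper)."""
    # Split once on the pipe, then once on the first colon and the first hyphen.
    left, _, right = line.strip().partition("|")
    email, _, code = left.partition(":")
    status, _, date = right.partition("-")
    # Remove the spaces that surrounded the delimiters.
    return [part.strip() for part in (email, code, status, date)]

# Made-up sample line in the format shown in the question.
sample = "user@example.com:specialcode | Status - 2022-11-25"
row = parse_line(sample)
print(row)
```

Each returned list can be passed directly to csv.writer(...).writerow. Because partition only splits at the first occurrence, the hyphens inside the date are not touched.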
