How do I transform a non-CSV text file into a CSV using Python/Pandas?


I have a text file that looks like this:

Id Number: 12345678
Location: 1234561791234567090-8.9
Street: 999 Street AVE
Buyer: john doe
Id Number: 12345688
Location: 3582561791254567090-8.9
Street: 123 Street AVE
Buyer: Jane doe @ buyer % LLC
Id Number: 12345689
Location: 8542561791254567090-8.9
Street: 854 Street AVE
Buyer: Jake and Bob: Owner%LLC: Inc

I'd like the file to look like this:

Id Number Location Street Buyer
12345678 1234561791234567090-8.9 999 Street AVE john doe
12345688 3582561791254567090-8.9 123 Street AVE Jane doe @ buyer % LLC
12345689 8542561791254567090-8.9 854 Street AVE Jake and Bob: Owner%LLC: Inc

I have tried the following:

# 1 Read text file and ignore bad lines (lines with extra colons thus reading as extra fields).
tr = pd.read_csv('C:\\File Path\\test.txt', sep=':', header=None, error_bad_lines=False)

# 2 Convert into a dataframe/pivot table.
ndf = pd.DataFrame(tr.pivot(index=None, columns=0, values=1))

# 3 Clean up the pivot table to remove NaNs and reset the index (line by line).
nf2 = ndf.apply(lambda x: x.dropna().reset_index(drop=True))

Here is where I got the last line (#3): https://stackoverflow.com/a/62481057/10448224

When I do the above and export to CSV the headers are arranged like the following:

(index) Street Buyer Id Number Location

The data is filled in nicely, but at some point the Buyer field becomes inaccurate, while the rest of the fields stay accurate through the entire DF.

My guesses:

When I run part #1 of my script, I get the following error 507 times:

b'Skipping line 500: expected 2 fields, saw 3\nSkipping line 728: expected 2 fields, saw 3\

At the tail end of the new DF I am missing exactly 507 entries for the Buyer field. So I think that when I drop my bad lines, the remaining data is getting pushed up into the wrong rows.

Pain Points:

The Buyer field will sometimes have extra colons and other odd characters. So when I try to use a colon as a delimiter I run into problems.

I am new to Python and I am very new to using functions. I primarily use Pandas to manipulate data at a somewhat basic level. So in the words of the great Michael Scott: "Explain it to me like I'm five." Many many thanks to anyone willing to help.

CodePudding user response:

This is a minimal example that demonstrates the basics:

cat split_test.txt 

Id Number: 12345678
Location: 1234561791234567090-8.9
Street: 999 Street AVE
Buyer: john doe
Id Number: 12345688
Location: 3582561791254567090-8.9
Street: 123 Street AVE
Buyer: Jane doe @ buyer % LLC
Id Number: 12345689
Location: 8542561791254567090-8.9
Street: 854 Street AVE
Buyer: Jake and Bob: Owner%LLC: Inc

import csv 

with open("split_test.txt", "r") as f:
    id_val = "Id Number"
    list_var = []
    for line in f:
        # note: split(':') with no maxsplit also splits on any extra colons,
        # so anything after a second colon is dropped from the value
        split_line = line.strip().split(':')
        print(split_line)
        if split_line[0] == id_val:
            # an "Id Number" line starts a new record
            d = {}
            d[split_line[0]] = split_line[1]
            list_var.append(d)
        else:
            d.update({split_line[0]: split_line[1]})

list_var

[{'Id Number': ' 12345689',
  'Location': ' 8542561791254567090-8.9',
  'Street': ' 854 Street AVE',
  'Buyer': ' Jake and Bob'},
 {'Id Number': ' 12345678',
  'Location': ' 1234561791234567090-8.9',
  'Street': ' 999 Street AVE',
  'Buyer': ' john doe'},
 {'Id Number': ' 12345688',
  'Location': ' 3582561791254567090-8.9',
  'Street': ' 123 Street AVE',
  'Buyer': ' Jane doe @ buyer % LLC'}]

with open("split_ex.csv", "w") as csv_file:
    field_names = list_var[0].keys()
    csv_writer = csv.DictWriter(csv_file, fieldnames=field_names)
    csv_writer.writeheader()
    for row in list_var:
        csv_writer.writerow(row)
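
One caveat with the loop above: split(':') with no maxsplit is why "Jake and Bob: Owner%LLC: Inc" comes through as just "Jake and Bob" in list_var. An untested variation that splits only on the first colon keeps the whole value; the DictWriter part stays exactly the same:

with open("split_test.txt", "r") as f:
    id_val = "Id Number"
    list_var = []
    for line in f:
        # maxsplit=1: only the first colon separates key from value,
        # so colons inside the Buyer field are preserved
        key, value = line.strip().split(':', 1)
        if key == id_val:
            # an "Id Number" line starts a new record
            d = {}
            d[key] = value
            list_var.append(d)
        else:
            d.update({key: value})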

CodePudding user response:

Here's what I meant by reading in and using split. It's very similar to the other answers. Untested, and I don't recall whether inputline includes the EOL, so I stripped it too.

import pandas as pd

with open('myfile.txt') as f:
    data = []   # holds database
    record = {} # holds built up record
    for inputline in f:
        key,value = inputline.strip().split(':',1)
        if key == "Id Number": # new record starting
            if len(record):
                data.append(record) # write previous record
                record = {}
        record.update({key:value})
if len(record):
    data.append(record) # write out final record
df = pd.DataFrame(data)
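
One small follow-up: because only the colon is split off, every value keeps the space that follows it in the file. A quick, untested way to tidy that up and write out the CSV the question asks for (the output file name is just a placeholder), assuming every column holds strings, which it does here since everything came from split:

# every column came from split(), so it's all strings; strip stray whitespace
df = df.apply(lambda col: col.str.strip())
df.to_csv('myfile.csv', index=False)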

CodePudding user response:

I would try reading the file line by line, splitting each line into a key-value pair, and collecting the pairs into a list of dicts that looks something like this (a sketch of that loop follows at the end of this answer):

data = [
  {
     "Id Number": 12345678,
     "Location": 1234561791234567090-8.9,
     ...
  },
  {
     "Id Number": ...
  }
]

# easy to create the dataframe from here
your_df = pd.DataFrame(data)
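
Here is that sketch, untested, with placeholder file names; it assumes every record starts with an "Id Number" line as in the sample, and the values end up as stripped strings rather than the bare numbers shown in the example above:

import pandas as pd

data = []
with open('test.txt') as f:
    for line in f:
        # split on the first colon only, so extra colons stay in the Buyer value
        key, value = line.strip().split(':', 1)
        if key == 'Id Number':
            data.append({})  # an "Id Number" line starts a new record
        data[-1][key] = value.strip()

your_df = pd.DataFrame(data)
your_df.to_csv('out.csv', index=False)  # columns: Id Number, Location, Street, Buyer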