Convert a text file with different key value pairs to a csv file-CodePudding

I have a text file as shown below in the sample that I want to convert to a csv file (currently using Pandas).
The challenge is that I don't know in advance what the keys (column headers are) and in what order they are.
The final column order is not important.

Sample file:

name: john| dob: 10-06-1960| address: 4853 Radio Park Drive
name: jane| dob: 07-10-1973| address: 1537 Timbercrest Road| mobile: 706-289-6746
name: liam| dob: 12-08-1986| address: 4853 498 Fairmont Avenue| telephone number: 706-687-5021
name: chris| dob: 09-12-1965| address: 485 Green Avenue| state: California| Telephone Number: 510-855-5213

Desired output:

Name | dob        | address                  | mobile       | telephone number | state      |
----- ------------ -------------------------- -------------- ------------------ ------------ 
john | 10-06-1960 | 4853 Radio Park Drive    |              |                  |            |
jane | 07-10-1973 | 1537 Timbercrest Road    | 706-289-6746 |                  |            |
liam | 12-08-1986 | 4853 498 Fairmont Avenue |              | 706-687-5021     |            |
chris| 09-12-1965 | 485 Green Avenue         |              | 510-855-5213     | California |

My Code:

import pandas as pd

df = pd.DataFrame()

file = open('D:\sample.log', 'r')
        
lines = file.readlines()

for line in lines:

    pairs = line.split('|')

    my_dict = {}

    for pair in pairs:
       key = pair.split(': ')[0].strip()
       value = pair.split(': ')[1].strip()
            
       my_dict[key] = value

       df.append(my_dict, ignore_index=True)

This way of appending is very slow. How can I make this faster.

Or is there a much better solution (for example via a json string)?

CodePudding user response：

TL;DR:

pd.DataFrame.from_records(
    dict(field.split(': ') for field in line.split('|'))
    for line in lines
)

Long Version

Assuming you already split your data into lines you then need to process them into records such as:

{' address': '4853 Radio Park Drive', ' dob': '10-06-1960', 'name': 'john'}

Each line needs to be split into fields:

>>> line = 'name: john| dob: 10-06-1960| address: 4853 Radio Park Drive'
>>> line.split('|')
['name: john', ' dob: 10-06-1960', ' address: 4853 Radio Park Drive']

Then each field needs to be split into the name of the column and the value itself:

>>> field = 'name: John'
>>> field.split(': ')
['name', 'john']

Once you do this for every field in the line you end up with a list of these:

>>> [field.split(': ') for field in line.split('|')]
[['name', 'john'],
 [' dob', '10-06-1960'],
 [' address', '4853 Radio Park Drive']]

A dictionary initialised with this list gets you the record from the beginning of the answer.

Since you have many lines, you need to produce many records but it's better to produce these lazily, in other words using a generator:

>>> (dict(field.split(': ') for field in line.split('|')) for line in s.split('\n'))
<generator object <genexpr> at 0x7f0d06bf8dd0>

Rather than producing you a whole list of records, the generator gives you one at a time when you iterate over it. This way you can start forming your dataframe without having to wait for all the records to be processed.

There is a special syntax in Python called generator comprehension that let's you define generators to be passed as an argument to functions and constructors.

Putting it all together, we construct a dataframe using the appropriate constructor (from_records) and the generator defined above:

pd.DataFrame.from_records(
    dict(field.split(': ') for field in line.split('|'))
    for line in lines
)

This produces the following output:

    name         dob                   address        mobile  telephone number       state  Telephone Number
0   john  10-06-1960     4853 Radio Park Drive           NaN               NaN         NaN               NaN
1   jane  07-10-1973     1537 Timbercrest Road  706-289-6746               NaN         NaN               NaN
2   liam  12-08-1986  4853 498 Fairmont Avenue           NaN      706-687-5021         NaN               NaN
3  chris  09-12-1965          485 Green Avenue           NaN               NaN  California      510-855-5213

As a bonus, you can speed this up further by reading the file lazily too. Define a custom generator for reading lines:

def lines(path):
  with open(path) as file:
    while line := file.readline():
      yield line.rstrip()

Note this will only work with Python 3.8 . Otherwise, instead of using the walrus operator you need to do this instead:

def lines(path):
  with open(path) as file:
    while True:
      line = file.readline()
      if line:
        yield line.rstrip()
      else:
        return

CodePudding user response：

I think this is one of the fastest way to do this task, as it utilizes pandas library's built-in multiprocessing (which is written with c/c ) that is much faster than iterating over lines.

First read the whole text into an variable. and then,

import pandas as pd

data = '''name: john| dob: 10-06-1960| address: 4853 Radio Park Drive
name: jane| dob: 07-10-1973| address: 1537 Timbercrest Road| mobile: 706-289-6746
name: liam| dob: 12-08-1986| address: 4853 498 Fairmont Avenue| telephone number: 706-687-5021
name: chris| dob: 09-12-1965| address: 485 Green Avenue| state: California| Telephone Number: 510-855-5213'''


def get_dict(line_elems):
    line_dict = {}
    for elem in line_elems:
        k, v = elem.split(':')
        line_dict[k]=v
    return line_dict

df = pd.DataFrame({'lines':data.split('\n')})
df['line_list'] = df['lines'].apply(lambda x:x.split('|'))
df['line_dict'] = df['line_list'].apply(get_dict)
dict_list = df['line_dict'].tolist()
final_df = pd.DataFrame.from_dict(dict_list)
final_df

If you want I can explain the code, but let me know about its performance compared to other codes.

For reading the whole text from the file, you might use

with open('my_text_file',mode='r') as file:
    data= file.read()