I have a text file as shown below in the sample that I want to convert to a csv file (currently using Pandas).
The challenge is that I don't know in advance what the keys (column headers are) and in what order they are.
The final column order is not important.
Sample file:
name: john| dob: 10-06-1960| address: 4853 Radio Park Drive
name: jane| dob: 07-10-1973| address: 1537 Timbercrest Road| mobile: 706-289-6746
name: liam| dob: 12-08-1986| address: 4853 498 Fairmont Avenue| telephone number: 706-687-5021
name: chris| dob: 09-12-1965| address: 485 Green Avenue| state: California| Telephone Number: 510-855-5213
Desired output:
Name | dob | address | mobile | telephone number | state |
----- ------------ -------------------------- -------------- ------------------ ------------
john | 10-06-1960 | 4853 Radio Park Drive | | | |
jane | 07-10-1973 | 1537 Timbercrest Road | 706-289-6746 | | |
liam | 12-08-1986 | 4853 498 Fairmont Avenue | | 706-687-5021 | |
chris| 09-12-1965 | 485 Green Avenue | | 510-855-5213 | California |
My Code:
import pandas as pd
df = pd.DataFrame()
file = open('D:\sample.log', 'r')
lines = file.readlines()
for line in lines:
pairs = line.split('|')
my_dict = {}
for pair in pairs:
key = pair.split(': ')[0].strip()
value = pair.split(': ')[1].strip()
my_dict[key] = value
df.append(my_dict, ignore_index=True)
This way of appending is very slow. How can I make this faster.
Or is there a much better solution (for example via a json string)?
CodePudding user response:
TL;DR:
pd.DataFrame.from_records(
dict(field.split(': ') for field in line.split('|'))
for line in lines
)
Long Version
Assuming you already split your data into lines you then need to process them into records such as:
{' address': '4853 Radio Park Drive', ' dob': '10-06-1960', 'name': 'john'}
Each line needs to be split into fields:
>>> line = 'name: john| dob: 10-06-1960| address: 4853 Radio Park Drive'
>>> line.split('|')
['name: john', ' dob: 10-06-1960', ' address: 4853 Radio Park Drive']
Then each field needs to be split into the name of the column and the value itself:
>>> field = 'name: John'
>>> field.split(': ')
['name', 'john']
Once you do this for every field in the line you end up with a list of these:
>>> [field.split(': ') for field in line.split('|')]
[['name', 'john'],
[' dob', '10-06-1960'],
[' address', '4853 Radio Park Drive']]
A dictionary initialised with this list gets you the record from the beginning of the answer.
Since you have many lines, you need to produce many records but it's better to produce these lazily, in other words using a generator:
>>> (dict(field.split(': ') for field in line.split('|')) for line in s.split('\n'))
<generator object <genexpr> at 0x7f0d06bf8dd0>
Rather than producing you a whole list of records, the generator gives you one at a time when you iterate over it. This way you can start forming your dataframe without having to wait for all the records to be processed.
There is a special syntax in Python called generator comprehension that let's you define generators to be passed as an argument to functions and constructors.
Putting it all together, we construct a dataframe using the appropriate constructor (from_records
) and the generator defined above:
pd.DataFrame.from_records(
dict(field.split(': ') for field in line.split('|'))
for line in lines
)
This produces the following output:
name dob address mobile telephone number state Telephone Number
0 john 10-06-1960 4853 Radio Park Drive NaN NaN NaN NaN
1 jane 07-10-1973 1537 Timbercrest Road 706-289-6746 NaN NaN NaN
2 liam 12-08-1986 4853 498 Fairmont Avenue NaN 706-687-5021 NaN NaN
3 chris 09-12-1965 485 Green Avenue NaN NaN California 510-855-5213
As a bonus, you can speed this up further by reading the file lazily too. Define a custom generator for reading lines:
def lines(path):
with open(path) as file:
while line := file.readline():
yield line.rstrip()
Note this will only work with Python 3.8 . Otherwise, instead of using the walrus operator you need to do this instead:
def lines(path):
with open(path) as file:
while True:
line = file.readline()
if line:
yield line.rstrip()
else:
return
CodePudding user response:
I think this is one of the fastest way to do this task, as it utilizes pandas library's built-in multiprocessing (which is written with c/c ) that is much faster than iterating over lines.
First read the whole text into an variable. and then,
import pandas as pd
data = '''name: john| dob: 10-06-1960| address: 4853 Radio Park Drive
name: jane| dob: 07-10-1973| address: 1537 Timbercrest Road| mobile: 706-289-6746
name: liam| dob: 12-08-1986| address: 4853 498 Fairmont Avenue| telephone number: 706-687-5021
name: chris| dob: 09-12-1965| address: 485 Green Avenue| state: California| Telephone Number: 510-855-5213'''
def get_dict(line_elems):
line_dict = {}
for elem in line_elems:
k, v = elem.split(':')
line_dict[k]=v
return line_dict
df = pd.DataFrame({'lines':data.split('\n')})
df['line_list'] = df['lines'].apply(lambda x:x.split('|'))
df['line_dict'] = df['line_list'].apply(get_dict)
dict_list = df['line_dict'].tolist()
final_df = pd.DataFrame.from_dict(dict_list)
final_df
If you want I can explain the code, but let me know about its performance compared to other codes.
For reading the whole text from the file, you might use
with open('my_text_file',mode='r') as file:
data= file.read()