Home > Enterprise >  How to read text file's key, value pair using pandas and write to csv?
How to read text file's key, value pair using pandas and write to csv?

Time:09-12

I want to parse a text file which contains following data.
The first four columns (Name, Date, Time and Activity) are always in the same order after that there is no fixed structure anymore

Input.log

Name: John Doe| Date: 30072022 | Time: 05:06| Activity: Call
Name: John Doe| Date: 30072022 | Time: 05:07| Activity: Search| Type: 99
Name: John Doe| Date: 30072022 | Time: 05:07| Activity: Search| Type: 99| Customer: 123456
Name: Hannah Parkes| Date: 10082022| Time: 13:16| Activity: Archiving| Customer: 0465755
Name: Shannon Harper| Date: 11082022| Time: 16:0 |Activity: Call
Name: Shannon Harper| Date: 11082022| Time: 17:15|Activity: Call Activity: Search| Type: 99| Customer: 985643
Name: Shannon Harper| Date: 11082022| Time: 17:16|Activity: Call Activity: Document| Customer: 985643| Customer type: FHR
Name:Rachel Bolton  | Date: 14092022| Time: 09:35|Activity: Search|Contract: 56575757| StartDate: 01-01-2021| EndDate: 29-07-2022

Desired output (to CSV)

 -------------- -------- ----- --------- ---- -------- ------------- -------- ---------- ---------- 
|Name          |Date    |Time |Activity |Type|Customer|Customer type|Contract|StartDate |EndDate   |
 -------------- -------- ----- --------- ---- -------- ------------- -------- ---------- ---------- 
|John Doe      |30072022|05:06|Call     |    |        |             |        |          |          |
|John Doe      |30072022|05:07|Search   |99  |        |             |        |          |          |
|John Doe      |30072022|05:07|Search   |99  |123456  |             |        |          |          |
|Hannah Parkes |10082022|13:16|Archiving|    |0465755 |             |        |          |          |
|Shannon Harper|11082022|16:05|Call     |    |        |             |        |          |          |
|Shannon Harper|11082022|17:15|Search   |99  |985643  |             |        |          |          |
|Shannon Harper|11082022|17:16|Document |    |985643  |FHR          |        |          |          |
|Rachel Bolton |14092022|09:35|Search   |    |        |             |56575757|01-01-2021|29-07-2022|
 -------------- -------- ----- --------- ---- -------- ------------- -------- ---------- ---------- 

Already tried something like this;

import pandas as pd

log_file = 'Input.log'
df = pd.read_csv(log_file, sep=";", index_col=None, names=['text'])
s = df.text.str.split('|', expand=True).stack().str.split(': ', expand=True)
print(s)

Any help to get this done is greatly appreciated.

CodePudding user response:

I change data by expected ouput (remove Activity: Call ), then is possible change your solution with MultiIndex and reshape by DataFrame.reindex:

s = df.text.str.split('|', expand=True).stack().str.split(':', expand=True).apply(lambda x: x.str.strip())

ordering= s[0].unique()
df = s.droplevel(1).set_index(0, append=True)[1].unstack().reindex(ordering, axis=1)

print(df)
0            Name      Date   Time   Activity Type Customer Customer type  \
0        John Doe  30072022  05:06       Call  NaN      NaN           NaN   
1        John Doe  30072022  05:07     Search   99      NaN           NaN   
2        John Doe  30072022  05:07     Search   99   123456           NaN   
3   Hannah Parkes  10082022  13:16  Archiving  NaN  0465755           NaN   
4  Shannon Harper  11082022   16:0       Call  NaN      NaN           NaN   
5  Shannon Harper  11082022  17:15     Search   99   985643           NaN   
6  Shannon Harper  11082022  17:16   Document  NaN   985643           FHR   
7   Rachel Bolton  14092022  09:35     Search  NaN      NaN           NaN   

0  Contract   StartDate     EndDate  
0       NaN         NaN         NaN  
1       NaN         NaN         NaN  
2       NaN         NaN         NaN  
3       NaN         NaN         NaN  
4       NaN         NaN         NaN  
5       NaN         NaN         NaN  
6       NaN         NaN         NaN  
7  56575757  01-01-2021  29-07-2022  
  • Related