I want to parse a text file which contains following data.
The first four columns (Name, Date, Time and Activity) are always in the same order after that there is no fixed structure anymore
Input.log
Name: John Doe| Date: 30072022 | Time: 05:06| Activity: Call
Name: John Doe| Date: 30072022 | Time: 05:07| Activity: Search| Type: 99
Name: John Doe| Date: 30072022 | Time: 05:07| Activity: Search| Type: 99| Customer: 123456
Name: Hannah Parkes| Date: 10082022| Time: 13:16| Activity: Archiving| Customer: 0465755
Name: Shannon Harper| Date: 11082022| Time: 16:0 |Activity: Call
Name: Shannon Harper| Date: 11082022| Time: 17:15|Activity: Call Activity: Search| Type: 99| Customer: 985643
Name: Shannon Harper| Date: 11082022| Time: 17:16|Activity: Call Activity: Document| Customer: 985643| Customer type: FHR
Name:Rachel Bolton | Date: 14092022| Time: 09:35|Activity: Search|Contract: 56575757| StartDate: 01-01-2021| EndDate: 29-07-2022
Desired output (to CSV)
-------------- -------- ----- --------- ---- -------- ------------- -------- ---------- ----------
|Name |Date |Time |Activity |Type|Customer|Customer type|Contract|StartDate |EndDate |
-------------- -------- ----- --------- ---- -------- ------------- -------- ---------- ----------
|John Doe |30072022|05:06|Call | | | | | | |
|John Doe |30072022|05:07|Search |99 | | | | | |
|John Doe |30072022|05:07|Search |99 |123456 | | | | |
|Hannah Parkes |10082022|13:16|Archiving| |0465755 | | | | |
|Shannon Harper|11082022|16:05|Call | | | | | | |
|Shannon Harper|11082022|17:15|Search |99 |985643 | | | | |
|Shannon Harper|11082022|17:16|Document | |985643 |FHR | | | |
|Rachel Bolton |14092022|09:35|Search | | | |56575757|01-01-2021|29-07-2022|
-------------- -------- ----- --------- ---- -------- ------------- -------- ---------- ----------
Already tried something like this;
import pandas as pd
log_file = 'Input.log'
df = pd.read_csv(log_file, sep=";", index_col=None, names=['text'])
s = df.text.str.split('|', expand=True).stack().str.split(': ', expand=True)
print(s)
Any help to get this done is greatly appreciated.
CodePudding user response:
I change data by expected ouput (remove Activity: Call
), then is possible change your solution with MultiIndex
and reshape by DataFrame.reindex
:
s = df.text.str.split('|', expand=True).stack().str.split(':', expand=True).apply(lambda x: x.str.strip())
ordering= s[0].unique()
df = s.droplevel(1).set_index(0, append=True)[1].unstack().reindex(ordering, axis=1)
print(df)
0 Name Date Time Activity Type Customer Customer type \
0 John Doe 30072022 05:06 Call NaN NaN NaN
1 John Doe 30072022 05:07 Search 99 NaN NaN
2 John Doe 30072022 05:07 Search 99 123456 NaN
3 Hannah Parkes 10082022 13:16 Archiving NaN 0465755 NaN
4 Shannon Harper 11082022 16:0 Call NaN NaN NaN
5 Shannon Harper 11082022 17:15 Search 99 985643 NaN
6 Shannon Harper 11082022 17:16 Document NaN 985643 FHR
7 Rachel Bolton 14092022 09:35 Search NaN NaN NaN
0 Contract StartDate EndDate
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 56575757 01-01-2021 29-07-2022