Converting a list of dictionary-like objects containing repeated variable names and variable values-CodePudding

I am looking to convert the data in this PDF document of salaries at Oregon State University to a pandas dataframe. As you can see, in the document, the structure is such that what will become variable names are repeatedly throughout the document, followed by a colon and the column value for each observation. I've pulled in each line of this document using PDFBox and been able to clean things so that they end up in a list of tuples like the sample below. The basic thing I want to do is take this list of tuples, which has the variable names appearing repeatedly in element one of the tuple and the variable values in the second element and convert it to a dataframe where those values are all aligned under the correct corresponding columns.

data = ## Heading ##[('Name', ' Abbas, Houssam'),
     ('First Hired', ' 31-DEC-2018'),
     ('Home Orgn', ' ESE - Sch Elect Engr/Comp Sci'),
     ('Adj Service Date', ' 31-DEC-2018'),
     ('Job Orgn', ' ESE - Sch Elect Engr/Comp Sci'),
     ('Job Type', ' P'),
     ('Job Title', ' Assistant Professor'),
     ('Posn-Suff', ' C18336-00'),
     ('Rank', ' Assistant Professor'),
     ('Rank Effective Date', ' 31-DEC-2018'),
     ('Appt Begin Date', ' 31-DEC-2018'),
     ('Appt Percent', ' 100'),
     ('Appt End Date', ' N/A'),
     ('Annual Salary Rate', '   92961.00    9 mo'),
     ('Name', ' Abbasi, Bahman'),
     ('First Hired', ' 01-AUG-2017'),
     ('Home Orgn', ' LCB - Acad Prog / Student Aff'),
     ('Adj Service Date', ' 01-AUG-2017'),
     ('Job Orgn', ' EMM - Sch of Mech/Ind/Mfg Engr'),
     ('Job Type', ' O'),
     ('Job Title', ' Assistant Professor'),
     ('Posn-Suff', ' C18194-00'),
     ('Rank', ' Assistant Professor'),
     ('Rank Effective Date', ' 16-SEP-2017'),
     ('Appt Begin Date', ' 16-SEP-2017'),
     ('Appt Percent', ' 100'),
     ('Appt End Date', ' N/A'),
     ('Annual Salary Rate', '   97659.00    9 mo'),
     ('Job Orgn', ' LCB - Acad Prog / Student Aff'),
     ('Job Type', ' P'),
     ('Job Title', ' Assistant Professor'),
     ('Posn-Suff', ' C11566-00'),
     ('Rank', ' Assistant Professor'),
     ('Rank Effective Date', ' 16-SEP-2017'),
     ('Appt Begin Date', ' 16-SEP-2020'),
     ('Appt Percent', ' 100'),
     ('Appt End Date', ' 15-JUN-2021'),
     ('Annual Salary Rate', '   98811.00    9 mo')]

Where the (abbreviated version of the) desired final dataframe would be:

| Name           | First Hired | Home Orgn                     | Adj Service Date | 
| Abbas, Houssam | 31-DEC-2018 | ESE - Sch Elect Engr/Comp Sci | 31-DEC-2018      | 
| Abbasi, Bahman | 01-AUG-2017 | LCB - Acad Prog / Student Aff | 01-AUG-2017      |

I have tried variations on the solutions presented here, here, and here, but haven't had any luck. Any advice (including perhaps a different way to process the original raw data) would be greatly appreciated!

CodePudding user response：

You just convert the list of tuples to a list of dictionaries. Pandas will do the rest:

import pandas as pd

data = [('Name', ' Abbas, Houssam'),
     ('First Hired', ' 31-DEC-2018'),
     ('Home Orgn', ' ESE - Sch Elect Engr/Comp Sci'),
     ('Adj Service Date', ' 31-DEC-2018'),
     ('Job Orgn', ' ESE - Sch Elect Engr/Comp Sci'),
     ('Job Type', ' P'),
     ('Job Title', ' Assistant Professor'),
     ('Posn-Suff', ' C18336-00'),
     ('Rank', ' Assistant Professor'),
     ('Rank Effective Date', ' 31-DEC-2018'),
     ('Appt Begin Date', ' 31-DEC-2018'),
     ('Appt Percent', ' 100'),
     ('Appt End Date', ' N/A'),
     ('Annual Salary Rate', '   92961.00    9 mo'),
     ('Name', ' Abbasi, Bahman'),
     ('First Hired', ' 01-AUG-2017'),
     ('Home Orgn', ' LCB - Acad Prog / Student Aff'),
     ('Adj Service Date', ' 01-AUG-2017'),
     ('Job Orgn', ' EMM - Sch of Mech/Ind/Mfg Engr'),
     ('Job Type', ' O'),
     ('Job Title', ' Assistant Professor'),
     ('Posn-Suff', ' C18194-00'),
     ('Rank', ' Assistant Professor'),
     ('Rank Effective Date', ' 16-SEP-2017'),
     ('Appt Begin Date', ' 16-SEP-2017'),
     ('Appt Percent', ' 100'),
     ('Appt End Date', ' N/A'),
     ('Annual Salary Rate', '   97659.00    9 mo'),
     ('Job Orgn', ' LCB - Acad Prog / Student Aff'),
     ('Job Type', ' P'),
     ('Job Title', ' Assistant Professor'),
     ('Posn-Suff', ' C11566-00'),
     ('Rank', ' Assistant Professor'),
     ('Rank Effective Date', ' 16-SEP-2017'),
     ('Appt Begin Date', ' 16-SEP-2020'),
     ('Appt Percent', ' 100'),
     ('Appt End Date', ' 15-JUN-2021'),
     ('Annual Salary Rate', '   98811.00    9 mo')]

rows = []
for k,v in data:
    if k == 'Name':
        rows.append({})
    rows[-1][k]=v
#print(rows)
df = pd.DataFrame(rows)
print(df)

Output:

              Name   First Hired  ... Appt End Date   Annual Salary Rate
0   Abbas, Houssam   31-DEC-2018  ...           N/A     92961.00    9 mo
1   Abbasi, Bahman   01-AUG-2017  ...   15-JUN-2021     98811.00    9 mo

[2 rows x 14 columns]