I have a table with a dataset something like the following. Its 10 cols and 30 rows. The order of the records in the 'names' dictionary cell is important. There can be more than 1 of each Title.
data =[
['4/18/2005', [{'grantor': 'Company1'}, {'grantee': 'Company2'}]],
['3/29/2005', [{'grantor': 'Company3'}, {'grantor': 'Company1'}, {'grantor': 'Company4'}, {'grantee': 'Company5'}, {'grantee': 'Company2'}]],
['3/29/2005', [{'grantor': 'Company2'}, {'grantor': 'Company9'}, {'grantor': 'Apple'}, {'grantee': 'CompnayX'}, {'grantee': 'CompanyY'}, {'grantee': 'CompanyR'}]]
]
df = pd.DataFrame(data=data, columns=['fdate', 'names'])
I would like to do the 2 following tasks:
- Process a single row. I would like to read the Names cell for a selected row, and convert it into something like this (row 1)
data = [
[{'Title': 'grantor', 'Company': 'Company1'}],
[{'Title': 'grantee', 'Company': 'Company2'}]
]
df = pd.DataFrame(data)
- Expand the entire data set
data = [
[{'fdate': '4/18/2005', 'pos':'1', 'Title': 'grantor', 'Company': 'Company1'}],
[{'fdate': '4/18/2005', 'pos':'2', 'Title': 'grantee', 'Company': 'Company2'}],
[{'fdate': '3/29/2005', 'pos':'1', 'Title': 'grantor', 'Company': 'Company3'}],
[{'fdate': '3/29/2005', 'pos':'2', 'Title': 'grantor', 'Company': 'Company1'}],
[{'fdate': '3/29/2005', 'pos':'3', 'Title': 'grantor', 'Company': 'Company4'}],
[{'fdate': '3/29/2005', 'pos':'4', 'Title': 'grantor', 'Company': 'Company5'}],
[{'fdate': '3/29/2005', 'pos':'5', 'Title': 'grantee', 'Company': 'Company2'}]
]
df = pd.DataFrame(data)
CodePudding user response:
Here is my attempt at a solution:
I have labelled the parts where Task 1 is complete
import pprint
import pandas as pd
#Task 1
def process_row(datum, i):
ret = []
new_dict = {}
information = datum[i][1]
for info in information:
for (key, value) in info.items():
new_dict['Title'] = key
new_dict['Company'] = value
ret.append([new_dict.copy()])
new_dict.clear()
return ret
#Need to create two dicts for Task 2
def process_date(datum, i):
return {'fdate': datum[i][0]}
pd.set_option('display.max_colwidth', None)
data =[
['4/18/2005', [{'grantor': 'Company1'}, {'grantee': 'Company2'}]],
['3/29/2005', [{'grantor': 'Company3'}, {'grantor': 'Company1'}, {'grantor': 'Company4'}, {'grantee': 'Company5'}, {'grantee': 'Company2'}]],
['3/29/2005', [{'grantor': 'Company2'}, {'grantor': 'Company9'}, {'grantor': 'Apple'}, {'grantee': 'CompnayX'}, {'grantee': 'CompanyY'}, {'grantee': 'CompanyR'}]]
]
#This line prints the result for Task 1.
#You should iterate if you need the function to apply to other rows.
print(process_row(data, 0))
#print(process_date(data, 0))
new_data = []
for i in range(len(data)):
temp = process_row(data, i)
for index, element in enumerate(temp):
new_dict ={}
new_dict.update(process_date(data, i))
new_dict.update({'pos': index 1})
for kv in element:
new_dict.update(kv)
new_data.append([new_dict])
pp=pprint.PrettyPrinter(indent=2)
pp.pprint(new_data)
Here is a picture of the output
The first line is the output for Task 1. The rest of the output is for Task 2, note that the it includes the third element in the original data as well.
Output: https://imgur.com/a/M6TvO70