I'm retrieving data from a DB, and so far I have a list of dictionaries. In each dict I'm interested in the keys title, authors, publishedAt and tickers. Some dicts have tickers empty, so I ignore those. Some dicts may have a list of multiple tickers and multiple authors, so in this case I'd like my DataFrame to have one row for each author and ticker, and the title and publishedAt must repeat throughout those rows.
Example of dictionary:
dict = {'title':'Report',
'authors':[{'name': 'Mike'},{'name':'John'}],
'tickers':[{'code':'GOOGL'},{'code':'AAPL'}],
'publishedAt':'2022-03-31'
}
The desired outcome would be something like:
Author | Date | Title | Ticker |
---|---|---|---|
Mike | 2022-03-31 | Report | GOOGL |
Mike | 2022-03-31 | Report | AAPL |
John | 2022-03-31 | Report | GOOGL |
John | 2022-03-31 | Report | AAPL |
I was trying to do something like this, where publications is my list of dicts:
df = pd.DataFrame(columns=['Author','Date','Title','Ticker'])
for publication in publications:
    qty_tickers = len(publication['tickers'])
    qty_authors = len(publication['authors'])
    qty_total = qty_tickers * qty_authors
    if qty_total >= 1:
        start = len(df) + 1
        df.iloc[start:start + qty_total,1:2] = [publication['publishedAt'],publication['title']]
This last bit is already not working, as I receive ValueError: could not broadcast input array from shape (2,) into shape (0,1).
And I didn't even know how to start with the Authors and Tickers.
CodePudding user response:
One way to get this result, inside your for publication in publications loop, would be:
for author in publication['authors']:
    for ticker in publication['tickers']:
        df = pd.concat([
            df,
            pd.DataFrame({
                'Author': author['name'],
                'Date': publication['publishedAt'],
                'Title': publication['title'],
                'Ticker': ticker['code']
            }, index=[0])
        ])
This uses pd.concat() to concatenate the original DataFrame df with a new one-row DataFrame built from a dict. Because every value in that dict is a scalar, pandas needs an explicit index to build the DataFrame, which is why index=[0] is passed. As a result every appended row carries the index label 0; pass ignore_index=True to pd.concat() if you want a running index instead.
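For reference, here is a self-contained sketch of the same idea. The sample publications list is made up to match the question. As a variation on the loop above, it collects the one-row frames in a list and calls pd.concat() once with ignore_index=True, which avoids copying the growing DataFrame on every iteration.

```python
import pandas as pd

# Hypothetical sample data, shaped like the dictionary in the question.
publications = [
    {
        'title': 'Report',
        'authors': [{'name': 'Mike'}, {'name': 'John'}],
        'tickers': [{'code': 'GOOGL'}, {'code': 'AAPL'}],
        'publishedAt': '2022-03-31',
    },
]

columns = ['Author', 'Date', 'Title', 'Ticker']
frames = []
for publication in publications:
    for author in publication['authors']:
        for ticker in publication['tickers']:
            # All values are scalars, so an explicit index is required.
            frames.append(pd.DataFrame({
                'Author': author['name'],
                'Date': publication['publishedAt'],
                'Title': publication['title'],
                'Ticker': ticker['code'],
            }, index=[0]))

# pd.concat() raises on an empty list, so fall back to an empty frame.
if frames:
    df = pd.concat(frames, ignore_index=True)
else:
    df = pd.DataFrame(columns=columns)
```

With ignore_index=True the resulting rows are labelled 0 to 3 rather than all being labelled 0.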
CodePudding user response:
Instead of populating the DataFrame dynamically, you should first perform the data wrangling and only then construct the DataFrame from the result. In this case, it makes sense to create a list of records (dictionaries, one per row of the DataFrame) based on publications:
import pandas as pd

publications = [
    {
        'title':'Report',
        'authors':[{'name': 'Mike'},{'name':'John'}],
        'tickers':[{'code':'GOOGL'},{'code':'AAPL'}],
        'publishedAt':'2022-03-31'
    },
    {
        'title':'Something',
        'authors':[{'name': 'Someone'},{'name':'Other'}],
        'tickers':[{'code':'ABC'},{'code':'DEF'}],
        'publishedAt':'2022-03-31'
    }
]

records = []
for publication in publications:
    for author in publication['authors']:
        for ticker in publication['tickers']:
            rec = {
                'Author': author['name'],
                'Date': publication['publishedAt'],
                'Title': publication['title'],
                'Ticker': ticker['code']
            }
            records.append(rec)

df = pd.DataFrame(records)
Output:
>>> df
Author Date Title Ticker
0 Mike 2022-03-31 Report GOOGL
1 Mike 2022-03-31 Report AAPL
2 John 2022-03-31 Report GOOGL
3 John 2022-03-31 Report AAPL
4 Someone 2022-03-31 Something ABC
5 Someone 2022-03-31 Something DEF
6 Other 2022-03-31 Something ABC
7 Other 2022-03-31 Something DEF
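The same records can also be built with itertools.product, which pairs every author with every ticker. The sketch below is one possible variant, not the only way to do it; the second publication (with an empty tickers list) is invented to show that such entries produce no rows, which matches the requirement in the question.

```python
from itertools import product

import pandas as pd

# Hypothetical sample data; the second entry has no tickers on purpose.
publications = [
    {
        'title': 'Report',
        'authors': [{'name': 'Mike'}, {'name': 'John'}],
        'tickers': [{'code': 'GOOGL'}, {'code': 'AAPL'}],
        'publishedAt': '2022-03-31',
    },
    {
        'title': 'No tickers',
        'authors': [{'name': 'Ann'}],
        'tickers': [],
        'publishedAt': '2022-04-01',
    },
]

# product() yields nothing when either list is empty, so publications
# without tickers (or without authors) are skipped automatically.
records = [
    {
        'Author': author['name'],
        'Date': pub['publishedAt'],
        'Title': pub['title'],
        'Ticker': ticker['code'],
    }
    for pub in publications
    for author, ticker in product(pub['authors'], pub['tickers'])
]

# Passing columns keeps the column layout even if records is empty.
df = pd.DataFrame(records, columns=['Author', 'Date', 'Title', 'Ticker'])
```

The nested for loops in the answer skip empty lists in exactly the same way, so no explicit length check is needed in either version.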