I have thousands of arrays of differing lengths (daily data) which I want to add to a "master" df by the dates. I am currently doing this via the below code. Data is a list of lists with the first list e.g i[0] referring to the specified identifier and i[1] being a dictionary of values and data points related to the specific identifier.
for i in data:
identifier = i[0]
obs = i[1]["obs"]
dates = i[1]["dates"]
tempDf = pd.DataFrame(obs, index = dates, columns = [identifier])
tempDf.index = tempDf.index.map(str)
masterDF=pd.merge(masterDF, tempDf, how = 'outer', left_index=True, right_index=True)
This is really really slow showing a significant performance drop after a thousand or so iterations. What is the best way to improve the performance? I am thinking about trying to just write the enitre code base in numpy but I like the use of pandas to check stuff later on and to graph stuff in the future (i am not that experienced with matplot lib).
A simple example of the code would be to have these 3 dfs where 1,2,3 etc are index dates.
masterdf:
col1,col2, col3
1,x,y,z
2,x,y,z
3,x,y,z
4,x,y,z
5,x,y,z
6,x,y,z
7,x,y,z
df1:
df1
5,x
6,x
7,x
df2:
df2
2,y
3,y
4,y
5,y
6,y
7,y
desired output:
col1 col2 col3 df1 df2
1 x y z NaN NaN
2 x y z NaN y
3 x y z NaN y
4 x y z NaN y
5 x y z x y
6 x y z x y
7 x y z x y
CodePudding user response:
You can use concat
. To extract the data from your data
list, maybe this would do it
res = pd.concat(
[masterDF]
[pd.DataFrame(vals['obs'], index=vals['dates'], columns=[identifier])
for identifier, vals in data],
axis=1
)
CodePudding user response:
IIUC, data
looks like:
data = [['df1', {'dates': [5, 6, 7], 'obs': ['x', 'x', 'x']}],
['df2', {'dates': [2, 3, 4, 5, 6, 7], 'obs': ['y', 'y', 'y', 'y', 'y', 'y']}]]
Then, maybe it's more efficient to modify data
first before using pandas. You could create a nested dictionary from data
where the outer keys correspond to column names and inner keys correspond to indexes (in this case, "dates"), so that you could pass it to the DataFrame constructor to create a single DataFrame. Then concat
the new DataFrame with masterdf
:
new_df = pd.DataFrame({k: dict(zip(v['dates'], v['obs'])) for k, v in data})
out = pd.concat((masterdf, new_df), axis=1)
Output:
col1 col2 col3 df1 df2
1 x y z NaN NaN
2 x y z NaN y
3 x y z NaN y
4 x y z NaN y
5 x y z x y
6 x y z x y
7 x y z x y