how can I efficiently add arrays of differing lengths to a df

Time: 03-05

I have thousands of arrays of differing lengths (daily data) which I want to add to a "master" df, aligned by date. I am currently doing this with the code below. `data` is a list of lists: for each entry `i`, `i[0]` is the identifier and `i[1]` is a dictionary of values and data points for that identifier.

for i in data:
    identifier = i[0]
    obs = i[1]["obs"]
    dates = i[1]["dates"]
    tempDf = pd.DataFrame(obs, index=dates, columns=[identifier])
    tempDf.index = tempDf.index.map(str)
    masterDF = pd.merge(masterDF, tempDf, how='outer', left_index=True, right_index=True)

This is really slow, with a significant performance drop after a thousand or so iterations. What is the best way to improve the performance? I am thinking about rewriting the entire code base in numpy, but I like using pandas to check things later on and to graph things in the future (I am not that experienced with matplotlib).

A simple example would be these three dfs, where 1, 2, 3, etc. are index dates.

masterdf:

  col1,col2,col3
1,x,y,z
2,x,y,z
3,x,y,z
4,x,y,z
5,x,y,z
6,x,y,z
7,x,y,z

df1:

  df1
5,x
6,x
7,x

df2:

  df2
2,y
3,y
4,y
5,y
6,y
7,y

desired output:

   col1 col2  col3  df1  df2
1     x    y     z  NaN  NaN
2     x    y     z  NaN    y
3     x    y     z  NaN    y
4     x    y     z  NaN    y
5     x    y     z    x    y
6     x    y     z    x    y
7     x    y     z    x    y

CodePudding user response:

You can use concat. To extract the data from your data list, maybe this would do it:

res = pd.concat(
    [masterDF]
    + [pd.DataFrame(vals['obs'], index=vals['dates'], columns=[identifier])
       for identifier, vals in data],
    axis=1
)

CodePudding user response:

IIUC, data looks like:

data = [['df1', {'dates': [5, 6, 7], 'obs': ['x', 'x', 'x']}],
        ['df2', {'dates': [2, 3, 4, 5, 6, 7], 'obs': ['y', 'y', 'y', 'y', 'y', 'y']}]]

Then it may be more efficient to reshape data before handing it to pandas. Build a nested dictionary from data where the outer keys are the column names and the inner keys are the index labels (here, the "dates"), pass it to the DataFrame constructor to create a single DataFrame, and then concat that new DataFrame with masterdf:

new_df = pd.DataFrame({k: dict(zip(v['dates'], v['obs'])) for k, v in data})
out = pd.concat((masterdf, new_df), axis=1)

Output:

  col1 col2 col3  df1  df2
1    x    y    z  NaN  NaN
2    x    y    z  NaN    y
3    x    y    z  NaN    y
4    x    y    z  NaN    y
5    x    y    z    x    y
6    x    y    z    x    y
7    x    y    z    x    y
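Putting the pieces together, this is a runnable sketch of the nested-dict approach (the `masterdf` contents are assumed from the question's example):

```python
import pandas as pd

# Assumed sample inputs matching the question's example
masterdf = pd.DataFrame({'col1': list('xxxxxxx'),
                         'col2': list('yyyyyyy'),
                         'col3': list('zzzzzzz')}, index=range(1, 8))
data = [['df1', {'dates': [5, 6, 7], 'obs': ['x', 'x', 'x']}],
        ['df2', {'dates': [2, 3, 4, 5, 6, 7], 'obs': ['y'] * 6}]]

# Nested dict: outer keys become columns, inner keys become the index,
# so the DataFrame constructor does all the alignment in one pass
new_df = pd.DataFrame({k: dict(zip(v['dates'], v['obs'])) for k, v in data})
out = pd.concat((masterdf, new_df), axis=1)
print(out)
```

The dict comprehension is plain-Python work, so the per-array cost stays cheap; pandas only aligns indexes once, in the constructor and the single `concat`.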