In pandas, replace table column with Series while joining indexes-CodePudding

I have a table with preexisting columns, and I want to entirely replace some of those columns with values from a series. The tricky part is that each series will have different indexes and I need to add these varying indexes to the table as necessary, like doing a join/merge operation.

For example, this code generates a table and 5 series where each series only has a subset of the indexes.

import random
cols=['a', 'b', 'c', 'd', 'e', 'f', 'g']
table = pd.DataFrame(columns=cols)
series = []
for i in range(5):
    series.append(
        pd.Series(
            np.random.randint(0, 3, 2)*10,
            index=pd.Index(random.sample(range(3), 2))
        )
    )
series

Output:

[1    10
 2     0
 dtype: int32,
 2     0
 0    20
 dtype: int32,
 2    20
 1     0
 dtype: int32,
 2     0
 0    10
 dtype: int32,
 1    20
 2    10
 dtype: int32]

But when I try to replace columns of the table with the series, a simple assignment doesn't work

for i in range(5):
    col = cols[i]
    table[col] = series[i]
table

Output:

    a   b   c   d   e   f   g
1   10  NaN 0   NaN 20  NaN NaN
2   0   0   20  0   10  NaN NaN

because the assignment won't add any more indexes after the first series is assigned

Other things I've tried:

combine or combine_first gives the same result as above. (table[col] = table[col].combine(series[i], lambda a, b: b) and table[col] = series[i].combine_first(table[col]))
pd.concat doesn't work either because of duplicate labels (table[col] = pd.concat([table[col], series[i]]) gives ValueError: cannot reindex on an axis with duplicate labels) and I can't just drop the duplicates because other columns may already have values in those indexes
DataFrame.update won't work since it only takes indexes from the table (join='left'). I need to add indexes from the series to the table as necessary.

Of course, I can always do something like this:

table = table.join(series[i].rename('new'), how='outer')
table[col] = table.pop('new')

which gives the correct result:

    a   b   c   d   e   f   g
0   NaN     20.0    NaN     10.0    NaN     NaN     NaN
1   10.0    NaN     0.0     NaN     20.0    NaN     NaN
2   0.0     0.0     20.0    0.0     10.0    NaN     NaN

But it's doing it in quite a roundabout way, and still isn't robust to column name collisions, so you'd have to add a handful more code to fiddle with column names and protect against that. This produces quite verbose and ugly code for what is a conceptually a very simple operation, that I believe there must be a better way of doing it.

CodePudding user response：

pd.concat should work along the column axis:

out = pd.concat(series, axis=1)
print(out)

# Output
      0     1    2    3     4
0  10.0   0.0  0.0  NaN  10.0
1   NaN  10.0  NaN  0.0  20.0
2   0.0   NaN  0.0  0.0   NaN

CodePudding user response：

You could try constructing the dataframe using a dict comprehension like this:

series:

[0    10
 1     0
 dtype: int64,
 0    0
 1    0
 dtype: int64,
 2    20
 0     0
 dtype: int64,
 0    20
 2     0
 dtype: int64,
 0    0
 1    0
 dtype: int64]

code:

table = pd.DataFrame({
    col: series[i]
    for i, col in enumerate(cols)
    if i < len(series)
})
table

output:

      a    b     c     d    e
0  10.0  0.0   0.0  20.0  0.0
1   0.0  0.0   NaN   NaN  0.0
2   NaN  NaN  20.0   0.0  NaN

If you really need the nan columns at the end you could do:

table = pd.DataFrame({
    col: series[i] if i < len(series) else np.nan
    for i, col in enumerate(cols)
})

Output:

      a    b     c     d    e   f   g
0  10.0  0.0   0.0  20.0  0.0 NaN NaN
1   0.0  0.0   NaN   NaN  0.0 NaN NaN
2   NaN  NaN  20.0   0.0  NaN NaN NaN