Pandas : Confused when extending DataFrame vs Series (Column/Index). Why the difference?

Time:01-04

First off, I have already read the answers to several similar questions, but none of them made it clear to me why the Series and DataFrame approaches differ.
Some of the Pandas documentation is also unclear. For example, in the entry for Series.reindex, https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html the examples switch to a DataFrame rather than a Series, yet the two sets of methods don't seem to overlap exactly.

So, now to it, first with a DataFrame.

> df = pd.DataFrame(np.random.randn(6,4), index=range(6), columns=list('ABCD'))
> df
Out[544]: 
          A         B         C         D
0  0.136833 -0.974500  1.708944  0.435174
1 -0.357955 -0.775882 -0.208945  0.120617
2 -0.002479  0.508927 -0.826698 -0.904927
3  1.955611 -0.558453 -0.476321  1.043139
4 -0.399369 -0.361136 -0.096981  0.092468
5 -0.130769 -0.075684  0.788455  1.640398

Now, to add new columns, I can do something simple (2 ways, same result).

> df[['X','Y']] = (99,-99)
> df.loc[:,['X','Y']] = (99,-99)
> df
Out[557]: 
          A         B         C         D   X   Y
0  0.858615 -0.552171  1.225210 -1.700594  99 -99
1  1.062435 -1.917314  1.160043 -0.058348  99 -99
2  0.023910  1.262706 -1.924022 -0.625969  99 -99
3  1.794365  0.146491 -0.103081  0.731110  99 -99
4 -1.163691  1.429924 -0.194034  0.407508  99 -99
5  0.444909 -0.905060  0.983487 -4.149244  99 -99

Now, with a Series, I have hit a (mental?) block trying the same.
I'm going to use a loop to construct a list of Series that will eventually become a DataFrame, but I want to deal with each 'row' as a Series first (to make development easier).

> ss = pd.Series(np.random.randn(4), index=list('ABCD'))
> ss
Out[552]: 
A    0.078013
B    1.707052
C   -0.177543
D   -1.072017
dtype: float64

> ss['X','Y'] = (99,-99)
Traceback (most recent call last):
...
KeyError: "None of [Index(['X', 'Y'], dtype='object')] are in the [index]"

Same for,

> ss[['X','Y']] = (99,-99)
> ss.loc[['X','Y']] = (99,-99)
KeyError: "None of [Index(['X', 'Y'], dtype='object')] are in the [index]"

The only way I can get this working is the rather clumsy (IMHO),

> ss['X'],ss['Y'] = (99,-99)
> ss
Out[560]: 
A     0.078013
B     1.707052
C    -0.177543
D    -1.072017
X    99.000000
Y   -99.000000
dtype: float64

I did think that, perhaps, reindexing the Series to add the new indices prior to assignment might solve the problem. It would, but then I hit an issue trying to change the index.

> ss = pd.Series(np.random.randn(4), index=list('ABCD'), name='z')
> xs = pd.Series([99,-99], index=['X','Y'], name='z')

Here I can concat my 2 Series to create a new one, and I can also concat the Series indices, e.g.,

> ss.index.append(xs.index)
Index(['A', 'B', 'C', 'D', 'X', 'Y'], dtype='object')

But I can't extend the current index with,

> ss.index = ss.index.append(xs.index)
ValueError: Length mismatch: Expected axis has 4 elements, new values have 6 elements

So, what intuitive leap must I make to understand why the Series methods above don't work, but (what looks like) the equivalent DataFrame method does? It makes passing multiple outputs back from a function into new Series elements a bit clunky. I can't make up new Series index names 'on the fly' to insert values into my existing Series object.
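
For context, the workflow described above (build each row as a Series in a loop, extend it with several function outputs, then combine the rows into a DataFrame) could be sketched roughly as follows. This is only an illustration: compute_extras is a hypothetical stand-in for a function with multiple outputs, and pd.concat is used here because it returns a new, longer Series rather than modifying the original.

```python
import numpy as np
import pandas as pd


def compute_extras(row):
    # Hypothetical helper: stands in for any function returning several values.
    return {"X": row.max(), "Y": row.min()}


rows = []
for i in range(3):
    row = pd.Series(np.random.randn(4), index=list("ABCD"), name=i)
    extras = pd.Series(compute_extras(row), name=i)
    # pd.concat builds a NEW Series holding both sets of labels;
    # rename(i) makes the row label explicit.
    row = pd.concat([row, extras]).rename(i)
    rows.append(row)

# Each Series becomes one row; the Series names become the row labels.
df = pd.DataFrame(rows)
print(df)
```

Because every row ends up with the same six labels, the resulting DataFrame has columns A, B, C, D, X, Y.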

CodePudding user response:

I don't think you can directly modify the Series in place to add multiple values at once.

If having a new object is not an issue:

ss = pd.Series(np.random.randn(4), index=list('ABCD'), name='z')
xs = pd.Series([99,-99], index=['X','Y'], name='z')

# new object with updated index
ss = ss.reindex(ss.index.union(xs.index))
ss.update(xs)

Output:

A    -0.369182
B    -0.239379
C     1.099660
D     0.655264
X    99.000000
Y   -99.000000
Name: z, dtype: float64
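
One caveat worth checking with the reindex approach: by default Index.union sorts the combined labels when it can, so a Series whose index is not already sorted may come back in a different order. The sketch below uses made-up values and passes sort=False, which should keep the labels in order of appearance.

```python
import pandas as pd

# Made-up data with a deliberately unsorted index.
ss = pd.Series([1.0, 2.0], index=["D", "A"], name="z")
xs = pd.Series([99.0, -99.0], index=["X", "B"], name="z")

# Default behaviour: the union is sorted when the labels are comparable.
sorted_idx = ss.index.union(xs.index)
# sort=False keeps ss's labels first, followed by the new labels from xs.
ordered_idx = ss.index.union(xs.index, sort=False)

out = ss.reindex(ordered_idx)  # new labels start out as NaN
out.update(xs)                 # fill them in from xs
print(list(sorted_idx), list(ordered_idx))
print(out)
```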

In-place alternative using a function:

ss = pd.Series(np.random.randn(4), index=list('ABCD'), name='z')
xs = pd.Series([99,-99], index=['X','Y'], name='z')

def extend(s1, s2):
    s1.update(s2) # update common indices
    # add others
    for idx, val in s2[s2.index.difference(s1.index)].items():
        s1[idx] = val
        
extend(ss, xs)

Updated ss:

A     0.279925
B    -0.098150
C     0.910179
D     0.317218
X    99.000000
Y   -99.000000
Name: z, dtype: float64
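
To see how the extend function above behaves when the two Series share a label, here is a small check with made-up values in which 'D' appears in both. The shared label is overwritten by update, and the new label is appended by the loop.

```python
import pandas as pd


def extend(s1, s2):
    s1.update(s2)  # overwrite values at labels present in both
    # add the labels that exist only in s2, one at a time
    for idx, val in s2[s2.index.difference(s1.index)].items():
        s1[idx] = val


# Made-up values; 'D' is in both Series, 'X' only in xs.
ss = pd.Series([1.0, 2.0, 3.0, 4.0], index=list("ABCD"), name="z")
xs = pd.Series([99.0, -99.0], index=["D", "X"], name="z")

extend(ss, xs)
print(ss)
```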

CodePudding user response:

While I have accepted @mozway's answer above since it nicely handles extending the Series even when there are possible index conflicts, I'm adding this 'answer' to demonstrate my point about the inconsistency in the extend operation between Series and DataFrame.

If I create my Series as single row DataFrames, as below, I can now extend the 'series' as I expected.

z=pd.Index(['z'])
ss = pd.DataFrame(np.random.randn(1,4), columns=list('ABCD'),index=z)
xs = pd.DataFrame([[99,-99]], columns=['X','Y'],index=z)

ss
Out[619]: 
          A         B         C         D
z  1.052589 -0.337622 -0.791994 -0.266888
ss[['x','y']] = xs
ss
Out[620]: 
          A         B         C         D   x   y
z  1.052589 -0.337622 -0.791994 -0.266888  99 -99
type(ss)
Out[621]: pandas.core.frame.DataFrame

Note that, as a DataFrame, I don't even need a Series for the extend object.

ss[['X','Y']] = [123,-123]

ss
Out[633]: 
          A         B         C         D    X    Y
z  0.600981 -0.473031  0.216941  0.255252  123 -123

So I've simply extended the DataFrame, but it's still a DataFrame of 1 row. I can now either 'squeeze' the DataFrame,

zz1=ss.squeeze()

type(zz1)
Out[624]: pandas.core.series.Series
zz1
Out[625]: 
A     1.052589
B    -0.337622
C    -0.791994
D    -0.266888
x    99.000000
y   -99.000000
Name: z, dtype: float64

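Alternatively, I can use 'iloc[0]' to get a Series directly. Note that 'loc' given the Index object z (or a list such as ['z']) returns a DataFrame, not a Series, and still requires 'squeezing'; with the scalar label, ss.loc['z'], it returns a Series.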

zz2=ss.iloc[0]

type(zz2)
Out[629]: pandas.core.series.Series

zz2
Out[630]: 
A     1.052589
B    -0.337622
C    -0.791994
D    -0.266888
x    99.000000
y   -99.000000
Name: z, dtype: float64
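
As a quick check of how the selector type affects the result, the sketch below rebuilds a one-row DataFrame like the one above and compares the different forms. A scalar label or integer position gives a Series, while a list or Index of labels gives a one-row DataFrame.

```python
import numpy as np
import pandas as pd

z = pd.Index(["z"])
ss = pd.DataFrame(np.random.randn(1, 4), columns=list("ABCD"), index=z)

by_label = ss.loc["z"]    # scalar label      -> Series
by_list = ss.loc[["z"]]   # list of labels    -> one-row DataFrame
by_index = ss.loc[z]      # Index of labels   -> one-row DataFrame
by_pos = ss.iloc[0]       # integer position  -> Series

for name, obj in [("loc['z']", by_label), ("loc[['z']]", by_list),
                  ("loc[z]", by_index), ("iloc[0]", by_pos)]:
    print(name, type(obj).__name__)
```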

Please note, I'm not a Pandas 'wizard' so there may be other insights that I lack.
