How do I create a DataFrame with an empty pandas Series in a column?


I'm trying to append each row of a DataFrame separately. Each row has a mix of Series and scalar values. An example of a row would be:

row = {'col1': 1, 'col2':'blah', 'col3': pd.Series(['first', 'second'])}

When I create a DataFrame from this, it looks like this

df = pd.DataFrame(row)
df
   col1  col2    col3
0     1  blah   first
1     1  blah  second

This is what I want. The scalar values are repeated, which is good. Now, some of my rows have an empty Series for that column, like so:

another_row = {'col1': 45, 'col2':'more blah', 'col3': pd.Series([], dtype='object')}

When I create a new DataFrame in order to concat the two, like so

second_df = pd.DataFrame(another_row)

I get back an empty DataFrame, which is not what I'm looking for.

>>> second_df = pd.DataFrame({'col1': 45, 'col2':'more blah', 'col3': pd.Series([], dtype='object')})
>>> second_df
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
>>>

What I'm actually after is something like this

>>> second_df
   col1       col2 col3
0    45  more blah  NaN

Or something like that. Basically, I don't want the entire row to be dropped on the floor, I want the empty Series to be represented by None or NaN or something.

I don't get any errors, and nothing warns me that anything is out of the ordinary, so I have no idea why the df is behaving like this.

CodePudding user response:

You could pass an index to make it work (and get the dataframe with NaN in the third column):

another_row = {'col1': 45, 'col2':'more blah', 'col3': pd.Series([], dtype='object')}
second_df = pd.DataFrame(another_row, index=[0])
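
As a quick check (a sketch, not output copied from a session): the empty Series gets aligned against the index you supply, it has no value at label 0, so col3 comes out as NaN:

import pandas as pd

another_row = {'col1': 45, 'col2': 'more blah', 'col3': pd.Series([], dtype='object')}
second_df = pd.DataFrame(another_row, index=[0])
print(second_df)  # expect a single row: 45, 'more blah', NaN in col3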

When passing all scalars and a Series, the number of rows is determined by the length of the Series – if the length is zero, so is the number of rows. You could pass singleton lists instead of scalars so the number of rows is no longer zero:

another_row = {'col1': [45], 'col2': ['more blah'], 'col3': [np.nan]}
second_df = pd.DataFrame(another_row)
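
To make the length rule concrete, here is a small sketch (assuming the usual pandas/numpy imports): a length-2 Series mixed with scalars gives two rows, while the singleton-list version above gives one:

import pandas as pd
import numpy as np

# Row count follows the Series length when scalars and a Series are mixed.
two_rows = pd.DataFrame({'col1': 1, 'col2': 'blah', 'col3': pd.Series(['first', 'second'])})
one_row = pd.DataFrame({'col1': [45], 'col2': ['more blah'], 'col3': [np.nan]})
print(len(two_rows), len(one_row))  # expect: 2 1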

Alternatively, pass all scalars and an index like above,

another_row = {'col1': 45, 'col2': 'more blah', 'col3': np.nan}
second_df = pd.DataFrame(another_row, index=[0])

but I'd probably just do

second_df = pd.DataFrame([[45, 'more blah', np.nan]], 
                         columns=['col1', 'col2', 'col3'])
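
Whichever construction you pick, concatenating it onto the first frame gives the row of NaN the question is after (a sketch, assuming df is the two-row frame built from row in the question):

combined = pd.concat([df, second_df], ignore_index=True)
print(combined)  # expect three rows; the last one is 45, 'more blah', NaN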

CodePudding user response:

Ultimately, I reworked my code to avoid having this problem. My solution is as follows:

I have a function do_data_stuff() that used to return a pandas Series, but now I have changed it to return

  • a Series if there's stuff in it, e.g. Series([1, 2, 3]),
  • or np.nan if it would be empty.

A side effect of going with this approach was that the DataFrame constructor now requires an index when only scalars are passed: "ValueError: If using all scalar values, you must pass an index".

So I can't pass a hard-coded index=[0] like that, because I want the Series to determine the number of rows in the DataFrame automatically.

row = {'col1': 1, 'col2':'blah', 'col3': pd.Series(['first', 'second'])}
df = pd.DataFrame(row)
df
   col1  col2    col3
0     1  blah   first
1     1  blah  second

So what I ended up doing was computing the index argument dynamically. I'm not sure if this is proper Python, but it worked for me.

stuff = do_data_stuff()
data = pd.DataFrame(
    {
        'col1': 45,
        'col2': 'very awesome stuff',
        'col3': stuff,
    },
    index=[0] if stuff is np.nan else None,
)
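
If the stuff is np.nan identity check ever feels fragile, an equivalent sketch (assuming do_data_stuff() only ever returns a pandas Series or np.nan, as described above) is to key the index off the return type instead:

stuff = do_data_stuff()
data = pd.DataFrame(
    {'col1': 45, 'col2': 'very awesome stuff', 'col3': stuff},
    # Let a non-empty Series drive the row count; otherwise pin a single row at index 0.
    index=None if isinstance(stuff, pd.Series) else [0],
)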

And then I concatenated my DataFrames using the following:

data = pd.concat([data, some_other_df], ignore_index=True)

The result was a DataFrame that looks like this

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'col1': 1, 'col2':'blah', 'col3': pd.Series(['first', 'second'])})
>>> df
   col1  col2    col3
0     1  blah   first
1     1  blah  second
>>> stuff = np.nan
>>> stuff
nan
>>> df = pd.concat([
...     df, pd.DataFrame(
...         {
...             'col1': 45,
...             'col2': 'more awesome stuff',
...             'col3': stuff
...         },
...         index=[0] if stuff is np.nan else None
...     )], ignore_index=True)
>>> df
   col1                col2    col3
0     1                blah   first
1     1                blah  second
2    45  more awesome stuff     NaN

You can replace np.nan with anything, like "".
