I'm trying to append each row of a DataFrame separately. Each row has Series and scalar values. An example of a row would be:
row = {'col1': 1, 'col2':'blah', 'col3': pd.Series(['first', 'second'])}
When I create a DataFrame from this, it looks like this
df = pd.DataFrame(row)
df
   col1  col2    col3
0     1  blah   first
1     1  blah  second
This is what I want. The scalar values are repeated, which is good. Now, some of my rows have an empty Series for that column, like this:
another_row = {'col1': 45, 'col2':'more blah', 'col3': pd.Series([], dtype='object')}
When I create a new DataFrame in order to concat the two, like so
second_df = pd.DataFrame(another_row)
I get back an empty DataFrame, which is not what I'm looking for.
>>> second_df = pd.DataFrame({'col1': 45, 'col2':'more blah', 'col3': pd.Series([], dtype='object')})
>>> second_df
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
>>>
What I'm actually after is something like this
>>> second_df
   col1       col2 col3
0    45  more blah  NaN
Or something like that. Basically, I don't want the entire row to be dropped on the floor; I want the empty Series to be represented by None or NaN or something similar.
I don't get any errors, and nothing warns me that anything is out of the ordinary, so I have no idea why the DataFrame is behaving like this.
CodePudding user response:
You could pass an index to make it work (and get the dataframe with NaN
in the third column):
another_row = {'col1': 45, 'col2':'more blah', 'col3': pd.Series([], dtype='object')}
second_df = pd.DataFrame(another_row, index=[0])
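For illustration, a minimal sketch of the result (the empty Series has no value at label 0, so reindexing fills col3 with NaN):

import pandas as pd

another_row = {'col1': 45, 'col2': 'more blah', 'col3': pd.Series([], dtype='object')}
second_df = pd.DataFrame(another_row, index=[0])
print(second_df)
#    col1       col2 col3
# 0    45  more blah  NaN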
When passing all scalars and a Series, the number of rows is determined by the length of the Series – if the length is zero, so is the number of rows. You could pass singleton lists instead of scalars so the number of rows is no longer zero:
another_row = {'col1': [45], 'col2': ['more blah'], 'col3': [np.nan]}
second_df = pd.DataFrame(another_row)
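To see that row-count rule in action, here's a minimal check (the column names here are made up for illustration):

import pandas as pd

# The Series length drives the row count when it is mixed with scalars:
print(len(pd.DataFrame({'a': 1, 'b': pd.Series(['x', 'y'])})))          # 2
print(len(pd.DataFrame({'a': 1, 'b': pd.Series([], dtype='object')})))  # 0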
Alternatively, pass all scalars and an index like above,
another_row = {'col1': 45, 'col2': 'more blah', 'col3': np.nan}
second_df = pd.DataFrame(another_row, index=[0])
but I'd probably just do
second_df = pd.DataFrame([[45, 'more blah', np.nan]],
                         columns=['col1', 'col2', 'col3'])
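If the goal is then stitching that row onto the original frame, a quick sketch (rebuilding the two-row df from the question for completeness):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': 1, 'col2': 'blah', 'col3': pd.Series(['first', 'second'])})
second_df = pd.DataFrame([[45, 'more blah', np.nan]],
                         columns=['col1', 'col2', 'col3'])
combined = pd.concat([df, second_df], ignore_index=True)  # three rows, NaN in col3 for the last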
CodePudding user response:
Ultimately, I reworked my code to avoid having this problem. My solution is as follows:
I have a function do_data_stuff() that used to return a pandas Series. I have changed it so that it now returns:
- a Series if there's stuff in it, e.g. Series([1, 2, 3])
- or np.nan if it would be empty
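A minimal sketch of that contract (the internals here are made up for illustration; only the Series-or-np.nan return convention matters):

import numpy as np
import pandas as pd

def do_data_stuff(values):
    # Hypothetical body; the real function does actual work.
    # Contract: return a Series when there is data, np.nan when there isn't.
    series = pd.Series(values, dtype='object')
    return series if len(series) > 0 else np.nan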
A side effect of going with this approach is that the DataFrame constructor requires an index if only scalar values are passed; otherwise it raises "ValueError: If using all scalar values, you must pass an index". So I can't pass a hard-coded index=[0] like above, because when the function returns a Series I want that Series to determine the number of rows automatically.
row = {'col1': 1, 'col2':'blah', 'col3': pd.Series(['first', 'second'])}
df = pd.DataFrame(row)
df
   col1  col2    col3
0     1  blah   first
1     1  blah  second
So what I ended up doing was adding a dynamic index argument. I'm not sure if this is idiomatic Python, but it worked for me.
stuff = do_data_stuff()
data = pd.DataFrame(
    {
        'col1': 45,
        'col2': 'very awesome stuff',
        'col3': stuff
    },
    index=[0] if stuff is np.nan else None
)
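For illustration, here's how that index expression behaves on both branches (the sample values are made up):

import numpy as np
import pandas as pd

# Series branch: index=None, so the Series determines the row count (two rows here).
pd.DataFrame({'col1': 45, 'col2': 'x', 'col3': pd.Series(['a', 'b'])}, index=None)

# nan branch: index=[0] avoids "ValueError: If using all scalar values, you must pass an index".
pd.DataFrame({'col1': 45, 'col2': 'x', 'col3': np.nan}, index=[0])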
And then I concatenated my DataFrames using the following:
data = pd.concat([data, some_other_df], ignore_index=True)
The result was a DataFrame that looks like this
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'col1': 1, 'col2': 'blah', 'col3': pd.Series(['first', 'second'])})
>>> df
   col1  col2    col3
0     1  blah   first
1     1  blah  second
>>> stuff = np.nan
>>> stuff
nan
>>> df = pd.concat([
...     df, pd.DataFrame(
...         {
...             'col1': 45,
...             'col2': 'more awesome stuff',
...             'col3': stuff
...         },
...         index=[0] if stuff is np.nan else None
...     )], ignore_index=True)
>>> df
   col1                col2    col3
0     1                blah   first
1     1                blah  second
2    45  more awesome stuff     NaN
You can replace np.nan with anything, like "".