I'm trying to append each row of a DataFrame separately. Each row has Series and scalar values. An example of a row would be:
row = {'col1': 1, 'col2':'blah', 'col3': pd.Series(['first', 'second'])}
When I create a DataFrame from this, it looks like this
df = pd.DataFrame(row)
df
   col1  col2    col3
0     1  blah   first
1     1  blah  second
This is what I want. The scalar values are repeated, which is good. Now, some of my rows have an empty Series for that column, like this:
another_row = {'col1': 45, 'col2':'more blah', 'col3': pd.Series([], dtype='object')}
When I create a new DataFrame in order to concat the two, like so
second_df = pd.DataFrame(another_row)
I get back an empty DataFrame, which is not what I'm looking for.
>>> second_df = pd.DataFrame({'col1': 45, 'col2':'more blah', 'col3': pd.Series([], dtype='object')})
>>> second_df
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
>>>
What I'm actually after is something like this
>>> second_df
   col1       col2 col3
0    45  more blah  NaN
Or something like that. Basically, I don't want the entire row to be dropped on the floor; I want the empty Series to be represented by None or NaN or something similar.
I don't get any errors, and nothing warns me that anything is out of the ordinary, so I have no idea why the DataFrame is behaving like this.
CodePudding user response:
You could pass an index to make it work (and get the dataframe with NaN
in the third column):
another_row = {'col1': 45, 'col2':'more blah', 'col3': pd.Series([], dtype='object')}
second_df = pd.DataFrame(another_row, index=[0])
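For illustration, a minimal sketch of the result (the empty Series has no value at label 0, so reindexing fills col3 with NaN):

import pandas as pd

another_row = {'col1': 45, 'col2': 'more blah', 'col3': pd.Series([], dtype='object')}
second_df = pd.DataFrame(another_row, index=[0])
print(second_df)
#    col1       col2 col3
# 0    45  more blah  NaN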
When passing all scalars and a Series, the number of rows is determined by the length of the Series – if the length is zero, so is the number of rows. You could pass singleton lists instead of scalars so the number of rows is no longer zero:
another_row = {'col1': [45], 'col2': ['more blah'], 'col3': [np.nan]}
second_df = pd.DataFrame(another_row)
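To see that row-count rule in action, here's a minimal check (the column names here are made up for illustration):

import pandas as pd

# The Series length drives the row count when it is mixed with scalars:
print(len(pd.DataFrame({'a': 1, 'b': pd.Series(['x', 'y'])})))          # 2
print(len(pd.DataFrame({'a': 1, 'b': pd.Series([], dtype='object')})))  # 0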
Alternatively, pass all scalars and an index like above,
another_row = {'col1': 45, 'col2': 'more blah', 'col3': np.nan}
second_df = pd.DataFrame(another_row, index=[0])
but I'd probably just do
second_df = pd.DataFrame([[45, 'more blah', np.nan]],
                         columns=['col1', 'col2', 'col3'])
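If the goal is then stitching that row onto the original frame, a quick sketch (rebuilding the two-row df from the question for completeness):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': 1, 'col2': 'blah', 'col3': pd.Series(['first', 'second'])})
second_df = pd.DataFrame([[45, 'more blah', np.nan]],
                         columns=['col1', 'col2', 'col3'])
combined = pd.concat([df, second_df], ignore_index=True)  # three rows, NaN in col3 for the last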
CodePudding user response:
Ultimately, I reworked my code to avoid having this problem. My solution is as follows:
I have a function do_data_stuff() that used to return a pandas Series. I have changed it so that it now returns:
- a Series if there's stuff in it, e.g. Series([1, 2, 3])
- or np.nan if it would be empty
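A minimal sketch of that contract (the internals here are made up for illustration; only the Series-or-np.nan return convention matters):

import numpy as np
import pandas as pd

def do_data_stuff(values):
    # Hypothetical body; the real function does actual work.
    # Contract: return a Series when there is data, np.nan when there isn't.
    series = pd.Series(values, dtype='object')
    return series if len(series) > 0 else np.nan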
A side effect of going with this approach is that the DataFrame constructor requires an index if only scalar values are passed; otherwise it raises "ValueError: If using all scalar values, you must pass an index". So I can't pass a hard-coded index=[0] like above, because when the function returns a Series I want that Series to determine the number of rows automatically.
row = {'col1': 1, 'col2':'blah', 'col3': pd.Series(['first', 'second'])}
df = pd.DataFrame(row)
df
   col1  col2    col3
0     1  blah   first
1     1  blah  second
So what I ended up doing was adding a dynamic index argument. I'm not sure if this is idiomatic Python, but it worked for me.
stuff = do_data_stuff()
data = pd.DataFrame(
    {
        'col1': 45,
        'col2': 'very awesome stuff',
        'col3': stuff
    },
    index=[0] if stuff is np.nan else None
)
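For illustration, here's how that index expression behaves on both branches (the sample values are made up):

import numpy as np
import pandas as pd

# Series branch: index=None, so the Series determines the row count (two rows here).
pd.DataFrame({'col1': 45, 'col2': 'x', 'col3': pd.Series(['a', 'b'])}, index=None)

# nan branch: index=[0] avoids "ValueError: If using all scalar values, you must pass an index".
pd.DataFrame({'col1': 45, 'col2': 'x', 'col3': np.nan}, index=[0])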
And then I concatenated my DataFrames using the following:
data = pd.concat([data, some_other_df], ignore_index=True)
The result was a DataFrame that looks like this
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'col1': 1, 'col2': 'blah', 'col3': pd.Series(['first', 'second'])})
>>> df
   col1  col2    col3
0     1  blah   first
1     1  blah  second
>>> stuff = np.nan
>>> stuff
nan
>>> df = pd.concat([
...     df, pd.DataFrame(
...         {
...             'col1': 45,
...             'col2': 'more awesome stuff',
...             'col3': stuff
...         },
...         index=[0] if stuff is np.nan else None
...     )], ignore_index=True)
>>> df
   col1                col2    col3
0     1                blah   first
1     1                blah  second
2    45  more awesome stuff     NaN
You can replace np.nan with anything, like "".