Pandas DataFrame defaults to NaN values when columns are specified within initialization, but not wh

Recently I ran into a surprising behavior of pandas DataFrames that I am trying to understand.

If we use the following code:

import pandas as pd

col_headers = ['aaa', 'aaa.1', 'aaab', 'aaac', 'aaac.1', 'aaac.2']
data = ['aaa', 'aaa', 'aaab', 'aaac', 'aaac', 'aaac']
df = pd.DataFrame(data=[pd.Series(data)], dtype=object)
df.columns = col_headers

we get the desired result: a single row holding the values under the given column headers.

However, if we specify the columns directly in the DataFrame initialization like this:

col_headers = ['aaa', 'aaa.1', 'aaab', 'aaac', 'aaac.1', 'aaac.2']
data = ['aaa', 'aaa', 'aaab', 'aaac', 'aaac', 'aaac']
df = pd.DataFrame(data=[pd.Series(data)], columns = col_headers, dtype=object)

we get a DataFrame in which every value is NaN.

What causes this to happen?

CodePudding user response:

If you pass the columns parameter to pd.DataFrame, you have to supply the data as a plain list rather than a Series. A pd.Series carries index labels (0 to 5 by default here), and pandas treats them as the column labels of the input data. When you also pass columns, pandas looks up each requested column name among those index labels; since none of them is found, it inserts NaN for every column.

df = pd.DataFrame(data = [data], columns = col_headers, dtype=object)
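
To see why the labels matter, the sketch below reindexes the Series against col_headers by hand. This is only an analogy for the alignment described above, not a claim about the exact internal call pandas makes; the names s, aligned, df_nan and df_ok are made up for illustration.

```python
import pandas as pd

col_headers = ['aaa', 'aaa.1', 'aaab', 'aaac', 'aaac.1', 'aaac.2']
data = ['aaa', 'aaa', 'aaab', 'aaac', 'aaac', 'aaac']

s = pd.Series(data)  # index is 0..5, not the header strings

# Looking up the header strings in an index of 0..5 finds nothing,
# so every entry of the reindexed Series is missing.
aligned = s.reindex(col_headers)

# Passing the Series together with columns= gives the same all-NaN row.
df_nan = pd.DataFrame(data=[s], columns=col_headers, dtype=object)

# Dropping the index first (tolist) makes the values purely positional.
df_ok = pd.DataFrame(data=[s.tolist()], columns=col_headers, dtype=object)
```

If you already hold a Series rather than the original list, s.tolist() is one way to get back to label-free data before building the frame.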

From the pandas documentation for the data parameter: data can be an ndarray, Iterable, dict, or DataFrame. A dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion order. If a dict contains Series which have an index defined, it is aligned by its index.

When you use a Series, your data carries labels (its index), as shown below:

[pd.Series(data)]
[0     aaa
1     aaa
2    aaab
3    aaac
4    aaac
5    aaac
dtype: object]

As the documentation for the columns parameter puts it: if data contains column labels, it will perform column selection instead.
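
A small sketch of what column selection means here, assuming the same data list as in the question: asking for only some of the Series labels keeps just those positions. The name subset is hypothetical.

```python
import pandas as pd

data = ['aaa', 'aaa', 'aaab', 'aaac', 'aaac', 'aaac']

# The Series labels are 0..5; columns= picks out labels 0, 2 and 5 only.
subset = pd.DataFrame(data=[pd.Series(data)], columns=[0, 2, 5], dtype=object)
```

The resulting frame has three columns, which suggests that columns acts as a lookup into the existing labels rather than as a renaming step.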

As a proof of concept:

Renaming col_headers

You can name the column headers with the integers 0 to 5, which match the default index of the Series, and construct the DataFrame the same way as before. The row is then filled in correctly.

col_headers = [0,1,2,3,4,5]
df = pd.DataFrame(data=[pd.Series(data)], columns = col_headers, dtype=object)
df

Output :

     0    1     2     3     4     5
0  aaa  aaa  aaab  aaac  aaac  aaac

Renaming Pandas Series Index

Similarly, it works when you set the index of the Series to your column names.

data_series = pd.Series(data)
index_ = ['aaa', 'aaa.1', 'aaab', 'aaac', 'aaac.1', 'aaac.2']
data_series.index = index_
df = pd.DataFrame(data=[data_series], columns=col_headers, dtype=object)

Output :

   aaa aaa.1  aaab  aaac aaac.1 aaac.2
0  aaa   aaa  aaab  aaac   aaac   aaac
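
The same idea can be written more compactly by passing index= when the Series is created. This is a sketch of an equivalent spelling, not something taken from the answer above; labelled and df_labelled are illustrative names.

```python
import pandas as pd

col_headers = ['aaa', 'aaa.1', 'aaab', 'aaac', 'aaac.1', 'aaac.2']
data = ['aaa', 'aaa', 'aaab', 'aaac', 'aaac', 'aaac']

# Label the values up front so they already match the requested columns.
labelled = pd.Series(data, index=col_headers)
df_labelled = pd.DataFrame(data=[labelled], columns=col_headers, dtype=object)
```

Because the Series labels and the requested columns now agree, the lookup finds every column and no NaN is inserted.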