I am trying to work out why the column names in the result of pandas.concat() appear in brackets.
There is a similar question here, but in my context I don't understand how this can be happening. It is as if there were a double bracket in the assignment, but given that the dataframe being concatenated looks fine I cannot understand what is causing it.
The output is below the code.
import warnings
import random
import pandas as pd # dataframe manipulation
import numpy as np # linear algebra
from sklearn.preprocessing import OneHotEncoder
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
url = 'https://raw.githubusercontent.com/bryonbaker/datasets/main/SIT720/Ass4/forestfires.csv'
full_df = pd.read_csv(url)
print(f"{full_df.head()}\n")
ohe = OneHotEncoder(handle_unknown='ignore', drop=None, dtype='int')
transformed = ohe.fit_transform(full_df[['month']])
month_df = pd.DataFrame(transformed.toarray())
month_df.columns = ohe.categories_
print(month_df.head())
full_df = full_df.drop(['month'], axis=1)
result = pd.concat([full_df, month_df], axis=1)
result.head()
The full output is:
X Y month day FFMC DMC DC ISI temp RH wind rain area
0 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0.0
1 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0.0
2 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0.0
3 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0.0
4 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 0.0
apr aug dec feb jan jul jun mar may nov oct sep
0 0 0 0 0 0 0 0 1 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 1 0
2 0 0 0 0 0 0 0 0 0 0 1 0
3 0 0 0 0 0 0 0 1 0 0 0 0
4 0 0 0 0 0 0 0 1 0 0 0 0
X Y day FFMC DMC DC ISI temp RH wind ... (dec,) (feb,) (jan,) (jul,) (jun,) (mar,) (may,) (nov,) (oct,) (sep,)
0 7 5 fri 86.2 26.2 94.3 5.1 8.2 51 6.7 ... 0 0 0 0 0 1 0 0 0 0
1 7 4 tue 90.6 35.4 669.1 6.7 18.0 33 0.9 ... 0 0 0 0 0 0 0 0 1 0
2 7 4 sat 90.6 43.7 686.9 6.7 14.6 33 1.3 ... 0 0 0 0 0 0 0 0 1 0
3 8 6 fri 91.7 33.3 77.5 9.0 8.3 97 4.0 ... 0 0 0 0 0 1 0 0 0 0
4 8 6 sun 89.3 51.3 102.2 9.6 11.4 99 1.8 ... 0 0 0 0 0 1 0 0 0 0
5 rows × 24 columns
CodePudding user response:
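ohe.categories_ is a list of arrays, one array per encoded feature. When you assign the whole list as column names, pandas builds a single-level MultiIndex, so each name becomes a one-element tuple such as ('apr',). month_df still prints normally on its own, but after the concat the tuples show up as (apr,), (aug,) and so on. Change this line: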
month_df.columns = ohe.categories_
to:
month_df.columns = ohe.categories_[0]
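To see the difference without downloading the dataset, here is a minimal sketch. It assumes a stand-in list named categories that has the same shape as ohe.categories_ (a list holding one array), plus a small hand-made one-hot matrix; the names categories, encoded, left, bad and good are only for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

# Stand-in for ohe.categories_: a list holding one array per encoded feature.
categories = [np.array(["mar", "oct"], dtype=object)]

# Stand-in for transformed.toarray(): one-hot rows for mar, oct, oct.
encoded = np.array([[1, 0], [0, 1], [0, 1]])

# Stand-in for the remaining columns of full_df.
left = pd.DataFrame({"X": [7, 7, 7], "Y": [5, 4, 4]})

# Assigning the whole list makes pandas build a single-level MultiIndex.
bad = pd.DataFrame(encoded)
bad.columns = categories
bad_result = pd.concat([left, bad], axis=1)

# Taking the first (and only) array gives plain string labels.
good = pd.DataFrame(encoded, columns=categories[0])
good_result = pd.concat([left, good], axis=1)

print(type(bad.columns).__name__)   # index type created by the list assignment
print(list(bad_result.columns))     # month labels come out as tuples
print(list(good_result.columns))    # month labels come out as plain strings
```

If you encode more than one column at once, ohe.categories_ will hold one array per column, so indexing with [0] only covers the first of them.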