Why does pandas.concat() add (), to column name

Time:05-24

I am trying to work out why the column names produced by pandas.concat() are wrapped in parentheses, for example (dec,) instead of dec.

There is a similar question here, but in my context I don't understand how this can be happening. It looks as though there is a double bracket in the assignment, but given that the concatenated dataframe looks fine I cannot understand what is causing it.

The output is below the code.

import warnings
import random
import pandas as pd # dataframe manipulation
import numpy as np # linear algebra
from sklearn.preprocessing import OneHotEncoder
import ssl

ssl._create_default_https_context = ssl._create_unverified_context

url = 'https://raw.githubusercontent.com/bryonbaker/datasets/main/SIT720/Ass4/forestfires.csv'
full_df = pd.read_csv(url)
print(f"{full_df.head()}\n")

ohe = OneHotEncoder(handle_unknown='ignore', drop=None, dtype='int')

transformed = ohe.fit_transform(full_df[['month']])
month_df = pd.DataFrame(transformed.toarray())
month_df.columns = ohe.categories_

print(month_df.head())

full_df = full_df.drop(['month'], axis=1)

result = pd.concat([full_df, month_df], axis=1)
result.head()

The full output is:

   X  Y month  day  FFMC   DMC     DC  ISI  temp  RH  wind  rain  area
0  7  5   mar  fri  86.2  26.2   94.3  5.1   8.2  51   6.7   0.0   0.0
1  7  4   oct  tue  90.6  35.4  669.1  6.7  18.0  33   0.9   0.0   0.0
2  7  4   oct  sat  90.6  43.7  686.9  6.7  14.6  33   1.3   0.0   0.0
3  8  6   mar  fri  91.7  33.3   77.5  9.0   8.3  97   4.0   0.2   0.0
4  8  6   mar  sun  89.3  51.3  102.2  9.6  11.4  99   1.8   0.0   0.0

  apr aug dec feb jan jul jun mar may nov oct sep
0   0   0   0   0   0   0   0   1   0   0   0   0
1   0   0   0   0   0   0   0   0   0   0   1   0
2   0   0   0   0   0   0   0   0   0   0   1   0
3   0   0   0   0   0   0   0   1   0   0   0   0
4   0   0   0   0   0   0   0   1   0   0   0   0
X   Y   day FFMC    DMC DC  ISI temp    RH  wind    ... (dec,)  (feb,)  (jan,)  (jul,)  (jun,)  (mar,)  (may,)  (nov,)  (oct,)  (sep,)
0   7   5   fri 86.2    26.2    94.3    5.1 8.2 51  6.7 ... 0   0   0   0   0   1   0   0   0   0
1   7   4   tue 90.6    35.4    669.1   6.7 18.0    33  0.9 ... 0   0   0   0   0   0   0   0   1   0
2   7   4   sat 90.6    43.7    686.9   6.7 14.6    33  1.3 ... 0   0   0   0   0   0   0   0   1   0
3   8   6   fri 91.7    33.3    77.5    9.0 8.3 97  4.0 ... 0   0   0   0   0   1   0   0   0   0
4   8   6   sun 89.3    51.3    102.2   9.6 11.4    99  1.8 ... 0   0   0   0   0   1   0   0   0   0
5 rows × 24 columns

CodePudding user response:

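ohe.categories_ is a list of arrays, one array per encoded column. When you assign the whole list to month_df.columns, pandas builds a one-level MultiIndex from it, so each column label becomes a one-element tuple such as ('dec',). That label still prints as a plain name in month_df.head(), but it shows up as (dec,) once the frame is concatenated with full_df. Change this line: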

month_df.columns = ohe.categories_

to:

month_df.columns = ohe.categories_[0]
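
To see the difference without downloading the dataset or running the encoder, here is a minimal sketch. The categories list is a hand-made stand-in for ohe.categories_ (an assumption, with only three months), and other_df is a hypothetical two-column replacement for full_df.

```python
import numpy as np
import pandas as pd

# Stand-in for ohe.categories_: a list holding one array per encoded column.
# (Hand-made for illustration; the real encoder returns all twelve months.)
categories = [np.array(["apr", "aug", "dec"], dtype=object)]

month_df = pd.DataFrame([[0, 1, 0], [1, 0, 0]])

# Assigning the whole list: pandas treats a list of arrays as MultiIndex levels.
month_df.columns = categories
wrapped_type = type(month_df.columns).__name__
wrapped_labels = list(month_df.columns)

# Assigning the first (and only) array: plain string labels.
month_df.columns = categories[0]
flat_labels = list(month_df.columns)

# Hypothetical stand-in for full_df after dropping 'month'.
other_df = pd.DataFrame({"X": [7, 7], "Y": [5, 4]})
result = pd.concat([other_df, month_df], axis=1)

print(wrapped_type, wrapped_labels)
print(list(result.columns))
```

The first print shows the index type and tuple labels you get from the original assignment; the second shows that the concatenated frame has ordinary string column names once the array is unwrapped with [0].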