I'm having an issue with a function that extracts the main statistics from a DataFrame: median, std, kurtosis, etc.
It keeps returning NaN for every statistic, and I can't figure out why. My code is below:
import pandas as pd

df = pd.read_excel("file.xlsx")

def estatistics_from_df(df):
    df_stats = pd.DataFrame()
    df_stats['Colunas'] = df.columns
    df_stats['Tipos'] = df.dtypes
    df_stats['Count'] = df.count()
    df_stats['Unique'] = df.nunique()
    df_stats['Nulos'] = df.isnull().sum()
    df_stats['Min'] = df.min()
    df_stats['Max'] = df.max()
    df_stats['Mean'] = df.mean()
    df_stats['Median'] = df.median()
    df_stats['Std'] = df.std()
    df_stats['Variance'] = df.var()
    df_stats['Kurtosis'] = df.kurtosis()
    df_stats['Skewness'] = df.skew()
    df_stats['Entropy'] = df.nunique()
    df_stats['Missing'] = df.isnull().sum()
    return df_stats

df_stats = estatistics_from_df(df)
The final dataframe is:
   Colunas  Tipos  Count  Unique  Nulos  Min  Max  Mean  Median  Std  Variance  Kurtosis  Skewness  Entropy  Missing
0       ID    NaN    NaN     NaN    NaN  NaN  NaN   NaN     NaN  NaN       NaN       NaN       NaN      NaN      NaN
1      Min    NaN    NaN     NaN    NaN  NaN  NaN   NaN     NaN  NaN       NaN       NaN       NaN      NaN      NaN
2      End    NaN    NaN     NaN    NaN  NaN  NaN   NaN     NaN  NaN       NaN       NaN       NaN      NaN      NaN
The original dataframe is:
    ID         Min         End
0   46  2020-01-31  2021-07-15
1  115  2020-09-05  2020-11-25
2  126  2021-10-04  2022-10-03
3  327  2021-07-24  2023-05-27
4  375  2021-06-10  2021-06-17
Answer:
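Each of your statistics methods returns a Series whose index is the column names of the DataFrame. Because df_stats starts out empty, the first assignment (df.columns) gives it a default 0, 1, 2 index, and none of the later Series share those labels, so pandas aligns them to nothing and fills every cell with NaN. You should therefore set the index to the column names in the first line of your function.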
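To see the alignment problem in isolation, here is a minimal sketch on a small made-up frame (the Score column and its values are invented for illustration, not taken from your file). It builds the summary both ways and shows that only the label-indexed version keeps the counts:

```python
import pandas as pd

# Hypothetical toy data, only for demonstrating index alignment.
df = pd.DataFrame({"ID": [46, 115, 126], "Score": [1.5, 2.5, 3.5]})

# Starting from an empty frame: the first assignment creates a 0..n-1 index.
misaligned = pd.DataFrame()
misaligned["Colunas"] = df.columns   # index becomes 0, 1
misaligned["Count"] = df.count()     # Series indexed by "ID", "Score" -> no match

# Starting with the column names as the index: labels match on assignment.
aligned = pd.DataFrame(index=df.columns)
aligned["Colunas"] = df.columns
aligned["Count"] = df.count()

print(misaligned)
print(aligned)
```

The first frame shows NaN in Count, while the second shows the real counts, which is the same thing that happens to every column in your function.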
Try:
def estatistics_from_df(df):
    # Index by column name so every Series assigned below aligns by label
    df_stats = pd.DataFrame(index=df.columns)
    df_stats['Colunas'] = df.columns
    df_stats['Tipos'] = df.dtypes
    df_stats['Count'] = df.count()
    df_stats['Unique'] = df.nunique()
    df_stats['Nulos'] = df.isnull().sum()
    df_stats['Min'] = df.min()
    df_stats['Max'] = df.max()
    df_stats['Mean'] = df.mean()
    df_stats['Median'] = df.median()
    df_stats['Std'] = df.std()
    df_stats['Variance'] = df.var()
    df_stats['Kurtosis'] = df.kurtosis()
    df_stats['Skewness'] = df.skew()
    df_stats['Entropy'] = df.nunique()
    df_stats['Missing'] = df.isnull().sum()
    # Drop the column-name index to get back the 0, 1, 2 row labels
    return df_stats.reset_index(drop=True)
>>> estatistics_from_df(df)
   Colunas           Tipos  Count  Unique  Nulos                  Min  \
0       ID           int64      5       5      0                   46
1      Min  datetime64[ns]      5       5      0  2020-01-31 00:00:00
2      End  datetime64[ns]      5       5      0  2020-11-25 00:00:00

                   Max   Mean  Median                          Std  Variance  \
0                  375  197.8   126.0                   144.175934   20786.7
1  2021-10-04 00:00:00    NaN     NaN  256 days 13:26:52.018691364       NaN
2  2023-05-27 00:00:00    NaN     NaN  376 days 07:10:06.345801880       NaN

    Kurtosis  Skewness  Entropy  Missing
0  -2.592627  0.456711        5        0
1        NaN       NaN        5        0
2        NaN       NaN        5        0
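One caveat: the output above relies on pandas silently skipping the datetime columns for statistics that don't apply to them. As far as I know, more recent pandas releases (2.0 and later) no longer do that by default, and calls such as var() or kurtosis() may raise a TypeError on datetime columns instead. A possible workaround, sketched below under that assumption, is to pass numeric_only=True to the numeric reductions. The helper name numeric_stats is made up for this example, and the frame is rebuilt by hand from the sample rows in the question:

```python
import pandas as pd

def numeric_stats(df):
    """Numeric-only statistics; non-numeric columns are left as NaN."""
    stats = pd.DataFrame(index=df.columns)
    # numeric_only=True restricts each reduction to the numeric columns,
    # so the resulting Series simply has no entry for the datetime columns.
    stats["Mean"] = df.mean(numeric_only=True)
    stats["Median"] = df.median(numeric_only=True)
    stats["Std"] = df.std(numeric_only=True)
    stats["Variance"] = df.var(numeric_only=True)
    stats["Kurtosis"] = df.kurtosis(numeric_only=True)
    stats["Skewness"] = df.skew(numeric_only=True)
    return stats

# Sample rows copied from the question.
df = pd.DataFrame({
    "ID": [46, 115, 126, 327, 375],
    "Min": pd.to_datetime(["2020-01-31", "2020-09-05", "2021-10-04",
                           "2021-07-24", "2021-06-10"]),
    "End": pd.to_datetime(["2021-07-15", "2020-11-25", "2022-10-03",
                           "2023-05-27", "2021-06-17"]),
})

result = numeric_stats(df)
print(result)
```

This keeps the column names as the index rather than resetting it, which makes it easy to look up a single value with result.loc["ID", "Mean"]. Note that datetime columns get NaN for Mean, Median and Std here, even where pandas could compute a value for them.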