Null dataframe in statistic function

Time:10-27

I'm having an issue with a function that extracts the main statistics from a dataframe: median, std, kurtosis, etc.

Every statistic comes back as NaN, and I can't figure out why. My code is below:

import pandas as pd

df = pd.read_excel("file.xlsx")

def estatistics_from_df(df):
    df_stats = pd.DataFrame()
    df_stats['Colunas'] = df.columns
    df_stats['Tipos'] = df.dtypes
    df_stats['Count'] = df.count()
    df_stats['Unique'] = df.nunique()
    df_stats['Nulos'] = df.isnull().sum()
    df_stats['Min'] = df.min()
    df_stats['Max'] = df.max()
    df_stats['Mean'] = df.mean()
    df_stats['Median'] = df.median()
    df_stats['Std'] = df.std()
    df_stats['Variance'] = df.var()
    df_stats['Kurtosis'] = df.kurtosis()
    df_stats['Skewness'] = df.skew()
    df_stats['Entropy'] = df.nunique()
    df_stats['Missing'] = df.isnull().sum()
    return df_stats

df_stats = estatistics_from_df(df)

The final dataframe is:

  Colunas Tipos  Count  Unique  Nulos  Min  Max  Mean  Median  Std  Variance  Kurtosis  Skewness  Entropy  Missing
0      ID   NaN    NaN     NaN    NaN  NaN  NaN   NaN     NaN  NaN       NaN       NaN       NaN      NaN      NaN
1     Min   NaN    NaN     NaN    NaN  NaN  NaN   NaN     NaN  NaN       NaN       NaN       NaN      NaN      NaN
2     End   NaN    NaN     NaN    NaN  NaN  NaN   NaN     NaN  NaN       NaN       NaN       NaN      NaN      NaN

The original dataframe is:

    ID        Min        End
0   46 2020-01-31 2021-07-15
1  115 2020-09-05 2020-11-25
2  126 2021-10-04 2022-10-03
3  327 2021-07-24 2023-05-27
4  375 2021-06-10 2021-06-17

CodePudding user response:

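Each of your statistics functions returns a Series whose index is the column names of the DataFrame. In your version, assigning df.columns to the empty df_stats first gives it a default integer index (0, 1, 2), so every later assignment is aligned against labels that do not exist there and is filled with NaN. You should therefore set the index to df.columns in the first line of your function, so the assignments line up.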
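To make the alignment issue concrete, here is a minimal sketch on a small made-up frame (the column names and values are only for illustration). It shows the same assignment producing NaN with a default integer index and real values once the index is the column names:

```python
import pandas as pd

# Hypothetical two-column frame, used only to illustrate index alignment
df = pd.DataFrame({"ID": [46, 115, 126], "Score": [1.5, 2.5, 3.5]})

# Starting from an empty frame: the first assignment creates the index 0, 1
broken = pd.DataFrame()
broken["Colunas"] = df.columns
# df.count() is indexed by "ID" and "Score"; neither label exists in 0, 1
broken["Count"] = df.count()

# Starting from a frame indexed by the column names: the labels match
fixed = pd.DataFrame(index=df.columns)
fixed["Count"] = df.count()

print(broken)
print(fixed)
```

In the first frame the Count column is all NaN; in the second it holds the actual counts.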

Try:

def estatistics_from_df(df):
    df_stats = pd.DataFrame(index=df.columns)
    df_stats['Colunas'] = df.columns
    df_stats['Tipos'] = df.dtypes
    df_stats['Count'] = df.count()
    df_stats['Unique'] = df.nunique()
    df_stats['Nulos'] = df.isnull().sum()
    df_stats['Min'] = df.min()
    df_stats['Max'] = df.max()
    df_stats['Mean'] = df.mean()
    df_stats['Median'] = df.median()
    df_stats['Std'] = df.std()
    df_stats['Variance'] = df.var()
    df_stats['Kurtosis'] = df.kurtosis()
    df_stats['Skewness'] = df.skew()
    df_stats['Entropy'] = df.nunique()
    df_stats['Missing'] = df.isnull().sum()
    return df_stats.reset_index(drop=True)

>>> estatistics_from_df(df)

  Colunas           Tipos  Count  Unique  Nulos                  Min  \
0      ID           int64      5       5      0                   46   
1     Min  datetime64[ns]      5       5      0  2020-01-31 00:00:00   
2     End  datetime64[ns]      5       5      0  2020-11-25 00:00:00   

                   Max   Mean  Median                          Std  Variance  \
0                  375  197.8   126.0                   144.175934   20786.7   
1  2021-10-04 00:00:00    NaN     NaN  256 days 13:26:52.018691364       NaN   
2  2023-05-27 00:00:00    NaN     NaN  376 days 07:10:06.345801880       NaN   

   Kurtosis  Skewness  Entropy  Missing  
0 -2.592627  0.456711        5        0  
1       NaN       NaN        5        0  
2       NaN       NaN        5        0  
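
One caveat: the output above relies on pandas skipping the datetime columns for statistics such as variance and kurtosis. On recent pandas versions (2.0 and later, as far as I know), some of these reductions may raise a TypeError on datetime columns instead. A possible variant, sketched below, computes the numeric statistics only on the numeric columns and lets index alignment fill NaN for the rest. The function name stats_numeric_safe is made up for this example, and it covers only a subset of the statistics above:

```python
import pandas as pd

def stats_numeric_safe(df):
    # Hypothetical helper: restrict numeric reductions to numeric columns
    numeric = df.select_dtypes(include="number")

    out = pd.DataFrame(index=df.columns)
    out["Tipos"] = df.dtypes.astype(str)
    out["Count"] = df.count()
    out["Unique"] = df.nunique()
    out["Nulos"] = df.isnull().sum()
    # These Series are indexed by the numeric column names only;
    # alignment leaves NaN for the non-numeric columns
    out["Mean"] = numeric.mean()
    out["Median"] = numeric.median()
    out["Std"] = numeric.std()
    out["Variance"] = numeric.var()
    out["Kurtosis"] = numeric.kurtosis()
    out["Skewness"] = numeric.skew()
    # Turn the index into the "Colunas" column
    return out.rename_axis("Colunas").reset_index()

# Same data as in the question
df = pd.DataFrame({
    "ID": [46, 115, 126, 327, 375],
    "Min": pd.to_datetime(["2020-01-31", "2020-09-05", "2021-10-04",
                           "2021-07-24", "2021-06-10"]),
    "End": pd.to_datetime(["2021-07-15", "2020-11-25", "2022-10-03",
                           "2023-05-27", "2021-06-17"]),
})
result = stats_numeric_safe(df)
print(result)
```

Min and Max are left out here because they work on datetime columns as well and can be added back with df.min() and df.max() as in the function above.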