How to remove zeros in dataframe after being created from dictionary?

Time:04-12

I have this dictionary with descriptive statistics of the data:

import pandas as pd


def summary_table(df):
    """
    Return a summary table with the descriptive statistics about the dataframe.
    """

    summary = {
        "Number of Days": [len(df)],
        "Missing Cells": [df.isnull().sum().sum()],
        "Missing Cells (%)": [round(df.isnull().sum().sum() / df.shape[0] * 100, 2)],
        "Duplicated Rows": [df.duplicated().sum()],
        "Duplicated Rows (%)": [round(df.duplicated().sum() / df.shape[0] * 100, 2)],
        "Length of Categorical Variables": [len([i for i in df.columns if df[i].dtype == object])],
        "Length of Numerical Variables": [len([i for i in df.columns if df[i].dtype != object])]
    }
    print(summary.items())
    df = pd.DataFrame(summary.items(), columns=['Description', 'Value'])
    df = df.applymap(lambda x: x[0] if isinstance(x, list) else x)
    return df

df=pd.read_csv('test.csv')
df2=summary_table(df)
print(df2)

and this creates the output:

dict_items([('Number of Days', [434]), ('Missing Cells', [108]), ('Missing Cells (%)', [24.88]), ('Duplicated Rows', [0]), ('Duplicated Rows (%)', [0.0]), ('Length of Categorical Variables', [1]), ('Length of Numerical Variables', [11])])
                       Description   Value
0                   Number of Days  434.00
1                    Missing Cells  108.00
2                Missing Cells (%)   24.88
3                  Duplicated Rows    0.00
4              Duplicated Rows (%)    0.00
5  Length of Categorical Variables    1.00
6    Length of Numerical Variables   11.00

When I print the dictionary items, the values have no trailing zeros. However, the dataframe shows every value with two decimals (for example 434.00 instead of 434), which is confusing. How can I remove these extra zeros when converting the dictionary to a dataframe?

Answer:

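The Value column mixes ints and floats, so pandas stores it as float64 and prints every value with the same number of decimals. Use an object dtype to keep mixed ints and floats, and don't wrap the values in lists: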

def summary_table(df):
    """
    Return a summary table with the descriptive statistics about the dataframe.
    """
    nulls = df.isnull().sum().sum()
    dups = df.duplicated().sum()
    summary = {
        "Number of Days": len(df),
        "Missing Cells": nulls,
        "Missing Cells (%)": round(nulls / df.shape[0] * 100, 2),
        "Duplicated Rows": dups,
        "Duplicated Rows (%)": round(dups / df.shape[0] * 100, 2),
        "Length of Categorical Variables": len([i for i in df.columns if df[i].dtype == object]),
        "Length of Numerical Variables": len([i for i in df.columns if df[i].dtype != object])
    }
    df = pd.DataFrame(summary.items(), columns=['Description', 'Value'], dtype=object)
    return df

Example:

print(summary_table(df))
                       Description Value
0                   Number of Days     8
1                    Missing Cells     0
2                Missing Cells (%)   0.0
3                  Duplicated Rows     0
4              Duplicated Rows (%)   0.0
5  Length of Categorical Variables     2
6    Length of Numerical Variables     1
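
To see where the zeros come from, here is a minimal sketch comparing the default dtype inference with dtype=object. The values in the dictionary are illustrative stand-ins, not results computed from the real test.csv:

```python
import pandas as pd

# Illustrative stand-in values (assumed for this demo, not read from test.csv).
summary = {"Number of Days": 434, "Missing Cells (%)": 24.88, "Duplicated Rows": 0}
rows = list(summary.items())

# Default inference: a column mixing ints and floats is upcast to float64,
# so the ints are displayed with decimals.
inferred = pd.DataFrame(rows, columns=["Description", "Value"])

# dtype=object keeps each Python value as it was (int stays int).
as_object = pd.DataFrame(rows, columns=["Description", "Value"], dtype=object)

print(inferred["Value"].dtype)
print(as_object["Value"].dtype)
print(as_object)
```

Keep in mind that an object column is meant for display here: numeric operations on it are slower, so it is probably best to convert only the final summary table this way.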

You could further improve your code by computing each indicator once and reusing it, instead of recomputing it for the percentage.

For instance:

nulls = df.isnull().sum().sum()
...
        "Missing Cells": [nulls],
        "Missing Cells (%)": [round(nulls / df.shape[0] * 100, 2)],
...