I have this dictionary with descriptive statistics of the data:
import pandas as pd
def summary_table(df):
    """
    Return a summary table with the descriptive statistics about the dataframe.
    """
    summary = {
        "Number of Days": [len(df)],
        "Missing Cells": [df.isnull().sum().sum()],
        "Missing Cells (%)": [round(df.isnull().sum().sum() / df.shape[0] * 100, 2)],
        "Duplicated Rows": [df.duplicated().sum()],
        "Duplicated Rows (%)": [round(df.duplicated().sum() / df.shape[0] * 100, 2)],
        "Length of Categorical Variables": [len([i for i in df.columns if df[i].dtype == object])],
        "Length of Numerical Variables": [len([i for i in df.columns if df[i].dtype != object])]
    }
    print(summary.items())
    df = pd.DataFrame(summary.items(), columns=['Description', 'Value'])
    df = df.applymap(lambda x: x[0] if isinstance(x, list) else x)
    return df
df=pd.read_csv('test.csv')
df2=summary_table(df)
print(df2)
This produces the following output:
dict_items([('Number of Days', [434]), ('Missing Cells', [108]), ('Missing Cells (%)', [24.88]), ('Duplicated Rows', [0]), ('Duplicated Rows (%)', [0.0]), ('Length of Categorical Variables', [1]), ('Length of Numerical Variables', [11])])
                       Description   Value
0                   Number of Days  434.00
1                    Missing Cells  108.00
2                Missing Cells (%)   24.88
3                  Duplicated Rows    0.00
4              Duplicated Rows (%)    0.00
5  Length of Categorical Variables    1.00
6    Length of Numerical Variables   11.00
When I print the dictionary items, the values have no trailing zeros. In the dataframe, however, every value gains trailing zeros (434 becomes 434.00), which is confusing. How can I convert the dictionary to a dataframe without adding these extra zeros?
CodePudding user response:
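Use an object dtype so the column can hold both ints and floats; otherwise pandas converts the whole column to float64, which is where the trailing zeros come from. Also, don't wrap the values in lists: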
def summary_table(df):
    """
    Return a summary table with the descriptive statistics about the dataframe.
    """
    nulls = df.isnull().sum().sum()
    dups = df.duplicated().sum()
    summary = {
        "Number of Days": len(df),
        "Missing Cells": nulls,
        "Missing Cells (%)": round(nulls / df.shape[0] * 100, 2),
        "Duplicated Rows": dups,
        "Duplicated Rows (%)": round(dups / df.shape[0] * 100, 2),
        "Length of Categorical Variables": len([i for i in df.columns if df[i].dtype == object]),
        "Length of Numerical Variables": len([i for i in df.columns if df[i].dtype != object])
    }
    df = pd.DataFrame(summary.items(), columns=['Description', 'Value'], dtype=object)
    return df
Example:
print(summary_table(df))
                       Description  Value
0                   Number of Days      8
1                    Missing Cells      0
2                Missing Cells (%)    0.0
3                  Duplicated Rows      0
4              Duplicated Rows (%)    0.0
5  Length of Categorical Variables      2
6    Length of Numerical Variables      1
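To see why the dtype matters, here is a minimal sketch comparing default dtype inference with an explicit object dtype. The list `values` is a hypothetical stand-in for a few of the summary statistics, not the real data:

```python
import pandas as pd

# Hypothetical values standing in for some of the summary statistics.
values = [434, 108, 24.88, 0]

# Default inference: a single float makes the whole column float64,
# so the integers are displayed with a decimal part.
inferred = pd.Series(values)

# Explicit object dtype: each cell keeps its original Python type.
as_object = pd.Series(values, dtype=object)

print(inferred.dtype)      # float64
print(as_object.dtype)     # object
print(as_object.tolist())  # [434, 108, 24.88, 0]
```

The trade-off is that an object column is no longer numeric as far as pandas is concerned, so this is best suited to a table that is only meant to be displayed.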
You could further improve your code by computing each shared value only once instead of repeating the same expression.
For instance:
nulls = df.isnull().sum().sum()
...
"Missing Cells": nulls,
"Missing Cells (%)": round(nulls / df.shape[0] * 100, 2),
...
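As a rough, self-contained check of that idea, the sketch below computes the null count once on a small made-up frame (`toy` is hypothetical, not the data from the question) and reuses it for both entries:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 missing cells in total.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

# Count the missing cells once and reuse the result.
nulls = toy.isnull().sum().sum()

summary = {
    "Missing Cells": nulls,
    # Same formula as in the question: divides by the number of rows.
    "Missing Cells (%)": round(nulls / toy.shape[0] * 100, 2),
}
print(summary)
```

Note that this keeps the formula from the question, which divides by the row count. If the percentage is meant as a share of all cells, dividing by `toy.size` would be the assumption to use instead.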