I need a way to do the following:
- I want to be able to loop through each column in a data-frame that is of the
dtype == object
- As I am looping through each desired column, I am obtaining the following summary statistics:
- The name of the attribute
- The number of unique values each attribute has
- The most frequently occurring value of each attribute
- The number of occurrences of the most frequently occurring value
- The percentage of the total number of values the most frequently occurring value contributes
Say I have the following data:
import pandas as pd
data = {"Sex":["M", "M", "F", "F", "F", "F", "M", "F", "F", "M"],
"Product": ["X", "Y", "Y", "Z","Y", "Y", "Y", "X", "Z", "Y"],
"Answer":["Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "Yes"],
"Numeric":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"Binary":[True, True, False, False, False, True, False, True, True, False]}
df = pd.DataFrame(data)
I need some way to do the following (I know the code isn't correct - hopefully my pseudo-code conveys what I mean)
for column in df:
if df[column].dtype != object:
continue
else:
#pseudo-code begins
attributes = df[column].name #attribute names
unique_values = df[column].unique() #number of unique values
most_frequent = df[column].mode() #most frequently occurring
occurrences = df[column][most_frequent].nunique() #occurrences of the mode
proportions = df[column][most_frequent].value_counts(normalize = True) #to get
#proportions
I then need some way to tabulate all this information into some kind of summary statistics table
summaryStats = tabluate([attributes, unique_values, most_frequent, occrrences, proportions])
The resulting output should look something like this
| Attribute | Unique Values | Mode | Occurrences | % of Total |
|----------- --------------- ------ ------------- ------------|
| Sex F, M, F 6 60% |
| Product X, Y, Z Y 7 70% |
| ... ... ... ... ... |
#...and so on and so forth for the other attributes
I am essentially creating a summary table for discrete data.
Any help at all would be much appreciated :)
CodePudding user response:
Try this:
data = []
# Loop through columns of dtype `object`
for col in df.select_dtypes("object"):
# Do a count of each unique value in the column
freq = df[col].value_counts()
data.append({
"Attribute": col,
# The unique values are the index of the frequency count
"Unique Values": ", ".join(freq.index.sort_values()),
# The frequency dataframe is sorted by default, with the most-frequent
# value appearing first
"Mode": freq.index[0],
"Occurrence": freq[0],
"% of Total": freq[0] / freq.sum() * 100
})
summary = pd.DataFrame(data)
CodePudding user response:
An empty dataframe is created with columns and indexes. The Attribute column is immediately populated with a list. A list aaa is created to substitute its values into loc (line indexes are indicated on the left, column names are on the right). You can read about loc (explicit indexing) here.
The line that yields 'Unique Values' uses ','.join to concatenate the list into a string. In 'Occurrences' through isin we get a Boolean mask, with which we get all unique values and count. In '% of Total' we extract the first element since the data is sorted and multiply by 100. Also, you should have 6 in Product Occurrences, of course, and % of Total will be different.
df1 = pd.DataFrame(index=[0, 1], columns=['Attribute', 'Unique Values', 'Mode', 'Occurrences', '% of Total'])
df1['Attribute'] = ['Sex', 'Product']
aaa = ['Sex', 'Product']
for i in range(len(aaa)):
df1.loc[i, 'Unique Values'] = ','.join(df[aaa[i]].unique())
mod = df[aaa[i]].mode()
df1.loc[i, 'Mode'] = mod[0]
df1.loc[i, 'Occurrences'] = df.loc[df[aaa[i]].isin(mod), aaa[i]].count()
df1.loc[i, '% of Total'] = df[aaa[i]].value_counts(normalize=True)[0] * 100
Output
Attribute Unique Values Mode Occurrences % of Total
0 Sex M,F F 6 60.0
1 Product X,Y,Z Y 6 60.0