Home > Blockchain >  How to loop through data-frame columns to obtain statistics
How to loop through data-frame columns to obtain statistics

Time:10-24

I need a way to do the following:

  • I want to be able to loop through each column in a data-frame that is of the dtype == object
  • As I am looping through each desired column, I am obtaining the following summary statistics:
    • The name of the attribute
    • The number of unique values each attribute has
    • The most frequently occurring value of each attribute
    • The number of occurrences of the most frequently occurring value
    • The percentage of the total number of values the most frequently occurring value contributes

Say I have the following data:

import pandas as pd

data = {"Sex":["M", "M", "F", "F", "F", "F", "M", "F", "F",  "M"],
    "Product": ["X", "Y", "Y", "Z","Y", "Y", "Y", "X", "Z", "Y"],
    "Answer":["Yes", "Yes", "Yes", "No",  "No", "Yes", "Yes", "No",  "Yes", "Yes"],
    "Numeric":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Binary":[True, True, False, False, False, True, False, True, True, False]}

df = pd.DataFrame(data)

I need some way to do the following (I know the code isn't correct - hopefully my pseudo-code conveys what I mean)

for column in df:
    if df[column].dtype != object:
        continue
    else:
        #pseudo-code begins
        attributes = df[column].name #attribute names
        unique_values = df[column].unique() #number of unique values
        most_frequent = df[column].mode() #most frequently occurring
        occurrences = df[column][most_frequent].nunique() #occurrences of the mode
        proportions = df[column][most_frequent].value_counts(normalize = True) #to get 
        #proportions

I then need some way to tabulate all this information into some kind of summary statistics table

summaryStats = tabluate([attributes, unique_values, most_frequent, occrrences, proportions])

The resulting output should look something like this

| Attribute | Unique Values | Mode | Occurrences | % of Total |
|----------- --------------- ------ ------------- ------------|
| Sex          F, M,          F         6             60%     |
| Product      X, Y, Z        Y         7             70%     |
| ...            ...          ...      ...           ...      |

#...and so on and so forth for the other attributes

I am essentially creating a summary table for discrete data.

Any help at all would be much appreciated :)

CodePudding user response:

Try this:

data = []

# Loop through columns of dtype `object`
for col in df.select_dtypes("object"):
    # Do a count of each unique value in the column
    freq = df[col].value_counts()

    data.append({
        "Attribute": col,
        # The unique values are the index of the frequency count
        "Unique Values": ", ".join(freq.index.sort_values()),
        # The frequency dataframe is sorted by default, with the most-frequent
        # value appearing first
        "Mode": freq.index[0],
        "Occurrence": freq[0],
        "% of Total": freq[0] / freq.sum() * 100
    })

summary = pd.DataFrame(data)

CodePudding user response:

An empty dataframe is created with columns and indexes. The Attribute column is immediately populated with a list. A list aaa is created to substitute its values into loc (line indexes are indicated on the left, column names are on the right). You can read about loc (explicit indexing) here.

The line that yields 'Unique Values' uses ','.join to concatenate the list into a string. In 'Occurrences' through isin we get a Boolean mask, with which we get all unique values and count. In '% of Total' we extract the first element since the data is sorted and multiply by 100. Also, you should have 6 in Product Occurrences, of course, and % of Total will be different.

df1 = pd.DataFrame(index=[0, 1], columns=['Attribute', 'Unique Values', 'Mode', 'Occurrences', '% of Total'])
df1['Attribute'] = ['Sex', 'Product']
aaa = ['Sex', 'Product']

for i in range(len(aaa)):
    df1.loc[i, 'Unique Values'] = ','.join(df[aaa[i]].unique())
    mod = df[aaa[i]].mode()
    df1.loc[i, 'Mode'] = mod[0]
    df1.loc[i, 'Occurrences'] = df.loc[df[aaa[i]].isin(mod), aaa[i]].count()
    df1.loc[i, '% of Total'] = df[aaa[i]].value_counts(normalize=True)[0] * 100
    

Output

  Attribute Unique Values Mode Occurrences % of Total
0       Sex           M,F    F           6       60.0
1   Product         X,Y,Z    Y           6       60.0
  • Related