Home > Enterprise >  Create a dataframe with columns and their unique values in pandas
Create a dataframe with columns and their unique values in pandas

Time:11-11

I have tried looking for a way to create a dataframe of columns and their unique values. I know this has less use cases but would be a great way to get an initial idea of unique values. It would look something like this....

State County City
Colorado Denver Denver
Colorado El Paso Colorado Springs
Colorado Larimar Fort Collins
Colorado Larimar Loveland

Turns into this...

State County City
Colorado Denver Denver
El Paso Colorado Springs
Larimar Fort Collins
Loveland

CodePudding user response:

I would use mask and a lambda

df.mask(df.apply(lambda x : x.duplicated(keep='first'))).fillna('')

      State   County              City
0  Colorado   Denver            Denver
1            El Paso  Colorado Springs
2            Larimar      Fort Collins
3                             Loveland

CodePudding user response:

This is the best solution I have come up with, hope to help others looking for something like it!

def create_unique_df(df) -> pd.DataFrame:
    """ take a dataframe and creates a new one containing unique values for each column
    note, it only works for two columns or more

    :param df: dataframe you want see unique values for
    :param type: pandas.DataFrame
    return: dataframe of columns with unique values
    """
    # using list() allows us to combine lists down the line
    data_series = df.apply(lambda x: list( x.unique() ) )

    list_df = data_series.to_frame()

    # to create a df from lists they all neet to be the same leng. so we can append null 
    # values
    # to lists and make them the same length. First find differenc in length of longest list and
    # the rest
    list_df['needed_nulls'] = list_df[0].str.len().max() - list_df[0].str.len()

    # Second create a column of lists with one None value
    list_df['null_list_placeholder'] = [[None] for _ in range(list_df.shape[0])]

    # Third multiply the null list times the difference to get a list we can add to the list of
    # unique values making all the lists the same length. Example: [None] * 3  == [None, None, 
    # None]
    list_df['null_list_needed'] = list_df.null_list_placeholder * list_df.needed_nulls
    list_df['full_list'] = list_df[0]   list_df.null_list_needed

    unique_df = pd.DataFrame(
        list_df['full_list'].to_dict()
    )

    return unique_df

CodePudding user response:

Original dataframe. Would be nice if next time you ask a question, you give us reproducible code we can work with to help answer your question. Here, I do that for you, but just know for next time you ask a question.

import pandas as pd

df = pd.DataFrame({
    'State': ['Colorado', 'Colorado', 'Colorado', 'Colorado'], 
    'County': ['Denver', 'El Paso', 'Larimar', 'Larimar'],
    'City': ['Denver', 'Colorado Springs', 'Fort Collins', 'Loveland']
})
df

    State     County   City
0   Colorado  Denver   Denver
1   Colorado  El Paso  Colorado Springs
2   Colorado  Larimar  Fort Collins
3   Colorado  Larimar  Loveland

Drop duplicates from each column separately and then concatenate. Fill NaN with empty string.

pd.concat([
    df.State.drop_duplicates(),
    df.County.drop_duplicates(),
    df.City.drop_duplicates()
], axis=1).fillna('')

Output:

    State     County   City
0   Colorado  Denver   Denver
1             El Paso  Colorado Springs
2             Larimar  Fort Collins
3                      Loveland
  • Related