Plot CDF of 4 columns from a CSV file using pandas

Time:11-01

I want to plot the CDF of four columns from a CSV file, which I load with pandas as follows:

df = pd.read_csv(r'pathfile.csv')

Each of the four columns contains timestamps in milliseconds. I have tried some code, but it does not produce the correct plot. Is there an easy way to do this?

import pandas as pd

df = pd.read_csv(r'pathfile.csv')

def compute_distrib(df, col):
    # Count how often each distinct value occurs in the column
    stats_df = df.groupby(col)[col].agg('count')\
                 .pipe(pd.DataFrame).rename(columns={col: 'frequency'})
    
    # PDF
    stats_df['pdf'] = stats_df['frequency'] / stats_df['frequency'].sum()
    
    # CDF
    stats_df['CDF'] = stats_df['pdf'].cumsum()
    
    # modifications
    stats_df = stats_df.reset_index()\
                       .rename(columns={col:"X"})
    stats_df[" "] = col
    return stats_df

cdf = []
for col in ['1','2','3','4']: # column name in csv file
    cdf.append(compute_distrib(df, col))
cdf = pd.concat(cdf, ignore_index=True)

import seaborn as sns

sns.lineplot(x=cdf["X"],
             y=cdf["CDF"],
             hue=cdf[" "]);

[Figure: CDF plot]
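As a cross-check on the groupby approach, here is a minimal sketch that computes the empirical CDF directly by sorting each column. The helper names `ecdf` and `plot_ecdfs` and the synthetic data are assumptions made for illustration; they do not come from the original post, and the random frame only stands in for the CSV, whose column names '1' to '4' are taken from the code above.

```python
import numpy as np
import pandas as pd


def ecdf(values):
    """Return the sorted samples and, for each one, the fraction of samples <= it."""
    x = np.sort(np.asarray(values, dtype=float))
    x = x[~np.isnan(x)]                      # ignore missing values
    y = np.arange(1, len(x) + 1) / len(x)    # 1/n, 2/n, ..., 1
    return x, y


def plot_ecdfs(frame, columns):
    """Draw one step curve per column (hypothetical helper, needs matplotlib)."""
    import matplotlib.pyplot as plt          # imported here so ecdf() works without it

    fig, ax = plt.subplots()
    for col in columns:
        x, y = ecdf(frame[col])
        ax.step(x, y, where="post", label=col)
    ax.set_xlabel("X")
    ax.set_ylabel("CDF")
    ax.legend()
    return ax


# Synthetic stand-in for the CSV: four columns of made-up millisecond values.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.exponential(scale=100.0, size=(500, 4)),
                  columns=['1', '2', '3', '4'])

curves = {col: ecdf(df[col]) for col in df.columns}
```

Calling `plot_ecdfs(df, ['1', '2', '3', '4'])` should draw the four curves on one axis. Every curve should end at 1.0, which is a quick sanity check when a plot looks wrong.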

To use your own data, just replace the # GENERATE EXAMPLE DATA section with df = pd.read_csv('path/to/sheet.csv').

Let me know if anything in the example is unclear to you or if it needs more explanation.

CodePudding user response:

Shouldn't it be df.groupby("column").agg({'column': pd.Series.count})?
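One way to settle a question like this is to check the counting step in isolation on a tiny frame. The sketch below compares the groupby form used in the answer with `Series.value_counts`, which is swapped in here as an independent reference and is not the form suggested in the comment. The sample values are made up.

```python
import pandas as pd

# Hypothetical sample: 10 occurs once, 20 twice, 30 three times.
df = pd.DataFrame({"1": [10, 20, 20, 30, 30, 30]})
col = "1"

# Form used in the answer: per-value frequency via groupby.
by_groupby = df.groupby(col)[col].agg("count")

# Independent reference: value_counts, sorted by value instead of by frequency.
by_value_counts = df[col].value_counts().sort_index()

print(by_groupby.tolist(), by_value_counts.tolist())
```

If the two lists match on your own column, the frequency step is probably not the source of a wrong plot.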
