Multiple datasets and fit line

Time:02-21

I have different datasets:

Df1 
X  Y
1  1
2  5
3  14
4  36
5  90

Df2

X  Y
1  1
2  5
3  21
4  38
5  67

Df3

X  Y
1  1
2  5
3  10
4  50
5  78

I would like to fit a line to this data (like a regression) and plot everything in one chart. The x axis is time and the y axis is the frequency of an event. Any help on how to determine the line and plot the results while keeping a separate legend entry for each dataset would be appreciated (seaborn or matplotlib are both fine).

What I have done so far is plotting the three lines as follows:

import pandas as pd
import matplotlib.pyplot as plt

# dataset_list, x_lists and y_lists are the three lists described below
plot_df = pd.DataFrame(list(zip(dataset_list, x_lists, y_lists)),
               columns=['Dataset', 'X', 'Y']).set_index('Dataset')

# explode the list-valued X and Y cells into one row per point
plot_df = plot_df.apply(pd.Series.explode).reset_index()

# plot
fig, ax = plt.subplots(figsize=(10,8))

for name, group in plot_df.groupby('Dataset'):
    group.plot(x="X", y="Y", ax=ax, label=name)

Please note that the three lists at the beginning (dataset_list, x_lists, y_lists) hold the names, X values and Y values of the three DataFrames.
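For reference, here is a minimal sketch of what those three lists could look like for the data above. The values are copied from Df1, Df2 and Df3. The final cast to float is an addition of this sketch, on the assumption that numeric columns are wanted for plotting and fitting, since explode leaves them with object dtype.

```python
import pandas as pd

# Values copied from Df1, Df2 and Df3 above; list names match the snippet.
dataset_list = ["Df1", "Df2", "Df3"]
x_lists = [[1, 2, 3, 4, 5]] * 3
y_lists = [
    [1, 5, 14, 36, 90],
    [1, 5, 21, 38, 67],
    [1, 5, 10, 50, 78],
]

plot_df = pd.DataFrame(
    list(zip(dataset_list, x_lists, y_lists)),
    columns=["Dataset", "X", "Y"],
).set_index("Dataset")

# Turn each list-valued cell into one row per (X, Y) point.
plot_df = plot_df.apply(pd.Series.explode).reset_index()

# Assumption: cast back to numbers, because explode leaves object dtype.
plot_df = plot_df.astype({"X": float, "Y": float})

print(plot_df.head())
```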

CodePudding user response:

I recommend using linregress from scipy.stats, as it gives very readable code. You just need to add the fit to your loop:

from scipy.stats import linregress

for name, group in plot_df.groupby('Dataset'):
    group.plot(x = "X", y= "Y", ax=ax, label=name)

    # fit a line to the data
    fit = linregress(group.X, group.Y)

    # evaluate the fitted line y = slope * x + intercept at the observed X values
    ax.plot(group.X, group.X * fit.slope + fit.intercept, label=f'{name} fit')
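
Putting it together, here is a self-contained sketch using the numbers from the question. The marker and line styles, the axis labels, the output file name, and the extra pooled fit (one line through all three datasets at once, in case that is what "a line which fits this data" means) are assumptions of this sketch, not part of the original code.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import linregress

# Data copied from the question, already in long format (one row per point).
x = [1, 2, 3, 4, 5]
data = {
    "Df1": [1, 5, 14, 36, 90],
    "Df2": [1, 5, 21, 38, 67],
    "Df3": [1, 5, 10, 50, 78],
}
plot_df = pd.concat(
    [pd.DataFrame({"Dataset": name, "X": x, "Y": y}) for name, y in data.items()],
    ignore_index=True,
)

fig, ax = plt.subplots(figsize=(10, 8))

fits = {}
for name, group in plot_df.groupby("Dataset"):
    # Raw points as markers so they stay distinct from the fitted lines.
    ax.plot(group["X"], group["Y"], "o", label=name)

    # Ordinary least-squares line for this dataset only.
    fit = linregress(group["X"], group["Y"])
    fits[name] = fit
    ax.plot(group["X"], fit.slope * group["X"] + fit.intercept,
            "--", label=f"{name} fit")

# Assumption: a single line through all points pooled together.
pooled = linregress(plot_df["X"], plot_df["Y"])
xs = np.array(x)
ax.plot(xs, pooled.slope * xs + pooled.intercept, "k-", label="pooled fit")

ax.set_xlabel("Time")
ax.set_ylabel("Frequency")
ax.legend()
fig.savefig("fits.png")  # hypothetical file name; use plt.show() interactively
```

Each entry in fits also carries rvalue, pvalue and stderr if you want to report how good the linear fit is.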