stat_ecdf without reaching 1 (right censored data)-CodePudding

I'm using ggplot2's stat_ecdf to plot the cumulative distribution of times to event. However, some events are right-censored (not yet occurred). So, I want the cumulative distribution to not reach 1. As described in the documentation, na.rm option is basically useless. Do I have to code my own ecdf...?

MWE

The problem is that stat_ecdf includes no option to treat NA as censored, so the distribution always reaches 1, including if the x-limits are adjusted (which seems fully incorrect).

library('ggplot2')
set.seed(0)
x = rbinom(8,20,.5) # true all data
# x[x>10] = NA # "censoring" as NA - same problem
g = ggplot(data.frame(x=x),aes(x=x))  
  stat_ecdf(na.rm=TRUE)   # na.rm does nothing
  lims(x=c(0,10)) # "censoring" as xlim - cdf is re-scaled to the visible data!?
  # lims(x=c(0,20)) # no "censoring" - correct
print(g)

CodePudding user response：

Here is one solution using after_stat, but it has several limitations. It works by re-scaling y values by the number of data points within the layer vs the number expected. It solves both the NA and lims problems.

MWE

library('ggplot2')
set.seed(0)
N = 8
x = rbinom(N,20,.5)   runif(N,-1e-9, 1e-9) # add noise - need a better solution...
x[x>10] = NA # "censoring" as NA
g = ggplot(data.frame(x=x),aes(x=x))  
  stat_ecdf(aes(y=after_stat(
    unlist(lapply(split(y,list(group,PANEL)),function(y){ # for groups, facets
      y * (length(y)-2) / N # core solution
    })) # for groups, facets
  )))  
  lims(x=c(0,10)) # "censoring" as xlim
  # lims(x=c(0,20)) # no "censoring"
# print(layer_data(last_plot())) # DEBUG helper
print(g)

Limitations

Internally, y already has NAs removed, and only includes data for unique values of x. As a result...

We need to know N from outside the scope of where after_stat is evaluated. This becomes a pain if N is different per group / facet.
Duplicate values reduce the length of y but not due to NA. My solution for now is to add noise to the x data (runif) before the plot, but this is obviously ugly.
Solution assumes that pad = TRUE (adds -Inf, Inf to the data), which is why we use length(y)-2 not length(y), but you can adjust for your case.

Thanks

To this answer for mentioning layer_data(last_plot()), which made this solution much easier to develop.

CodePudding user response：

The empirical cumulative distribution function will apply to the data that you provide. I don't think it can create a distribution for unknown data or that do not exist yet.

Right now you have 8 data points. How many more data points would there be until the event? If you knew that there would be 20 data points in total, and assuming that the new values are all lower or higher than what you already have, then you could estimate what your ecdf would look like. If that is the case, you can re-scale your problem and instead of reaching 1, it would reach 1/20*8 = 0.4. There are different workarounds to either re-scale your y-axis or fitted data.

But if you do not know how many more data points there are until the event, you can't just decide where you currently are in the cumulative distribution function. You know it shouldn't be 1, but where would it be then?