\documentclass{report}[11pt] \usepackage{Sweave} \usepackage{amsmath} \addtolength{\textwidth}{1in} \addtolength{\oddsidemargin}{-.5in} \setlength{\evensidemargin}{\oddsidemargin} %\VignetteIndexEntry{Counting process and timeline data sets} \SweaveOpts{keep.source=TRUE, fig=FALSE} % Ross Ihaka suggestions \DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} % I had been putting figures in the figures/ directory, but the standard % R build script does not copy it and then R CMD check fails \SweaveOpts{prefix.string=surv,width=6,height=4} \newcommand{\code}[1]{\texttt{#1}} \setkeys{Gin}{width=\textwidth} <>= options(continue=" ", width=70) options(SweaveHooks=list(fig=function() par(mar=c(4.1, 4.1, .3, 1.1)))) pdf.options(pointsize=10) #text in graph about the same as regular text options(contrasts=c("contr.treatment", "contr.poly")) #ensure default library("survival") palette(c("#000000", "#D95F02", "#1B9E77", "#7570B3", "#E7298A", "#66A61E")) @ \title{Counting process and timeline data sets} \author{Terry Therneau} \begin{document} \maketitle \section{Dataset types} \subsection{Counting process} Survival data for the package is supported in four types: \begin{enumerate} \item The simple \code{Surv(time, status)} form, where the status is 0/1 or FALSE/TRUE for alive/dead. \item Variants on the first for left and interval censored data, which are largely used by the parametric survival routine \code{survreg}. \item The counting process form \code{Surv(time1, time2, status)}. \item Timeline form, which is the newest addition and is particularly useful for multistate models \end{enumerate} At it origin, the survfit and coxph routines supported only the simple style, and this is still true for some of the functions in the survival pacakge, e.g., survdiff. The counting process style was originally added solely to support time-dependent covariates. Data sets have a structure like one below, which displays data from the primary biliary cirrhosis study. The counting process data is created from the \code{pbcseq} data using the \code{tmerge} routine. <>= p1 <- subset(pbcseq, !duplicated(id)) pdata <- tmerge(p1[,c(1,3:5)], p1, id=id, death= event(futime, status==2)) pdata <- tmerge(pdata, pbcseq, id=id, bili= tdc(day, bili), ascites=tdc(day, ascites), chol = tdc(day, chol)) pdata$age <- round(pdata$age,1) pdata$death <- 1*pdata$death subset(pdata, id<3, c(id, tstart, tstop, death, age, bili, ascites, chol)) @ For a simple time-dependent data set like this we have the subject id, the time interval (tstart, tstop], the death indicator, a fixed covariate of age at enrollment, followed by time dependent covariates of bilirubin, ascites, and chol count. Each row contains the covariate values that apply over the interval (tstart, tstop], along with the indicator of whether the interval ends in an event (death). The use of open and closed brackets (tstart, tstop] for the intervals is purposefull and necessary. For example, in the PBC data subject 185 dies on day 2882, and subject 2 listed above is in the risk set for that death, but also has a new bilirubin recorded on that day. In a Cox model assessing the relationship of bilirubin and ascites to the risk of death, should the model use the value of 4.2 (row 9) or 3.6 (row 10) as subject 2's bilirubin at that death time? The proper answer is that the code should use the first. The underlying theory for the Cox model, which is based on martingales and counting processes, requires that covariate values used for an event must be known strictly before the time of that event. In analogy to a gambling game, bets must be placed before the dice leave the shooter's hand. It was fairly quickly discovered that this form for the data made it easy to handle time-dependent strata, and within another year that it also allows for multiple events per subject, which was useful in a study of recurrent myocardial infarction. (I forsaw neither when the time1, time2 form was first created). The \code{cdg} data is an example of multiple events of the same type, and use of the Andersen-Gill model for analysis. The status variable can now have a value of 1 multiple times for a subject. Use of a factor as the ``status'' variable extends the \code{Surv(time, status)} form to the case of competing risks and the \code{Surv(time1, time2, status)} form to general multi-state models. The first level of the factor is assumed to be ``censored'', though for a multi-state model ``no transistion to another state occured at the end of this interval'' is the more accurate description. The label associated with this first or censoring level can be anything the user desires, the author uses both ``censored'' and ``none'' fairly commonly. Multistate counting process data sets can get complex, particularly if there are also time-dependent covariates (which is often the case). The code below builds a time-dependent data set for the PBC data with two different state variables: simple alive/dead, and a second for modeling the progression of bilirubin over time. The second corresponds to the state space illustrated below. If subjects were under continuous observation we would expect only the state changes shown by the black arrows. <>= bstate <- c("normal", "bili (1,4]", "bili >4", "death") bmat <- matrix(0,4,4, dimnames= list(bstate, bstate)) bmat[1,2] <- bmat[2,3] <- 1 bmat[2,1] <- bmat[3,2] <- 1 bmat[1:3, 4] <- 1 bmat[1,3] <- 0.75 bmat[3,1] <- 1.5 lty <- (1+ 1*(bmat!=1))[bmat!=0] statefig(rbind(3,1), bmat, offset=.01, acol=c("black", "gray")[lty]) @ Here is the code to create the relevant data using tmerge. Feel free to skip over the code itself and concentrate on the final printout for subjects 1--2. <>= p1 <- subset(pbcseq, !duplicated(id)) pdata <- tmerge(p1[,c(1,3:5)], p1, id=id, death= event(futime, 1*(status==2))) pdata <- tmerge(pdata, pbcseq, id=id, bili= tdc(day, bili), ascites=tdc(day, ascites), chol = tdc(day, chol)) pdata$age <- round(pdata$age,1) bili3 <- cut(pbcseq$bili, c(0, 1, 4, 50), c("normal", "1-4", "4+")) # two 0-1 visits in a row is not a transition b3e <- nostutter(pbcseq$id, as.numeric(bili3)) pdata2 <- tmerge(pdata, pbcseq, id= id, bili3 = tdc(day, bili3), bstate= event(day, b3e)) temp <- with(pdata2, ifelse(death==1, 4*death, as.numeric(bstate) -1L)) pdata2$bstate <- factor(temp, 0:4, c("censor", "normal", "1-4", "4+", "death")) subset(pdata2, id<3, c(id, tstart, tstop, death, age, bili, ascites, chol, bili3, bstate)) @ A subtle and often confusing aspect of the counting process involves covariates versus events. The bilirubin, current state (bili3) and transition (bstate) variables in the data frame all originate from the same values, but the first two apply to the entire interval and the last describes any transitions to a new state at the end of the interval. Subject 1, for instance is in the bilirubin $> 4$ category for both the (0,192] and (192, 400] interval, their status (bevent) variable is censor (no change) at time 192 and death at time 400. Subject 2 moves between the states; the bstate variable looks ahead to the next state and only records changes of state. An important check for any counting process data set is the survcheck routine. <>= survcheck(Surv(tstart, tstop, bstate) ~1, pdata2, id=id, istate=bili3) @ It is important to not just print the table, but carefully \emph{read} it. \begin{itemize} \item We see no transitions from death to another state, including death:death (died twice), nor any follow-up time after death. This is as it should be. \item There were no warning messages about invalid transitions or time intervals. \item Because PBC is a progressive disease we expect more transitions above than below the diagonal in the transitions table. \end{itemize} The counting process style offers great flexibility. Unfortunately, it is easy to create data sets which are not valid. With rare exceptions, the final rows representing a subject should follow a simple rule: the desribed path is physically possible. In particular \begin{enumerate} \item A subject cannot be two places at once (no overlapping intervals) \item They cannot disappear and then return (no disjoint intervals) \item Any state that is entered must be occupied for a positive time interval (no zero length intervals, current state and transitions must be consistent) \end{enumerate} The tmerge routine does a few things to help this process, e.g., the tdc() and event() operators signal changes that should occur at the start and end of an interval, respectively. One particular action of tdc() is to fill in missing values using last value carried forward (LVCF). Looking at the \code{pbcseq} data, for instance, subject 2 has a cholesterol of 302 at baseline, is missing a cholesterol measurement on visits 2--4, and has a new value of 230 at visit 5. In the \code{pdata2} data set above has this been filled in. If this were not done then the standard ``delete all rows with missing values'' action, which is the first step of the coxph or survfit routines, would result in a data set that violated rule 2 above, leading to an invalid analysis. (Or an error message when those routines do an interal call of the survcheck code, see the \code{survcheckallow} argument of \code{coxph.control}). An example where we once allowed a violation of rule 2 was a cancer clinical trial with long follow-up. One patient completely disappeared from the surveillance system for 3 years, then re-appeared at the clinic one day and resumed regular appointments. After some discussion, we assumed that neither death nor progression would have been discovered had they occured in the interim, but that the information post rejoining was useful, so we allowed disjoint intervals of follow up. But this remains a rare case. Below are two Aalen-Johansen fits, the first with bilirubin as a time-dependent covariate, the second with it a state. <>= psurv1 <- survfit(Surv(tstart, tstop, death) ~ bili3, pdata2, id=id) psurv2a <- survfit(Surv(tstart, tstop, bstate) ~ 1, pdata2, id= id, istate= bili3, p0=c(1,0,0,0)) psurv2b <- survfit(Surv(tstart, tstop, bstate) ~ 1, pdata2, id= id, istate= bili3, p0=c(0,1,0,0)) psurv2c <- survfit(Surv(tstart, tstop, bstate) ~ 1, pdata2, id= id, istate= bili3, p0=c(0,0,1,0)) if (FALSE) { #if I show it I have to explain it plot(psurv1, col=1:3, fun= "event", lwd=2, xscale=365.25, xlab= "Years from randomization", ylab="Death") lines(psurv2c[4], col=3, lwd=2, lty=2, conf.int=F) lines(psurv2b[4], col=2, lwd=2, lty=2, conf.int=F) lines(psurv2a[4], col=1, lwd=2, lty=2, conf.int=F) } @ \section{Timeline data} A more recent addition to to package is to allow what we call a \emph{timeline} data sets. Such data has a unique (subject identifier, time) pair that identifies each row of data, along with information that was observed at that time. Time constant covariates such a genotype can appear on all rows, or only when first recorded, the latter obeying the strict ``observed at that time'' rule. For simple survival for instance, the timeline form would have two rows per subject. The first 3 subjects of the lung cancer data set would appear as below (some variables omitted from the printout for width). <>= lung2 <- data.frame(id=1:228, time=0, death=0, lung[, -(2:3)]) temp <- data.frame(id=1:228, time=lung$time, death= lung$status-1) lung2 <- merge(lung2, temp, all=TRUE) subset(lung2, id<4, c(id, time, death, inst, age, sex, ph.ecog, pat.karno)) @ This is clearly not an advantageous approach for either simple survival or competing risks data, which have only one row per subject in the standard form. The advantage comes with multistate data sets: we no longer need to distinguish between covariates and outcomes in the data, and data sets can be created with any number of tools, we are not tied to tmerge. Here is code for the sequential pbc data set earlier. <>= # separate out death, and add it on the end # death yes/no *is* observed every visit temp <- with(subset(pbcseq, !duplicated(id)), data.frame(id=id, day =futime, death= 1*(status==2))) pdata3 <- cbind(pbcseq[, -(2:3)], death=0) pdata3 <- merge(pdata3, temp, all=TRUE) # create a factor for joint outcome temp2 <- as.numeric(cut(pdata3$bili, c(0,1,4, 50))) temp3 <- ifelse(pdata3$death==1, 4, temp2) pdata3$bstate <- factor(temp3, 1:4, c("normal","1-4", "4+", "death")) subset(pdata3, id<3, c(id, day, death, age, bili, ascites, chol, bstate)) psurv3 <- survfit(Surv2(day, death) ~ bstate, pdata3, id= id) ii <- match("call", names(psurv3)) all.equal(unclass(psurv1)[-ii], unclass(psurv3)[-ii]) @ The call is different, but the result of the fit is the same. Internally, the timeline data set is first rewritten into counting process form, and then computations are done using the counting process data. The use of \code{Surv2} instead of \code{Surv} informs the routine that the data is of timeline form. A user version of the transformed data can be obtained using the \code{fromtimeline} function. Such a call may sometimes be necessary, i.e., if one wants to change the arguments of \code{lvcf=TRUE} and \code{repeated=FALSE}, which are the default arguments for the internal call. Footnote: We serious considered making the detection of the timeline form be automatic, i.e., any time there was an id statement, multiple rows per subject, and simple Surv(time, status) form. There are, however, a few exceptions. One example is the diabetic retinopathy study, where each patient has one treated and one untreated eye. There are two rows per subject (eyes) that both start at time 0. A survcheck call will complain about overlapping time intervals, but the following survfit call is perfectly legal, both statistically and in the code. <>= rfit1 <- survfit(Surv(futime, status) ~ trt, id=id, retinopathy) rfit2 <- survfit(Surv(futime, status) ~ trt, cluster=id, retinopathy) @ The two calls give the same result. The help file for the retinopathy data shows the second form, and indeed for all valid cases that we can think of with simple (time, status) data and multiple rows per subject the cluster form is both valid and clearer, However, the current decision is to require Surv2 and not try to auto detect the timeline case, a primary argument being that it makes the code somewhat clearer to a user. \section{Multistate models and missing values} One of the features of the multistate \code{coxph} call is the ability to have certain covariates apply to only a subset of the transitions. Say for instance that one of the states was surgical intervention, such as in the Crohn's disease data set, and that a particular covariate depended on availablity of tissue from that procedure. Then by definition that covariate is only applicable to further transitions after the surgical state. A user might choose to use a particular covariate for only a subset of transitions for multiple reasons, however. Say that a covariate `zed' is only used for the 2:3 transition within a 3 state model. Any observation (time1, time2, status, istate) with istate not equal to 2 will be unaffected by the value of zed, or whether zed is NA. The coxph routine will, properly, not remove this observation from the risk sets for a 2:3 transition due to a missing value for zed. What about a subject with a (time1, time2) row with a current state (istate) of 2 and missing value for zed? This row will be removed due to missing, but other rows of data for this subject are still informative for other transitions. Here is a constructed example using the myeloid data. <>= tdata <- tmerge(myeloid[,1:4], myeloid, id=id, death=event(futime,death), priortx = tdc(txtime), sct= event(txtime)) tdata$event <- factor(with(tdata, sct + 2*death), 0:2, c("censor", "sct", "death")) tdata$sex[tdata$id %in% 273:275] <- NA # obs 425 to 428 tdata$flt3[tdata$id %in% 271:273] <- NA # obs 422 to 425 tdata$event[tdata$id==270 & tdata$tstart>0] <- NA subset(tdata, id %in% 270:275) check1 <- survcheck(Surv(tstart, tstop, event) ~1, tdata, id=id) check1 check2 <- survcheck(Surv(tstart, tstop, event) ~sex, tdata, id=id) fit <- coxph(list(Surv(tstart, tstop, event) ~ trt, 1:3 + 2:3 ~ sex, 1:2 + 2:3 ~ flt3), tdata, id=id) fit check1$transitions- fit$transitions @ The data set has 1009 observations on 646 subjects. Since there is no istate option, all subjects are assumed to start in state 1, which will be labeled as ``(s0)'' on the printout. The result of check1 will have 646/1008, one observation is lost due to the missing event on row 421, but no subjects. The check2 result has 643/1004 since all rows with missing sex are removed. For the coxph model the transition table reflects events that are actually counted for a transition. \begin{itemize} \item One 1:2 transition is lost, obs 427, due to missing flt. Observation 422 is missing sex, but that is not used in the 1:2 model \item One 2:3 transition is lost, obs 423 \item Two 1:3 transitions are lost, obs 425--426 \end{itemize} Three observations are removed from the model frame, 421 which is also removed in check1, 423 which is at risk for only the 2:3 transition, and 425 which has missings for both the 1:2 and 1:3 transitions and is not at risk for 2:3. This drops the number of unique subjects to 645 and the rows to 1006. Internally, the coxph code calls the survcheck function to verify legality of the data, but it does so before removal of any rows due to missing covariates. If the \code{survcheck} routine were called post removal, and some subjects' now had an induced discontinuity in follow-up an error would be signaled; but we consider this to be a false positive. (Reporting is further modified by the \code{survcheckfail} argument of \code{control.coxph}.) Likewise, user calls to survcheck will almost always want to use \verb+~1+ as the right hand side of the formula. \end{document}