\documentclass{article}
\usepackage{amsmath}
\usepackage{Sweave}
\addtolength{\textwidth}{1in}
\addtolength{\oddsidemargin}{-.5in}
\setlength{\evensidemargin}{\oddsidemargin}
\newcommand{\code}[1]{\texttt{#1}}

%\VignetteIndexEntry{Using Time Dependent Covariates}

\title{Using Time Dependent Covariates and Time Dependent Coefficients 
  in the Cox Model}
\author{Terry Therneau \and Cynthia Crowson \and Elizabeth Atkinson}
\date{May 2026}
\begin{document}
\maketitle

\SweaveOpts{prefix.string=compete,width=6,height=4}
\setkeys{Gin}{width=\textwidth}
\SweaveOpts{keep.source=TRUE}

<<preamble, echo=F>>=
options(width=60, continue=" ")
makefig <- function(file, top=1, right=1, left=4) {
    pdf(file, width=9.5, height=7, pointsize=18)
    par(mar=c(4, left, top, right) +.1)
    }
library(survival)
library(splines)
@ 

\section{Introduction}
This vignette covers 3  different but interrelated concepts:
\begin{itemize}
  \item An introduction to time dependent covariates, along with some
    of the most common mistakes.
  \item Tools for creating time-dependent covariates, or rather the
    data sets used to encode them.
  \item Time dependent coefficients.
\end{itemize}

Suppose that we want to model a medical process over time, the progression
of a chronic liver disease for instance, and that the patients in the study
have multiple visits over the time period with several covariates measured at
each visit.
It is straightforward to write down a prediction model that uses patient
information at enrollment.
One of the stengths of the Cox proportional hazard model is that it can
(properly) also use the values that are updated over time, something
known as time-dependent covariates.
If I am being seen by my physician I would like him/her to use the most
recent values, after all, not just those at inception of disease.

One of the common forms to represent this evolving data is to use
a \emph{counting process} data set.
Consider a subject with follow-up from time 0 to death at 185 days, 
and assume that we have a time dependent covariate (creatinine) that was 
measured at day 0, 90 and 120 with values of .9, 1.5, and 1.2 mg/dl.
The data for two subjects might look like the following
<<testdata, echo=FALSE, eval=TRUE>>=
tdata <- data.frame(subject=c("Smith", "Smith", "Smith", "Jones", "Jones"), 
                    time1=  c(0,90, 120, 0, 52),
                    time2 = c(90, 120, 185, 52, 90), death=c(0,0,0,0,1),
                    creatinine=c(0.9, 1.5, 1.2, 1.3, 1.8))
tdata
@ 
We read this as stating that over the interval from 0 to 90 the creatinine
for subject Smith is taken to be 0.9 (last known level), 
and that this interval did not end in a death. 
In fact, Smith was still alive at their last contact on day 185,
while Jones died on day 90.
The underlying code treats intervals as open on the left and closed on the
right, e.g. Smith's creatinine on exactly day 90 is 0.9; 
This is the value used when computing the Cox partial likelihood term 
associated with Jone's death on day 90. 


\section{Time dependent covariates and immortal time bias}
\subsection{Background}
One of the strengths of the Cox model is its ability to encompass covariates
that change over time. 
The practical reason is the form of the Cox parial likelihood, which has
a term for each death (or event) time.
$$
 PL = \prod{i \in D} \frac{e^{X_i(t_i) \beta}}
          {\sum_j Y_j(t_i) e^{X_j(t_i)\beta}}
$$
In the above $D$ is the set of events, $t_i$ is the time of the $i$th event
(which occured for subject $i$), $\beta$ the vector of coefficients, 
$X_j(t)$ the covariates for subject $j$ at time $t$, and
$Y_j(t)$ a 0/1 indicator that subject $j$ was at risk (under observation)
at time $t$.

One can think of it as a lottery model, where at 
each death time there is a drawing to decide which subject ``wins'' the
event.  Each subject's risk score $\exp(X\beta)$
determines how likely they are to win,
e.g., how many ``tickets'' they have purchased for the drawing.
The model tries to assign a risk score to each subject that best predicts
the outcome of each drawing based on 
\begin{itemize}
  \item The risk set: which subjects are present for each event; the set of
    those able to ``win the prize''.
  \item The covariate values of each subject just prior to the event time.
\end{itemize}

The model has a theoretical foundation in counting process theory
and martingales,
mathematical constructs which arose out of the study of games of chance.
Key elements of that notation are 
\begin{itemize}
  \item $Y_i(t)$: whether subject $i$ is at risk for an event, at time $t$
  \item $N_i(t)$: the cumulative count of events for subject $i$
  \item $X_i(t)$: predictive covariats for subject $i$, at time $t$
\end{itemize}

A key underlying condition for valid inference on a counting process is that
none of $Y$, $N$ or $X$ depend on the future, or more formally that each
is a predictable process. 
actions depend only on the past. The decision of whether to play
If this holds then multiple properties can be proven about the
resulting process, including the statistical aspects important to us.
The key rule for time dependent covariates in a Cox model is
essentially the same as that for gambling: \emph{you cannot look
into the future}.  The size of the wager ($X(t)$) or the decision on
whether to participate in a given spin of the roulette wheel may
may change in any way based on past data or outcomes, but it
may not reach forward in time.

We refer to the violation of this rule, for any of $X$, $Y$ or $N$ as
\emph{immortal time bias} (ITB), though the term has most often been associate
with violations of the covariates $X$.
There are myriad ways to break this rule, and we will only list a few
of the more common ones below as illustration.
The results of ITB can range from a large negative bias to a large positive
one and anything in between, the common outcome is that you can't trust
the results.

\subsection{Analysis by treatment response}
One of the more common manifestations of this error is analysis by response.
At the end of a trial a survival curve is made comparing those who had an
early response to treatment (shrinkage of tumor, lowering of cholesterol,
or whatever) to those who did not, and it discovered that responders have
a better curve.  
A Cox model fit to the same data will demonstrate a strong 
``significant'' effect.
The problem arises because any early deaths, those that occur before response 
can be assessed,
will all be assigned to the non-responder group, even deaths that have
nothing to do with the condition under study.
Below is a simple example based on the advanced lung cancer data set.
Assume that subjects came in monthly for a total of 12 cycles of treatment,
and randomly declare a ``response'' for 5\% of the subjects at each
visit. Below is a curve showing the results from a random simulation using
follow-up time and survival from the lung study.
<<fake, fig=TRUE>>=
set.seed(1953)  # a good year
nvisit <- floor(pmin(lung$time/30.5, 12))
response <- rbinom(nrow(lung), nvisit, .05) > 0
badfit <- survfit(Surv(time/365.25, status) ~ response, data=lung)
plot(badfit, mark.time=FALSE, lty=1:2, 
     xlab="Years post diagnosis", ylab="Survival")
legend(1.5, .85, c("Responders", "Non-responders"), 
       lty=2:1, bty='n')
@ 
What is most surprising about this error is the \emph{size} of the
false effect that is produced.
A Cox model using the above data reports a hazard ratio of 1.9 fold
with a p-value of less than 1 in 1000.

The alarm about this incorrect approach has been sounded often
\cite{Anderson83, Buyse96, Suissa08} but the analysis is routinely 
re-discovered.

\subsection{Analysis by dose}
A slightly subtler form of the error is discussed in Redmond et al.
\cite{Redmond83}.  The exploration was motivated by a flawed analysis presented
in Bonadonna et al. \cite{Bonadonna81} in a top medical journal.
When treating breast cancer, chemotherapy dosages are reduced for patients
who have severe toxicity, e.g., too low a white blood cell count, and the
question being asked was whether these reductions have a negative impact
on response and/or survival. To that end
breast cancer chemotherapy patients were
divided into three groups based on whether the patient eventually
received $>85$\%, 65--85\% or $<65$\% of the dose planned at the start of
their treatment. 
Since early deaths do not finish all their cycles of chemotherapy and hence
by definition finish with lower fraction, the result is guarranteed.
Early death which predicts a low total dose, and not vice-versa.
A proportional hazards model using total dose received showed a very strong
effect for dose, so much so that it could encourage a treating physician to
defer necessary dose reductions in response to treatment toxicity.
This false result actively harmed patients.

Redmond also looked at a further variant of this: 
create a variable $p$ for each subject which is the fraction of the
the target dose, received \emph{up to} the last entry for that subject.
A subject who died after receiving only 6 weeks of a planned 12 week regimen
could still score 100\%.
This looks like it should cure the bias issue,
but it turns out to give bias in the other direction.  The reason
is that dose reductions due to toxicity occur more often in the later cycles
of treatment, and thus living longer leads to smaller values of $p$.
A proportional hazards regression fit to $p$ implies that a smaller dose is
protective!
The proper approach is to code the predictor as a time-dependent covariate,
using a (time1, time2) counting process data set, with cumulative dose
to date, either total or the fraction expected to date, as the covariate.

\subsection{Interpolated values}
Say that a liver disease subject has 5 visits at 0, 1, \ldots 4 years, 
with bilirubin values of .9, 1.1, NA, 2.5, and 3, i.e.; the value at visit 3
is missing.
A common impulse is to replace the missing value with the linear interpolation
of 1.8, but this is again a case of ITB since it make use of future data;
the value of 2.5 has not yet been seen.
A variant of this would be multiple imputation that used all covariates at
all times.
What is valid would be to extrapolate as .9, 1.1, 1.3.
A good way to think of this is to imagine filling in the missing value on
the day it arose, using all the information available at that time.


\subsection{Cases versus controls}
A fairly common report in early studies of a condition is to compare all those
subjects who failed within 3 years to those who survived at least 3
years without failure, labeled as ``cases'' and ``controls'' (replace 3 with
some value appropriate to the condition). 
The problem are those who joined the study or aquired the condition 2 years
ago: severe cases who failed within a year will get counted as a case, those
who are still alive at 2 years are not counted in either group.
The result is both an overestimate of the probability of death within
3 years and biased estimates of effect: the earliest deaths, which presumably
have the most extreme predictor values, are overrepresented.
A variant of this with regard to cancer death rates was commented on over 75
years ago \cite{Berkson50}.

\subsection{Future vs future tabulation}
In a recent analysis from the Mayo Clinic
study of aging (MCSA), the primary interest is the future risks of
mild cognitive impairment (MCI), dementia, and death.
The paper starts out with a table comparing baseline covariates for those
who never progress to MCI versus those who ever did, and likewise baseline
covariates versus death.
there is also a table of baseline covariates versus survival.
Both of these are fine: if you think in terms of an R formula they
could be written with future outcome on the left hand side of the formula
and current information on the right.  
A table that compared the survival of those who did or did not progress
to MCI, however, would be invalid. 
It corresponds to a model with a future occurrence on
both sides of the equation.
Nevertheless, the table 1 of many research papers contain such a comparison.

\subsection{Summary}
There are many further variations on this error: removing subjects who do
not finish the treatment plan, imputing the date of an adverse event as midway
between observation times, requiring two sequential visits with high blood
sugar to declare diabetes but then assigned the date of diabetes to the earlier
of the two, \ldots. 
Using future data will often generate large positive or negative bias in the 
coefficients, but sometimes it generates little bias at all.
It is nearly impossible to predict a priori which of these will occur in any
given data set.
Using such a covariate is similar to crossing a freeway on foot:
disaster is not guaranteed --- but it is likely.

\section{Creating data sets}
The survival package can accomdate three forms of input: simple, 
``counting process'' and ``timeline'' data sets.
For simple survival the data has one line per subject containing their
covariates, time to event, and a status code for censored or dead.  
The latter is normally 0/1 or FALSE/TRUE for censored/event, respectively.
(As a footnote, 1/2 for censored/event is still allowed, 
a historical artifcact dating back to the earliest data sets the authors used.
In the 1960s and early 1970s
much of the data used by the Mayo Clinic section of Biostatistics was still
on punch cards, and a FORTRAN program reading numeric data from punch
cards did not distinguish 0 from blank. Hence '0' was never used as a code,
a convention and habit that outlived the use of cards.)

\subsection{Counting process form}
The counting process form has was added to the package in 1986, as a way to
deal with time dependent covariates.
For example the data set \code{mydata} might have the form
\begin{center}
  \begin{tabular}{ccccccc}
    subject  & time1 & time2 & status & age & creatinine & \ldots \\ \hline
    1 & 0 & 15 & 0 & 25 & 1.3 \\
    1 & 15& 46 & 0 & 25 & 1.5 \\
    1 & 46& 73 & 0 & 25 & 1.4 \\
    1 & 73& 100& 1 & 25 & 1.6 \\
\end{tabular}
\end{center}
In this case the variable \code{age} = age at entry to the study stays the
same from line to line, while the value of creatinine varies and is treated as
1.3 over the interval $(0, 15]$, 1.5 over $(15, 46]$, etc.  The intervals are
open on the left and closed on the right, which means that the creatinine
is taken to be 1.3 on day 15.
The status variable describes whether or not each interval ends in an event,
and will by definition be 0 for all rows of a subject except the last.

Only later did we realize that this form was useful in other ways.
Avital Cnaan pointed out fairly early a use for time-dependent strata.  She had a
data set where the outcome was the use of particular social services, with
the analysis stratified by county, as they had different policies.  Some subjects
had moved partway through.
A bit later we realized that the form could be used for repeated events, i.e.,
the Andersen-Gill model; the only change is that now some subjects will have
multiple rows with a status of 1.
Multistate models are the most recent passenger, and a bit more complex.

One common question with this data setup is whether we need to worry about
correlated data, since a given subject has multiple observations.
The answer is no, we do not.  The reason is that this representation is
simply a programming trick.
The likelihood equations at any time point
use only one copy of any subject, the program picks out the correct
row of data at each time.
There is on exception to this rule, i.e., 
when a subject appears in overlapping intervals.  This however is 
almost always a data error, since it corresponds to two copies of the

\subsection{Building time-dependent sets with tmerge}
A useful function for building data sets is \code{tmerge}, which is
part of the survival library.
The idea is to build up a time dependent data set one endpoint at a time.
The primary arguments are
\begin{itemize}
  \item data1: the base data set that will be added onto
  \item data2: the source for new information
  \item id: the subject identifier in both data1 and data2
  \item \ldots: additional arguments that add variables to the data set
  \item tstart, tstop: used to set the time range for each subject
  \item options
\end{itemize}
The created data set has three new variables (at least), which are
\code{id}, \code{tstart} and \code{tstop}.

The key part of the call are the ``\ldots'' arguments 
which each can be one of four types:
tdc() and cumtdc() add a time dependent variable, event() and cumevent()
add a new endpoint.
Time intervals are treated as open on the left and
closed on the right, i.e.,  (tstart, tstop].
Time dependent covariates apply from the start of an interval and events
occur at the end of an interval.
If a data set already had intervals of (0,10] and (10, 14] a new time
dependent covariate or event at time 8 would lead to three intervals of
(0,8], (8,10], and (10,14];
the new time-dependent covariate value would be added to the second interval,
a new event would be added to the first one.

The basic form of the function is 
<<eval=FALSE, echo=TRUE>>=
newdata <- tmerge(data1, data2, id, newvar=tdc(time, value), ...)
@ 
Where \code{data1} is the starting data set and additions to the data are
taken from \code{data2}.
The idea behind the function is that each addition will be ``slipped in''
to the original data in the same way that one
would add a new folder into a file cabinet.
It is a complex function, and we illustrate it below with a set of
examples that sequentially reveal its features.

\subsubsection{CGD data set}
Chronic granulomatous disease (CGD) is a heterogeneous group of uncommon
inherited disorders characterized by recurrent pyogenic infections that
usually begin early in life and may lead to death in childhood.
In 1986, Genentech, Inc. conducted a
randomized, double-blind, placebo-controlled trial in 128 CGD patients who
received Genentech's humanized interferon gamma (rIFN-g) or placebo three %'
times daily for a year. Data were collected on all serious
infections until the end of followup,
which occurred before day 400 for most patients.
  One patient was taken off on the day of his last infection; all others
have some followup after their last episode.

Below are the first 10 observations, see the help page for \texttt{cgd0}
for the full list of variable names. The last few columns contain the 
duration of follow-up for the subject followed by infection times.
Subject 1 was followed for 414 days and had infections on days 219 and 373, 
subject 2 had 7 infections and subject 3 had none.
\small
\begin{verbatim}
  1 204 082888 1 2 12 147.0  62.0 2 2 2 2 414 219 373
  2 204 082888 0 1 15 159.0  47.5 2 2 1 2 439   8  26 152 241 249 322 350
  3 204 082988 1 1 19 171.0  72.7 1 2 1 2 382
  4 204 091388 1 1 12 142.0  34.0 1 2 1 2 388
  5 238 092888 0 1 17 162.5  52.7 1 2 1 1 383 246 253
  6 245 093088 1 2 44 153.3  45.0 2 2 2 2 364
  7 245 093088 0 1 22 175.0  59.7 1 2 1 2 364 292
  8 245 093088 1 1  7 111.0  17.4 1 2 1 2 363
  9 238 100488 0 1 27 176.0  82.8 2 2 1 1 349 294
 10 238 100488 1 1  5 113.0  19.5 1 2 1 1 371
\end{verbatim}
\normalsize
The data set is included as \code{cgd0} in the survival
library. 
Here is the R printout of the first four subjects.
<<>>=
cgd0[1:4,]
@ 

We want to turn this into a data set that has survival in a counting
process form.
\begin{itemize}
  \item Each row of the resulting data set represents a time interval
    (time1, time2] which is open on the left and closed on the right.
    Covariate values for each row are the covariate values that apply
    over that interval.
  \item The event variable for each row $i$ is 1 if the time interval ends
    with an event and 0 otherwise.
\end{itemize}

We don't need variables etime1--etime7 in the final data set, so they are
left out of the data1 argument in the first call.
<<cgd1>>=
dim(cgd0)
newcgd <- tmerge(data1=cgd0[, 1:13], data2=cgd0, id=id, tstop=futime)
newcgd <- tmerge(newcgd, cgd0, id=id, infect = event(etime1)) 
newcgd <- tmerge(newcgd, cgd0, id=id, infect = event(etime2)) 
newcgd <- tmerge(newcgd, cgd0, id=id, infect = event(etime3)) 
newcgd <- tmerge(newcgd, cgd0, id=id, infect = event(etime4)) 
newcgd <- tmerge(newcgd, cgd0, id=id, infect = event(etime5)) 
newcgd <- tmerge(newcgd, cgd0, id=id, infect = event(etime6)) 
newcgd <- tmerge(newcgd, cgd0, id=id, infect = event(etime7)) 
newcgd <- tmerge(newcgd, newcgd, id, enum=cumtdc(tstart))
dim(newcgd)
newcgd[1:5,c(1, 4:6, 13:17)]
summary(newcgd)
coxph(Surv(tstart, tstop, infect) ~ treat + inherit + steroids,
      data =newcgd, cluster = id)
@ 

These lines show the canonical way to use tmerge: each call adds one
more bit of information to the data set.
\begin{itemize}
  \item The first call sets the \emph{time range} for each subject to
    be from 0 (default) to last follow-up.  If a later call tried to
    add an event outside that range, at time = -2 say, that addition
    would be ignored.  The range can be set explicitly by using
    the tstop and (optional) tstart arguments, or implicitly
    as will be done in the heart transplant example below.
    This first result has \Sexpr{nrow(cgd0)} rows, the same number as
    \code{cgd0}.
  \item Each additional call then adds either an endpoint or a covariate,
    splitting individual rows of the input into two as necessary.
    An \code{event} or \code{cumevent} directive adds events, while
    a \code{tdc} or \code{cumtdc} command adds a time dependent covariate.
    Events happen at the ends of intervals and time-dependent covariates 
    change at the start of an interval.  
  \item Additions from \code{data2} with a missing time value are ignored.
  \item The result of \code{tmerge} is a data frame with a few extra
    attributes. One of these, tcount, is designed to help visualize
    the process, it is printed out by the summary function.
    Assume that a subject already had 3 intervals of (2,5), (5,10)
    and (14,40).
    A new event added at time 1 would be ``early'' while one at time 50
    is after any interval and would be recorded as ``late''.
    An event at
    time 3 is within an interval, one at 5 is on the border of two
    intervals, one at 14 is at the leading edge of an interval, one at
    time 10 in on the trailing edge, and one at time 11 is in a gap.
    In this data set all new additions fell strictly within prior intervals.
    We also see that etime6 and etime7 each added only a single event to 
    the data.  Only one subject had more than 5 infections.
  \item If two observations in data2 for a single person share exactly
    the same time, the created value will be the later contribution for
    tdc() or event() calls, cumtdc() and cumevent() will add.
    The ``tied'' column tells how often this happened; in some
    data sets this behavior might not be desired and one would need to
    break the ties before calling tmerge.
  \item If there are observations in \code{data2} that do not match any
    of the identifiers in \code{data1} they count in the `missid' column.
  \item The last tmerge call adds a
    simple time-dependent variable \code{enum} which is a running observation
    count for each subject.  This can often be a useful variable in later
    models or processing, e.g. \code{enum==1} selects off the first row
    for each subject.
  \item The extra attributes of the data frame are ephemeral: they will be
    lost as soon as any further manipulation is done.  This is intentional.
    The reason is that these may be invalid if, for instance, a subset of
    the data was selected.
  \item One can verify that the resulting data set is equivalent to
    \code{cgd}, a (start, stop] version of the CGD data in the
    survival library, which had been created by hand several years earlier.
\end{itemize}

The \code{tmerge} function processes arguments sequentially, and the 
above example can be rewritten as below.  There is no computational
advantage of one form versus the other.
<<cgd1b>>= 
test  <- tmerge(cgd0[, 1:13], cgd0, id=id, tstop=futime,
                 infect = event(etime1), infect= event(etime2),
                 infect = event(etime3), infect= event(etime4),
                 infect = event(etime5), infect= event(etime6),
                 infect = event(etime7))
test <- tmerge(test, test,  id= id, enum = cumtdc(tstart))
all.equal(newcgd, test)
@ 

This example is atypical in that we have used a \emph{wide} data set, one with
multiple outcomes per subject on a single line.
The \code{tmerge} function was primarily designed to work with long data.
An underlying issue is that when you create a new variable with \code{tmerge}
from multiple source variables, and those variables are not of exactly the 
same structure, then \code{tmerge} can become confused.  If \code{etime1}
was numeric and \code{etime2} had a difftime class for example. 
If issues arise create a long form data set and merge that in, as shown
below.  Both the data.table and tidyr libraries have well regarded functions
to do this; below we use the reshape function from base R.

<<cgd1c>>=
# create a long data set with the recurrences
temp <- reshape(cgd0[c(1, 14:20)], varying= 2:8, v.names="etime",
                 idvar="id", direction="long")
cgdrecur <- subset(temp, !is.na(etime))  # toss missings (not essential)

newcgd <- tmerge(data1=cgd0[, 1:13], data2=cgd0, id=id, tstop=futime)
newcgd <- tmerge(newcgd, cgdrecur, id=id, infect= event(etime))
@ 

\subsubsection{Stanford heart transplant}
The \code{jasa} data set contains information from the Stanford heart
transplant study, in the form that it appeared in the J American Statistical
Association paper of Crowley and Hu \cite{Crowley77}.
The data set has one line per subject
which contains the baseline covariates along with
dates of enrollment, transplant, and death or last follow-up.
We want to create \code{transplant} as a time dependent covariate.
As is often the case with real data, this data set contains a few anomalies 
that need to be dealt with when setting up an analysis data set.
\begin{enumerate}
  \item One subject died on the day of entry.  However (0,0) is an illegal
    time interval for the \code{coxph} routine.  
    It suffices to have them die on day 0.5.
    An alternative is to add 1 day to everyone's follow-up, e.g., subject
    2 who enrolled on Jan 2 1968 and died on Jan 7 would be credited with 6
    days.  (This is what Kalbfleisch and Prentice do in their
    textbook.)  The result of the final \code{coxph} call is the
     same from either strategy.

  \item A subject transplanted on day 10 is considered to have been on medical
    treatment for days 1--10 and as transplanted starting on day 11.
    That is, except for patient 38 who died on the same day as their procedure.
    They should be treated as a transplant death; the problem is resolved by
    moving this forward in time by .5 day.

  \item The treatment coefficients in table 6.1 of the 
    definitive analysis found in
    Kalbfleisch and Prentice \cite{Kalbfleisch02}
    will only be obtained if covariates are defined
    in precisely the same way, since their models include interactions.
    (Table 5.2 in the original edition of the book).
    For age this is (age in days)/ 365.25 - 48 
    years, and for year of enrollment it is the number of years since the
    start of the study: (entry date - 1967/10/1)/365.25. 
    (Until I figured this out I would get occasional ``why is coxph giving
    the wrong answers'' emails.)
\end{enumerate}
Since time is in days the fractional time of 0.5 could be any value
between 0 and 1, our choice will not affect the results.

<<stanford>>=
jasa$subject <- 1:nrow(jasa)  #we need an identifier variable
tdata <- with(jasa, data.frame(subject = subject,
                               futime= pmax(.5, fu.date - accept.dt),
                               txtime= ifelse(tx.date== fu.date,
                                              (tx.date -accept.dt) -.5,
                                              (tx.date - accept.dt)),
                               fustat = fustat
                             ))
xdata <- tmerge(jasa, tdata, id=subject,
                  death = event(futime, fustat),
                  transplant   =  tdc(txtime), 
                  options= list(idname="subject"))

sdata <- tmerge(jasa, tdata, id=subject,
                  death = event(futime, fustat),
                  trt   =  tdc(txtime), 
                  options= list(idname="subject"))
attr(sdata, "tcount")
sdata$age <- sdata$age -48
sdata$year <- as.numeric(sdata$accept.dt - as.Date("1967-10-01"))/365.25

# model 6 of the table in K&P
coxph(Surv(tstart, tstop, death) ~ age*trt + surgery + year, 
      data= sdata, ties="breslow")
@ 

This example shows a special case for the \code{tmerge} function that 
is quite common: if the first
created variable is an event then the time range for each subject is
inferred to be from 0 to that event time: explicit \code{tstop} 
and \code{tstart} arguments are not required.
It also makes use of a two argument form of \code{event}.
Each of the \code{event} and \code{cumevent} functions
may have a second argument, which if present will be used as the value
for the event code.
If this second argument is not present a value of 1 is used.
If an event variable is already a part of data1, 
then our updates make changes to that variable, possibly adding more events.
This feature is what allowed for the 
\code{infection} indicator to be build up incrementally in the cgd
example above.

The \code{tdc} and \code{cumtdc} arguments can have 1, 2 or three
arguments.
The first is always the time point, the second, if present, is the
value to be inserted, and an optional third argument is the initial value.
If the \code{tdc} call has a single argument the result is always a
0/1 variable, 0 before the time point and 1 after.
For the 2 or three argument form, the starting value \emph{before} the 
first definition of the new variable (before the first time point)
will be the initial value.
The default for the initial value is NA, the value of the 
\code{tdcstart} option.
If the \code{tdc} variable being created is already a part of data1,
the old value is replaced wholesale, it is not updated.
This differs from the behavior for events.
Although there is a use case for updating an existing time-dependent value, 
say from a new data source, our experience had been that the much more
common case was accidental reuse of an existing name, resulting in
a chimeric mishmash between an existing baseline covariate and the additions,
and creating a column which was valid for neither.
Equally, some examples showed that we can not reliably update a tdc, i.e., 
cases where the correct answer is unclear.

The \code{tcount} table for the above fit shows all the deaths at the
trailing edge of their interval, which is expected since the time
of death or last follow-up was
used to define each subject's interval of risk.
Two of the transplants happened on day 0
and are listed as occurring on the leading edge of the first
follow-up interval for the subject.  
The other 67 transplants were strictly within the (0, last follow up)
interval of each subject.

\subsubsection{PBC data}
The \code{pbc} data set contains baseline data and follow-up status
for a set of subjects with primary biliary cirrhosis, while the
\code{pbcseq} data set contains repeated laboratory values for those
subjects.  
The first data set contains data on 312 subjects in a clinical trial plus
106 that agreed to be followed off protocol, the second data set has data
only on the trial subjects.

<<pbc>>=
temp <- subset(pbc, id <= 312, select=c(id:sex, stage)) # baseline
pbc2 <- tmerge(temp, temp, id=id, death = event(time, status)) #set range
pbc2 <- tmerge(pbc2, pbcseq, id=id, ascites = tdc(day, ascites),
               bili = tdc(day, bili), albumin = tdc(day, albumin),
               protime = tdc(day, protime), alk.phos = tdc(day, alk.phos))
fit1 <- coxph(Surv(time, status==2) ~ log(bili) + log(protime), pbc)
fit2 <- coxph(Surv(tstart, tstop, death==2) ~ log(bili) + log(protime), pbc2)
rbind('baseline fit' = coef(fit1),
      'time dependent' = coef(fit2))
@ 
We start the build with a baseline data set that has a subset of
the variables.  
This is due to my own frugality --- I happen to like data sets that are 
more trim.  It is not a requirement of the tmerge function, however,
and a user is certainly free to skip the first step above and build
\code{pbc2} directly from data set \code{pbc}.

The coefficients of bilirubin and prothrombin time are somewhat larger in
the time-dependent analysis than the fit using only baseline values.  
In this autoimmune disease there is steady progression of liver damage,
accompanied by a steady rise in these two markers of dysfunction.  
The baseline analysis captures patients' disease status at the start, the
time-dependent analysis is able to account for those who progress more quickly.
In the pbc data set the status variable is 0= censored, 1= liver transplant 
and 2= death; the above analyses were models of time to death, censoring
at transplant.  
(At the time of the PBC study liver transplantation was still in its infancy and
it is fair to view the 19/312 subjects who received the procedure as a
random sample. 
In the modern era there are far more waiting recipients than organs and
available livers are directed to those patients who illness is most dire;
censoring at transplant would not lead to an interpretable result.)

By default \code{tmerge} ignores any updates from \code{data2} that have
a missing value for either the time or the value.
In the pbcseq data set there are several observations with a 
missing alkaline phosphotase value.
A consequence of this behavior is that the pbc2 data set effectively
uses ``last value carried forward'' values for alk.phos, replacing those
missing values. 
Subject 6 for instance has a total follow-up of 2503, and alk.phos values
of 682 and NA on days 1492 and 2453, respectively; in the final data
set it is coded 682 from day 1492 until last follow up.
One can change this default by adding 
\code{options=list(na.rm=FALSE)} to the second call above, in which case 
the alkaline phosphotase value over the
interval (2453, 2503] will become missing.
Any \code{tdc} calls with a missing time are still ignored, independent
of the na.rm value, since we would not know where to insert them.

<<>>=
attr(pbc2, "tcount")
@
<<echo=FALSE>>=
#grab a couple of numbers for the paragraph below
atemp <- attr(pbc2, "tcount")[2:3,]
@ 
The tcount results are interesting.  For the first addition of ascites we
have \Sexpr{atemp[1, 'leading']} observations on a leading edge of follow up,
which is all of
the baseline lab values at time 0,  
and \Sexpr{atemp[1, 'within']} further additions within the subjects' follow-up
interval.  The latter cause a new break point to be added at each of these
intermediate laboratory dates,
for subsequent additions these \Sexpr{atemp[1, 'within']} times lie on
a boundary of two intervals.  
Another \Sexpr{atemp[1, 'late']} non-missing alkaline phosphotase
values occurred after the last follow-up date
of the pbc data set and are ignored.  
Bilirubin is missing on no subjects, so its addition creates a few more
unique break points in the follow-up, namely those clinical visits for
which the ascites value was missing.

The data for the pbcseq data set was assembled at a later calendar time
than the primary data set.
Since having lab test results is a certain marker that the patient is still
alive,
would a  better analysis have used this test information
to extend the last follow-up date for these \Sexpr{atemp[2,'late']} 
``late'' subjects
with a later laboratory date?
Not necessarily.

Odd things happen in survival analysis when risk sets are extended piecemeal.
A basic tenet of the Cox model is that if someone is
marked as being ``at risk'' over some interval $(s, t)$, this means that 
``if they had had an event over that interval, we would have recorded it.''
Say someone ended their initial follow-up time at 3000 days and then had a
lab test at 3350 days (subjects returned about once a year).
If we only extend the time of those who had a test, then saying that this
subject was at risk during the interval (3000, 3350) is false: if they
had died in that interval, they would not have had the lab test and would not
obtained the extension, nor would their death have been updated in the
original \code{pbc} data set.
The cutoff rule of \code{tmerge} is purposefully conservative to avoid
creating such anomalies.

In the case of the PBC data set this author happens to know that active
follow-up \emph{was} continued for all subjects, both those that did and did not
return for further laboratory tests.  This updated follow-up information is
included in the pbcseq data and could have been used to set a wider 
time range.  Such is not always the case, however.  
Automatic additions to a data set via electronic systems can be particularly
troublesome.  
One case from the author's experience involved a study of patient outcomes
after organ transplant. Cases
were actively followed up for 3 years, at which time priorities shifted and 
the clerical staff responsible for the active follow-up were reassigned.  
Automatic updates
from a state death index continued to accumulate, however.  A Kaplan-Meier
curve computed at 5 years showed the remarkable result of a 3 year survival 
of .9 followed by a precipitous drop to 0 at 5 years!
This is because there was, by definition, 100\% mortality
in all those subjects with more than 3 years of supposed follow-up.

\subsubsection{Time delay and other options}
The \code{options} argument to the tmerge routine is a list with
one or more of the following five elements, listed below along with their
default values.
\begin{itemize}
  \item idname = 'id'
  \item tstartname = 'tstart'
  \item tstopname = 'tstop'
  \item na.rm = TRUE
  \item delay  = 0
\end{itemize}
The first three of these are the variable names that will be used
for the identifier, start, and stop variables which are added to 
the output data set. 
They only need to be specified one time within a series of tmerge calls in
order to effect a change.
The na.rm option has been discussed above; it affects tdc() and
cumtdc() directives within a single tmerge call.
The delay option causes any tdc
or cumtdc action in the tmerge call to be delayed by a fixed amount.
The final two tmerge calls below are \emph{almost} identical in their action:
<<echo=TRUE, eval=FALSE>>=
temp <- subset(pbc, id <= 312, select=c(id:sex, stage))
pbc2 <- tmerge(temp, temp, id=id, death = event(time, status))
pbc2a <- tmerge(pbc2, pbcseq, id=id, ascites = tdc(day, ascites),
               bili = tdc(day, bili), options= list(delay=14))
pbc2b <- tmerge(pbc2, pbcseq, id=id, ascites = tdc(day+14, ascites),
               bili = tdc(day+14, bili))
@ 
The difference between \code{pbc2a} and \code{pbc2b}
is that the first call does not defer
baseline values for each subject, i.e., any value with a time that is
on or before the the  subject's first time point, as that will introduce 
intervals with a missing value into the result.

The more important question is \emph{why} one would wish to delay
or lag a time dependent covariate.
One reason is to check for cases of reverse causality.  It is sometimes
the case that a covariate measured soon before death is not a predictor
of death but rather is simply a marker for an event that is
already in progress.
A simple example would the the time dependent covariate ``have called
the family for a final visit''.  
A less obvious one from the author's experience occurs when a clinical
visit spans more than one day, the endpoint is progression, and one or more
laboratory results that were used to define ``progression'' get recorded in
the data set 1-2 days before the progression event.  
(They were perhaps pulled automatically from a laboratory information system).
One then ends up with the tautology of a test value predicting its own result.

Even more subtle biases can occur via coding errors.
For any data set containing constructed time-dependent covariates,
it has become the author's practice to re-run the analyses after adding 
a 7-14 day lag to key variables.
When the results show a substantial change, and this is not infrequent,
understanding why this occurred is an critical step.  
Even if there is not an actual error, one has to question the value of a
covariate that can predict death within the next week but fails 
for a longer horizon. 

\subsubsection{Cumulative events}
The action of the \code{cumevent} operator is different than \code{cumtdc}
in several ways.  Say that we have a subject with outcomes of one
type at times 5, 10, and 15 and another type at times 6 and 15, with a
follow-up interval of 0 to 20. 
For illustration I'll call the first event 'asthma' and the second 'IBD' 
(a disease
flare in inflammatory bowel disease).  A resulting data set would have the
following form:
\begin{center}
  \begin{tabular}{rcccc}
    &\multicolumn{2}{c}{cumtdc} & \multicolumn{2}{c}{cumevent} \\
interval  & asthma & IBD & asthma & IBD \\ \hline
(0, 5] & 0 &0 & 1 & 0 \\
(5, 6] & 1 &0 & 0 & 1 \\
(6, 10]& 1 &1 & 2 & 0 \\
(10, 15] & 2& 1& 3 & 2 \\
(15, 20] & 3& 2& 0 & 0
\end{tabular}
\end{center}
Events happen at the ends of an interval and time-dependent covariates
change the following intervals. 
More importantly, time-dependent covariates persist while events
do not, a \code{cumevent} action simply changes the label attached
to an event.  

\subsubsection{REP}
The motivating case for \code{tmerge} came from a particular problem:
the Rochester Epidemiology Project has tracked all subjects living in
Olmsted County, Minnesota, from 1965 to the present.  
For an investigation of cumulative comorbidity we had three data sets
\begin{itemize}
  \item base: demographic data such as sex and birth date
  \item timeline: one or more rows for each subject containing 
    age intervals during which they were a resident of the county.  
    The important variables are id, age1 and age2; each (age1, age2)
    pair marks an interval of residence. Disjoint intervals are not
    uncommon.
  \item outcome: one row for the first occurrence of each outcome of interest.  
    The outcomes were 20 comorbid conditions as defined by a particular
    research initiative from the National Institutes of Health.
\end{itemize}

The structure for building the data is shown below.
(The data for this example unfortunately cannot be included with the survival
library so the code is shown but not executed.)

<<rep, eval=FALSE, echo=TRUE>>=
newd <- tmerge(data1=base, data2=timeline, id=repid, tstart=age1, 
               tstop=age2, options(id="repid"))
newd <- tmerge(newd, outcome, id=repid, mcount = cumtdc(age))
newd <- tmerge(newd, subset(outcome, event='diabetes'), 
               diabetes= tdc(age))
newd <- tmerge(newd, subset(outcome, event='arthritis'), 
               arthritis= tdc(age))
@ 
The first call to tmerge adds the time line for each observation to the
baseline data.
For this first call both data1 and data2 
must contain a copy of the
id variable (here \code{repid}), and data1 is constrained to have only a
single line for each id value. (Subjects have a single baseline.) 
Each subsequent call adds a new variable to the data set.
The second line creates a covariate which is a cumulative count of
the number of comorbidities thus far for each subject.  
The third line creates a time dependent covariate (tdc) which will
be 0 until the age of diabetes and is 1 thereafter, the fourth
line creates a time dependent variable for the presence of arthritis.

Time dependent covariates that occur before the start of a subject's
follow-up interval or during a gap in time do not generate a new time
point, but they do set the value of that covariate for future times.
Events that occur in a gap are not counted.
The rationale is that during a subject's time within the county we would
like the variable ``prior diagnosis of diabetes'' 
to be accurate, even if that diagnosis occurred during a prior period when
the subject was not a resident. 
For events outside of the time line, we have no way to know who the appropriate
comparison group is, and so must ignore those events.  
(Formally, the risk set would be the set of
all non-residents who, if they were to have had an event at the same age, 
we would find out about it
because they will later move to the county, have a medical encounter here,
and have that event written into the ``prior
conditions'' section of their medical record.)

\subsection{Timeline data}
A timeline data set for time-dependent covariates and/or multiple events
has a simpler form.
Each row must contain a subject identifier and a time, followed by the
covariate values known at that time.  
There is no restriction on the names of these two variables, and there
can be no duplicates of the (id, time) pair.
(Violations of the latter will only be found when the data set is used
in for instance a survfit or coxph call.)

The timeline form is in some sense more natural in that it can be created
using standard data manipulation functions. 
For simple survival data is a bit more clumsy since every subject will
need to have two rows of data.
For example the first 3 subjects of the PBC data set will appear as
(showing only some of the covariates)
<<timeline1, echo=FALSE>>=
p3 <- data.frame(id=c(1,1,2,2,3,3), status=c(0,2,0,0,0,2), 
                 time=c(0,400, 0, 4500, 0, 1012),
                 sex=c('f', 'f', 'f', 'f', 'm', 'm'),
                 bili=c(14.5, NA, 1.1, NA, 1.4, NA),
                 chol=c(261, NA, 302, NA, 176, NA))
print(p3, naprint='.')
@ 
Bilirubin and cholesterol levels at the moment of death are of course
not known, so show as NA.
A covariate which is measured at baseline but then remains constant such
as sex or age at enrollment will repeat on every line, though the coxph
function allows it to also be given only once.

Here is the timeline form of the PBC data set.
In the temp data set we retain the last follow-up time along with the
three fixed covariates of treatment, age at enrollment, and sex.

<<pbctimeline>>=
pbctime <- subset(pbcseq, select= -c(futime, status))
pbctime$status <- 0
temp <- subset(pbcseq, !duplicated(id), c(id, trt, age, sex, futime, status))
names(temp)[5] <- "day"  # change futime to day
pbctime <- merge(pbctime, temp, all=TRUE)
subset(pbctime, id<3, c(id, trt, sex, day, status, edema, bili, chol))
@ 

\section{Time dependent coefficients}
Time dependent covariates and time dependent coefficients are two
different extensions of a Cox model, as shown in the two equations
below.
\begin{align}
  \lambda(t) &= \lambda_0(t) e^{\beta X(t)} \label{tdcovar} \\
  \lambda(t) &= \lambda_0(t) e^{\beta(t) X} \label{tdbeta} 
\end{align}
Equation \eqref{tdcovar} is a time dependent covariate, a common
and well understood usage.
Equation \eqref{tdbeta} has a time dependent coefficient.
These models are much less common, but represent one way to deal with
non-proportional hazards -- the proportional hazard assumption is
precisely that  $\beta(t) = \beta$, i.e., the coefficient does not change 
over time.
The \code{cox.zph} function will plot an estimate of $\beta(t)$ for
a study and is used to diagnose and understand non-proportional
hazards.
The code below shows a test case using the veterans cancer data, with the
results plotted in figure \ref{veteran1b}.

<<veteran1>>=
options(show.signif.stars = FALSE)  # display statistical intelligence
vfit <- coxph(Surv(time, status) ~ trt + prior + karno, veteran)
vfit
quantile(veteran$karno)

zp <- cox.zph(vfit, transform= function(time) log(time +20))
zp
@

\begin{figure}
<<veteran1b, fig=TRUE, echo=FALSE>>=
plot(zp[3], resid=FALSE)    # a plot for the 3rd variable in the fit
abline(0,0, lty=3)
abline(h= vfit$coef[3], lwd=2, lty=3)
@ 
 \caption{Solid line = the estimated effect if Karnofsy score, versus time, 
   in the Veteran's cancer data set, as estimated by \code{cox.zph}, along
   with confidence intervals.  Dotted lines are the overall estimate from
   a Cox model with karno as a time-fixed effect, and a reference line at 0.}
 \label{veteran1b}
\end{figure}

Karnofsky score is a very important predictor, but its effect is not
constant over time as shown by both the test and the plot. 
Early on it has a large negative
effect: the risk for someone at the first quartile is
approximately exp(35*.05) = 5.7 fold times that of someone at
the third quartile, but by 200 days this has waned and is not much
different from one.  
One explanation is that, in this very acute illness, any measure
that is over 6 months old is no longer relevant.
Another is that a large portion of subjects with the lowest Karnofsky
values have been lost; at 200 days 4\% of the remaining subjects had
a initial value $\le 40$ versus 28\% at the start.

The proportional hazards model estimates an average hazard over time,
the value of which is shown by the dotted horizontal line.
The use of an average hazard is often reasonable, the proportional hazards
assumption is after all never precisely true.
In this case, however, the departure is quite large and a time dependent
coefficient is a more useful summary of the actual state.
The cox.zph plot is excellent for diagnosis but does not, however,
produce a formal fit of $\beta(t)$.
What if we want to fit the model?

\subsection{Step functions}
One of the simplest extensions is a step function for $\beta(t)$, i.e.,
different coefficients over different time intervals.
An easy way to do this is to use the \code{survSplit} function to
break the data set into time dependent parts.
We will arbitrarily divide the veteran's data into 3 epochs of the first
3 months, 3-6 months, and greater than 6 months.
<<split>>=
vet2 <- survSplit(Surv(time, status) ~ ., data= veteran, cut=c(90, 180), 
                  episode= "tgroup", id="id")
vet2[1:7, c("id", "tstart", "time", "status", "tgroup", "age", "karno")]
@ 
The first subject died at 72 days, his data is unchanged.
The second and third subjects contribute time to each of the three
intervals.
<<split2>>=
vfit2 <- coxph(Surv(tstart, time, status) ~ trt + prior +
                  karno:strata(tgroup), data=vet2)
vfit2
cox.zph(vfit2)
@ 
A fit to the revised data shows that the effect of baseline Karnofsky score
is essentially limited to the first two months.
The \code{cox.zph} function shows no further time dependent effect
of Karnofsky score.
This last is of course no surprise, since we used the original graph to
pick the cut points.  
A ``test'' that the coefficients for the three intervals are different
will be biased by this sequential process and must be viewed with
caution.

Survival curves post fit require a little more care.
The default curve uses the mean covariate values, 
which is always problematic and completely useless in this case.  
Look at the set of saved means for the model:
<<split3>>=
vfit2$means
@ 
The default curve will be for someone on treatment arm
\Sexpr{round(vfit2$means[1], 2)},  %$
which applies to no one,
and a single set of ``blended'' values of the Karnofsky score, each times
the three Karnofsky coefficients.
This is rectified by creating a new data set with time intervals.
We create two hypothetical subjects labeled as ``curve 1'' and ``curve 2''.
One has a Karnofsky of 40 and the other a Karnofsky of 75, both on
treatment 1 and with no prior therapy.
Each has the appropriate values for the time dependent covariate \code{tgroup},
which is 1, 2, 3 over 0--90, 90--180 and 180+ days, respectively.
We will draw curves for the first year.

<<split4, fig=TRUE>>=
quantile(veteran$karno)
cdata <- data.frame(tstart= rep(c(0,90,180), 2),
                    time =  rep(c(90,180, 365), 2),
                    status= rep(0,6),   #necessary, but ignored
                    tgroup= rep(1:3, 2),
                    trt  =  rep(1,6),
                    prior=  rep(0,6),
                    karno=  rep(c(40, 75), each=3),
                    curve=  rep(1:2, each=3))
cdata
sfit <- survfit(vfit2, newdata=cdata, id=curve)
km <- survfit(Surv(time, status) ~ I(karno>60), veteran)
plot(km, xmax= 365, col=1:2, lwd=2, 
     xlab="Days from enrollment", ylab="Survival")
lines(sfit, col=1:2, lty=2, lwd=2)
@ 

The default behavior for survival curves based on a coxph model is to
create one curve for each line in the input data; the \code{id}
option causes it to use a set of lines for each curve.
Karnofsky scores at the 25th and 75th percentiles roughly represent the 
average score for the lower half of the subjects and that for the upper half,
respectively, and are plotted over the top of the Kaplan-Meier
curves for those below and above the median.


\subsection{Continuous time-dependent coefficients}
If $\beta(t)$ is assumed to have a simple functional form we can 
fool an ordinary Cox model program in to doing the fit.
The particular form $\beta(t) = a + b\log(t)$ has for instance often
been assumed.  
Then $\beta(t) x = ax + b \log(t) x = ax + b z$ for the special time
dependent covariate $z = \log(t) x$.
The time scale for the \code{cox.zph} plot used
further above of $\log(t + 20)$ was chosen
to make the first 200 days of the plot roughly linear.
Per the figure this simple linear model does not fit over the entire range, 
but we will forge ahead and use it as an example anyway.
(After all, most users who fit the log(t) form, based on seeing it a 
paper they once read, don't bother to even look at a plot.)
An obvious but incorrect approach is
<<vfit3, echo=TRUE, eval=FALSE>>=
vfit3 <- coxph(Surv(time, status) ~ trt + prior + karno +
                I(karno * log(time + 20)), data=veteran)
@ 
This mistake has been made often enough the the \code{coxph} routine has been
updated to print an error message for such attempts.
The issue is that the above code does not actually create a time dependent 
covariate, the principle reason being that the program cannot read your mind.
There are two different times: the fixed value \code{time} in the data
set, which contains the duration to death or follow-up for each subject,
and the continuous variable $t$ that appears in the mathematical equations
as $X(t)$, $N(t)$, etc. 
As a user you were thinking of fixed duration on the left and $t$ on the right.

The code above assumes the fixed value for both, and creates a time-static 
value for each subject based on their value
for the covariate \code{time}; no differently than if we had
constructed the variable \verb+veteran$zed <- veteran$time+ 
outside of the \code{coxph} call and used \code{log(zed)} in the equation.
This variable most definitely breaks the rule about not looking into the
future, and if we fit the model one would quickly find the circularity: 
large values of 
\code{time} appear to predict long survival, because long survival leads to 
large values for \code{time}.

This same coding dichotomy exists in SAS phreg, by the way.  Adding
\code{time} to the right hand side of the model statement will create
the time-fixed (incorrect) variable,
while a programming statement within phreg that 
uses \code{time} as a variable will generate time-dependent objects.
The error is less likely there because phreg's model statement is less
flexible than R, i.e., you cannot simply write
``log(time)'' on the right hand side of a formula for instance.

To deal with this we have four approaches.
The first, but unwise idea might be to use \code{tmerge} to break the data
set into a zillion small intervals, approximating continuous time.
However, recognize that the underlying Cox likelihood will only be evaluated at
the observed event times, thus it is sufficient to split at those times:

<<vfit3b>>=
vdeath <- sort(unique(veteran$time[veteran$status==1]))
length(vdeath)
vet3 <- survSplit(Surv(time, status) ~ ., data= veteran, cut=vdeath, 
                  episode= "tgroup", id="id")
quantile(table(vet3$id))
@ 

The 97 cutpoints lead to 98 unique time intervals, and not surprisingly
an average subject appears in about half of them. 
To avoid the warning from \code{coxph} we create a second copy of the 
time variable, since the warning will be spurious in this case.

<<vfit3c>>=
vet3$xtime <- vet3$time
vfit3 <- coxph(Surv(tstart, time, status) ~ trt + prior + karno +
                   I(karno * log(xtime+20)), vet3)
plot(zp[3])
abline(coef(vfit3)[3:4], col=2, lty=2)
@ 

The time dependent coefficient is estimated to be
$\beta(t) =$ \Sexpr{round(coef(vfit3)[3], 3)} + 
 \Sexpr{round(coef(vfit3)[4], 3)} * log(t + 20).
We can add said line to the \code{cox.zph} plot.
Not surprisingly, the result is rather too high for time $>$ 300
and underestimates the steeper part of the slope.
Still the fit is better than a horizontal line, 
i.e., a proportional hazards assumption, as confirmed
by the p-value for the slope term in \code{vfit3}.
(The p-value from cox.zph for karno is nearly identical, as it
must be, since the tests in cox.zph are for a linear effect on the
chosen time scale.)

The above can be done automatically by using the \emph{time-transform}
functionality of coxph.
<<vet3d>>=
vfit3b <-  coxph(Surv(time, status) ~ trt + prior + karno + tt(karno),
                data=veteran,
                tt = function(x, t, ...) x * log(t+20))
rbind(coef(vfit3b), coef(vfit3))
@ 

Realize that it is not necessary to use all 97 cut points: 50 or even 25
will be more than sufficient to get an accurate approximation. 
This is particularly important with large data sets, there are for instance
\Sexpr{length(unique(nafld2$days))} unique days with a test result and
\Sexpr{nrow(nafld1)} subjects, leading to a potential expanded data set of 
over 100 million observations.  Think first and use a rational grid of
cutpoints for \code{survSplit}.

We no longer recommend the use of the tt functionality for this reason,
it always does the full expansion. Another reason is that the resulting fit
is a dead end: the expanded data set created internally
to coxph is not available after the fit, so residuals and predicted values
are not available --- they would be residuals for exactly what observations?

As compared to fitting a single term that is linear in a predetermined
function of time, i.e., the test for PH done by \code{cox.zph},
a more general approach is to allow a flexible fit on the right hand
side, mimicing the approximate plot performed by \code{plot.cox.zph}.
This is shown in the following.

<<vet4>>=
ns2 <- function(x) ns(x, knots=c(100, 200), Boundary.knots = c(5,400))
vfit4 <-  coxph(Surv(tstart,time, status) ~ trt + prior + karno + 
                    karno:ns2(xtime), data=vet3)
vfit4

# termplot does not create sensible estimates for interactions (per its
#  help page), so do it by hand for the karno term
smat <- cbind(1, ns2(5:400)) # a range of plotting values
estimate <- drop(smat %*% coef(vfit4)[3:6])
sd.est <- sqrt(diag(smat %*% vcov(vfit4)[3:6, 3:6] %*% t(smat)))
yy <- estimate + outer(sd.est, c(0, -1.96, 1.96), "*")
matplot(5:400/30.5, yy, type='l', lty=c(1,2,2), col=1, lwd=c(2,1,1),
        xlab="Months", ylab= "Estimated beta(t) for karno")
abline(0,0, lty=3)
@

We see a large decrease in effect over the first 5 months, to near zero;
with very wide CI for any changes thereafter.

The use of programming statements to create a yes/no variable deserves
extra comment because of a common error.
Assume one has a data set with two time variables: the time from enrollment
to last follow up and the time to diabetes, say, and we want to use the
presence of diabetes as time dependent covariate.
Here is a small made up example.
<<ties1>>=
data1 <- read.table(col.names=c("id", "diabetes", "lfu", "status"),
                          header=FALSE, text="
1   5 30  1
2  10 15  1
3  NA 60  0
4  NA 80  1
5  10 80  0
6  NA 90  1
7  30 95  1
")

data1$d2 <- pmin(data1$diabetes, 300, na.rm=TRUE) #replace NA with 300
fit1 <- coxph(Surv(lfu, status) ~ tt(d2), data=data1,
              tt = function(d2, t, ...) ifelse(t > d2, 1, 0))
fit2 <- coxph(Surv(lfu, status) ~ tt(d2), data=data1,
              tt = function(d2, t, ...) ifelse(t < d2, 0, 1))
c(coef(fit1), coef(fit2))
@

One might have expected fit1 and fit2 to give the same coefficient but 
they are completely different. 
The issue is subject 7 whose time to diabetes falls exactly at one of the
event times.  In fit1 their diabetes covariate effectively changes just 
after the event at time 30,
in fit2 it changes at the event.  The second is incorrect.
Stated in the gambling context, all roulette bets must be placed
\emph{before} the ball lands in the slot, or they apply to the next spin of
the wheel.  
Likewise, a change in diabetes status needs to apply to the next event time.

When a data set is expanded into a start, stop format using the
\code{tmerge} function ties are dealt with correctly.
<<ties2>>=
data2 <- tmerge(data1, data1, id=id, dstat=event(lfu, status),
                diab = tdc(diabetes))
subset(data2, id %in% c(1,7), c(id, tstart:diab))               
fit3 <- coxph(Surv(tstart, tstop, dstat) ~ diab, data2)
c(coef(fit1), coef(fit2), coef(fit3))
@ 


\section{Predictable time-dependent covariates}
Occasionally one has a time-dependent covariate whose values in the future
are predictable.
The most obvious of these is patient age, occasionally this may also be
true for the cumulative dose of a drug; say that subjects are assigned to
take one 5mg tablet a day (and they keep their schedule).
If age is entered as a linear term in the model, then the effect of changing
age can be ignored in a Cox model, due to the structure of the partial
likelihood.  Assume that subject $i$ has an event at time $t_i$, with other
subject $j \in R_i$ at risk at that time, with $a$ denoting age.
The partial likelihood term is
\begin{equation*}
  \frac{e^{\beta * a_i}}{\sum_{j \in R_i} e^{\beta* a_j}} =
  \frac{e^{\beta * (a_i + t_i)}}{\sum_{j \in R_i} e^{\beta* (a_j + t_i)}} 
\end{equation*}
We see that using current age (the right hand version) or age at
baseline (left hand), the partial likelihood term is identical since
$\exp(\beta t_i)$ cancels out of the fraction.
However, if the effect of age on risk is \emph{non-linear}, this
cancellation does not occur.

Since age changes continuously, we would in theory need a very large
data set to completely capture the effect, one interval
per day to match the usual resolution for death times.
In practice this level of resolution is not necessary; though we all
grow older, risk does not increase so rapidly that we need to know
our age to the day. 
As an example, below we fit twice, once divided at every death time,
and once with an update per year.

<<expand>>=
dtimes <- sort(unique(with(pbc, time[status==2])))
tdata1 <- survSplit(Surv(time, status==2) ~., pbc, cut=dtimes)
tdata1$c.age <- tdata1$age + tdata1$time/365.25 -50 #current age, centered at 50
pfit1 <- coxph(Surv(tstart, time, event) ~ log(bili) + ascites + c.age +
    I(c.age^2) + I(c.age^3), data=tdata1)

tdata2 <- survSplit(Surv(time, status==2) ~., pbc, cut=1:12 * 365.25)
tdata2$c.age <- tdata2$age + tdata2$time/365.25 -50 #current age, centered at 50
pfit2 <- coxph(Surv(tstart, time, event) ~ log(bili) + ascites + c.age +
    I(c.age^2) + I(c.age^3), data=tdata2)

rbind(coef(pfit1), coef(pfit2))
c(nrow(pbc), nrow(tdata1), nrow(tdata2))
@

We have almost the same result but at 1/20th the data set size.  
This can make a substantial difference in compute time for large data sets.
Exactly how much to `thin down' the list of cutpoints can be based on 
subject specific knowledge, one could choose selected points based on a
\code{cox.zph} plot, or start with a very coarse time grid and then sequentially 
refine it.
In this data set we reasoned that although risk increases with age, a
age change of less than one year would have minimal impact.

For the veteran data set 131/137 of the observation times are at 13 months or
less.  Look at cutting this into 14 different points:the interval starting at
0, at 1 month, at 2 months, \ldots, 13 for construction of the time-dependent
Karnofsy coefficient.  This does not categorize the follow-up times themselves,
only the time-dependent effect, so we don't lose precision in the response.

<<veteran3, fig=TRUE>>=
dtime <- round(1:13 * 30.5)
vdata2 <- survSplit(Surv(time, status) ~ ., veteran, cut=dtime,
                    episode= "month")
vfit1 <- coxph(Surv(tstart, time, status) ~ trt + prior + karno, vdata2)
vfit5 <- coxph(Surv(tstart, time, status) ~ trt + prior + karno + 
                   karno:ns(month, df=3), vdata2)
anova(vfit1, vfit5)

tdata <- expand.grid(trt=0, prior=0, karno=30, month=seq(1,13, length=50))
yhat <- predict(vfit5, newdata=tdata, se.fit=TRUE, reference="zero")
yy <- yhat$fit+ outer(yhat$se.fit, c(0, -1.96, 1.96), '*')
matplot(seq(1,13, length=50), yy, type='l', lty=c(1,2,2), col=1, lwd=c(2,1,1),
        xlab="Month of fu", ylab="Effect, Karnofsky 60 vs 90")
@ 

Because the spline is now a direct part of the model downstream
functions such as \code{anova} work properly, and it is straightforward to
create a plot of the of the fit.
A Cox model produces \emph{relative} risk, and the default for the predict
function is to use the mean covariates as our reference.  We wanted to
assess the effect of a 30 point change in Karnofsy, holding the other covariates
fixed: values of 0 plus a reference of 0 accomplished this.

For these small data sets the tt() approach and a subset of times are both
feasable, but for a large data set the first can become very slow, or even
run out of resources.  The reason is that a
a Cox model fit requires computation of a weighted mean and variance of
the covariates at each event time, a process that is inherently 
$O(ndp^2)$ where $n$ = the sample size, $d$ = the number of events and $p$=
the number of covariates.  
Much of the algorithmic effort in the underlying code is to use clever 
updating methods for
the mean and variance matrices, reducing the compute time to $O(np + d p^2)$.
For a large data set the tt() approach can produce an intermediate data
set with $O(nd)$ rows, however, overcoming the computation.
For even moderate data sets the impact of $(nd)p$ vs $np$ can be surprising.

There are other interesting uses for the time-transform capability.
One example is O'Brien's logit-rank test procedure \cite{Obrien78}.
He proposed replacing the covariate at each event time with a logit
transform of its ranks.  
This removes the influence of any outliers in the predictor $x$.
For this case we ignore the event time argument and concentrate on the
groupings to create the following tt function.
<<>>=
function(x, t, riskset, weights){ 
    obrien <- function(x) {
        r <- rank(x)
        (r-.5)/(.5+length(r)-r)
    }
    unlist(tapply(x, riskset, obrien))
}
@ 
This relies on the fact that the input arguments to tt() are ordered by the 
event number or risk set.
This function is used as a default if no tt argument is present in the
coxph call, but there are tt terms in the model formula.
(Doing so allowed me to depreciate the survobrien function).

Another interesting usage is to replace the data by simple ranks, not
rescaled to 0--1.
<<>>=
function(x, t, riskset, weights) 
    unlist(tapply(x, riskset, rank))
@ 
The score statistic for this model is $(C-D)/2$, where $C$ and $D$ are the
number of concordant and discordant pairs, see the
concordance function.  
The score statistic from this fit is then a test for significance of the
concordance statistic, and is in fact the basis for an approximate 
standard error for the concordance.
The O'Brien test can be viewed as concordance statistic that gives equal  %'
weight to each event time, whereas the standard concordance weights each
event proportionally to the size of the risk set.

\bibliographystyle{plain}
\bibliography{refer}
\end{document}