\documentclass{article}[11pt]
\usepackage{Sweave}
\usepackage{amsmath}
\addtolength{\textwidth}{1in}
\addtolength{\oddsidemargin}{-.5in}
\setlength{\evensidemargin}{\oddsidemargin}
%\VignetteIndexEntry{Model frames and the Survival package}
\newcommand{\code}[1]{\texttt{#1}}

\title{Model frames and the Survival package}
\author{Terry M Therneau}

\begin{document}
  \maketitle

The modeling functions in the survival package (aareg, coxph, pyears, survexp,
survfit, and survreg) differ from almost every other package in R by having
\code{model=FALSE} as the default.
This has two major consequences for a user.
\begin{itemize}
  \item When a follow up computation arises that needs something else from the
data frame, something that was not saved, the routine will need to rebuild the
model frame.  Examples are a survival curve after a coxph model, and certain 
predicted values and residuals.
  \item In some cases, the data cannot be reconstructed.
    \begin{itemize}
      \item The most common is when the data can't be found, itself usually a
        complex function of how R searches for things.  This often occurs, for
        instance, if a call to coxph is within another function, and the 
        coxph formula
        uses both local variables and variables via the \code{data=} option.
      \item The data might be gone (or worse, changed).
    \end{itemize}
\end{itemize}
Personally I only infrequently get caught by this. The solution
is quite simple, which is to refit the model adding the \code{model=TRUE} option.
Some more modern flavors of R may have this arise more often, I do not know.

Why have I chosen this route?  
First, let me point out that I didn't actually ``do'' anything.
For the first half of the survival package's life lm() and glm() did exactly
the same thing. And I will admit that inertia (and a bit of stubborness) is 
certainly one of the reasons for standing pat.

However, by far the largest motivation for not changing is confidentiality. 
I work in a major medical center (Mayo Clinic) on medical research, 
using real patient data.
We take the issue of data confidentiality very seriously, and are quite
careful with respect to where the data is stored and who has access to it. 
(One certain way to be fired at Mayo, and by that I mean ``walked to the exit
by security on that same day'', is to inappropriately access or share patient
data.)
Yet each saved copy of an lm() (or random forest or whatever) model in R
carries a silent, complete, and unencrypted copy of the data used to fit it;
something of which most users remain blissfully unaware. If a model has per
subject random effects or a robust variance, then \code{id clinic} was likely
part of the call, i.e., each patient's personal identifier in our history system
is also in the model data.

This makes me very nervous. Observing the increasingly sophisticated attacks
on our institution's IT structure only adds to it.
Perhaps my stance on the survival package is only one
finger in a very leaky dam, but I'm not ready to join the problem.
I am also fatalistic in suspecting that whomever takes on this package when
I step away will likely make this one of their first changes.  (I'm 73, so not
too far away.)

The second argument against change is size. I first caught on to this over a
decade ago when I got a message that I was consuming far too much of our 
deparment disk space, which puzzled me. It turned out that I had been fitting
several exploratory models to a very very large data set and
saving the results (not survival models) and my .RData had exploded.
Given the constant increase in available data storage this argument is less
cogent than it once was, but I started my career scratching for every byte,
and limited consumption is a hard habit to break once pounded in so deeply.

\end{document}