-
Notifications
You must be signed in to change notification settings - Fork 34
/
Copy path11-survival-analysis-and-censored-data.Rmd
540 lines (423 loc) · 17.9 KB
/
11-survival-analysis-and-censored-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
# Survival Analysis and Censored Data
## Conceptual
### Question 1
> For each example, state whether or not the censoring mechanism is independent.
> Justify your answer.
>
> a. In a study of disease relapse, due to a careless research scientist, all
> patients whose phone numbers begin with the number "2" are lost to follow up.
Independent. There's no reason to think disease relapse should be related to
the first digit of a phone number.
> b. In a study of longevity, a formatting error causes all patient ages that
> exceed 99 years to be lost (i.e. we know that those patients are more than 99
> years old, but we do not know their exact ages).
Not independent. Older patients are more likely to see an event that younger.
> c. Hospital A conducts a study of longevity. However, very sick patients tend
> to be transferred to Hospital B, and are lost to follow up.
Not independent. Sick patients are more likely to see an event that healthy.
> d. In a study of unemployment duration, the people who find work earlier are
> less motivated to stay in touch with study investigators, and therefore are
> more likely to be lost to follow up.
Not independent. More employable individuals are more likely to see an event.
> e. In a study of pregnancy duration, women who deliver their babies pre-term
> are more likely to do so away from their usual hospital, and thus are more
> likely to be censored, relative to women who deliver full-term babies.
Not independent. Delivery away from hospital will be associated with pregnancy
duration.
> f. A researcher wishes to model the number of years of education of the
> residents of a small town. Residents who enroll in college out of town are
> more likely to be lost to follow up, and are also more likely to attend
> graduate school, relative to those who attend college in town.
Not independent. Years of education will be associated with enrolling in out
of town colleges.
> g. Researchers conduct a study of disease-free survival (i.e. time until
> disease relapse following treatment). Patients who have not relapsed within
> five years are considered to be cured, and thus their survival time is
> censored at five years.
In other words we assume all events happen within five years, so
censoring after this time is equivalent to not censoring at all so
the censoring is independent.
> h. We wish to model the failure time for some electrical component. This
> component can be manufactured in Iowa or in Pittsburgh, with no difference in
> quality. The Iowa factory opened five years ago, and so components
> manufactured in Iowa are censored at five years. The Pittsburgh factory opened
> two years ago, so those components are censored at two years.
If there is no difference in quality then location and therefore censoring is
independent of failure time.
> i. We wish to model the failure time of an electrical component made in two
> different factories, one of which opened before the other. We have reason to
> believe that the components manufactured in the factory that opened earlier
> are of higher quality.
In this case, the difference in opening times of the two locations will mean
that any difference in quality between locations will be associated with
censoring, so censoring is not independent.
### Question 2
> We conduct a study with $n = 4$ participants who have just purchased cell
> phones, in order to model the time until phone replacement. The first
> participant replaces her phone after 1.2 years. The second participant still
> has not replaced her phone at the end of the two-year study period. The third
> participant changes her phone number and is lost to follow up (but has not yet
> replaced her phone) 1.5 years into the study. The fourth participant replaces
> her phone after 0.2 years.
>
> For each of the four participants ($i = 1,..., 4$), answer the following
> questions using the notation introduced in Section 11.1:
>
> a. Is the participant's cell phone replacement time censored?
No, Yes, Yes and No. Censoring occurs when we do not know if or when the phone
was replaced.
> b. Is the value of $c_i$ known, and if so, then what is it?
$c_i$ is censoring time. For the four participants these are: NA. 2. 1.5 and NA.
> c. Is the value of $t_i$ known, and if so, then what is it?
$t_i$ is time to event. For the four participants these are: 1.2, NA, NA and
0.2.
> d. Is the value of $y_i$ known, and if so, then what is it?
$y_i$ is the observed time. For the four participants these are: 1.2, 2, 1.5 and
0.2.
> e. Is the value of $\delta_i$ known, and if so, then what is it?
$\delta_i$ is an indicator for censoring. The nomenclature introduced here
defines this to be 1 if we observe the true "survival" time and 0 if we observe
the censored time. Therefore, for these participants, the values are: 1, 0, 0
and 1.
### Question 3
> For the example in Exercise 2, report the values of $K$, $d_1,...,d_K$,
> $r_1,...,r_K$, and $q_1,...,q_K$, where this notation was defined in Section
> 11.3.
* $K$ is the number of unique deaths, which is 2.
* $d_k$ represents the unique death times, which are: 0.2, 1.2.
* $r_k$ denotes the number of patients alive and in the study just before $d_k$.
Note the first event is for patient 4, then patient 1, then patient 3 is
censored and finally the study ends with patient 2 still involved. Therefore
$r_k$ takes values are: 4, 3.
* $q_k$ denotes the number of patients who died at time $d_k$, therefore this
takes values: 1, 1.
We can check by using the `survival` package.
```{r}
library(survival)
x <- Surv(c(1.2, 2, 1.5, 0.2), event = c(1, 0, 0, 1))
summary(survfit(x ~ 1))
```
### Question 4
> This problem makes use of the Kaplan-Meier survival curve displayed in Figure
> 11.9. The raw data that went into plotting this survival curve is given in
> Table 11.4. The covariate column of that table is not needed for this problem.
>
> a. What is the estimated probability of survival past 50 days?
There are 2 events that happen before 50 days. The number at
risk $r_k$ are 5 and 4 (one was censored early on), thus survival probability is
$4/5 * 3/4 = 0.6$.
Equivalently, we can use the survival package.
```{r, message = FALSE, warning = FALSE}
library(tidyverse)
```
```{r}
table_data <- tribble(
~Y, ~D, ~X,
26.5, 1, 0.1,
37.2, 1, 11,
57.3, 1, -0.3,
90.8, 0, 2.8,
20.2, 0, 1.8,
89.8, 0, 0.4
)
x <- Surv(table_data$Y, table_data$D)
summary(survfit(x ~ 1))
```
> b. Write out an analytical expression for the estimated survival function. For
> instance, your answer might be something along the lines of
>
> $$
> \hat{S}(t) = \begin{cases}
> 0.8 & \text{if } t < 31\\
> 0.5 & \text{if } 31 \le t < 77\\
> 0.22 & \text{if } 77 \le t
> \end{cases}
> $$
>
> (The previous equation is for illustration only: it is not the correct
> answer!)
$$
\hat{S}(t) = \begin{cases}
1 & \text{if } t < 26.5 \\
0.8 & \text{if } 26.5 \le t < 37.2 \\
0.6 & \text{if } 37.2 \le t < 57.3 \\
0.4 & \text{if } 57.3 \le t
\end{cases}
$$
### Question 5
> Sketch the survival function given by the equation
>
> $$
> \hat{S}(t) = \begin{cases}
> 0.8, & \text{if } t < 31\\
> 0.5, & \text{if } 31 \le t < 77\\
> 0.22 & \text{if } 77 \le t
> \end{cases}
> $$
>
> Your answer should look something like Figure 11.9.
We can draw this plot, or even engineer data that will generate the required
plot...
```{r}
plot(NULL,
xlim = c(0, 100),
ylim = c(0, 1),
ylab = "Estimated Probability of Survival",
xlab = "Time in Days"
)
lines(
c(0, 31, 31, 77, 77, 100),
c(0.8, 0.8, 0.5, 0.5, 0.22, 0.22)
)
```
### Question 6
> This problem makes use of the data displayed in Figure 11.1. In completing
> this problem, you can refer to the observation times as $y_1,...,y_4$. The
> ordering of these observation times can be seen from Figure 11.1; their exact
> values are not required.
>
> a. Report the values of $\delta_1,...,\delta_4$, $K$, $d_1,...,d_K$,
> $r_1,...,r_K$, and $q_1,...,q_K$. The relevant notation is defined in Sections
> 11.1 and 11.3.
* $\delta$ values are: 1, 0, 1, 0.
* $K$ is 2
* $d$ values are $y_3$ and $y_1$.
* $r$ values are 4 and 2.
* $q$ values are 1 and 1.
> b. Sketch the Kaplan-Meier survival curve corresponding to this data set. (You
> do not need to use any software to do this---you can sketch it by hand using
> the results obtained in (a).)
```{r}
plot(NULL,
xlim = c(0, 350),
ylim = c(0, 1),
ylab = "Estimated Probability of Survival",
xlab = "Time in Days"
)
lines(
c(0, 150, 150, 300, 300, 350),
c(1, 1, 0.75, 0.75, 0.375, 0.375)
)
```
x <- Surv(c(300, 350, 150, 250), c(1, 0, 1, 0))
> c. Based on the survival curve estimated in (b), what is the probability that
> the event occurs within 200 days? What is the probability that the event does
> not occur within 310 days?
0.25 and 0.375.
> d. Write out an expression for the estimated survival curve from (b).
$$
\hat{S}(t) = \begin{cases}
1 & \text{if } t < y_3 \\
0.75 & \text{if } y_3 \le t < y_1 \\
0.375 & \text{if } y_1 \le t
\end{cases}
$$
### Question 7
> In this problem, we will derive (11.5) and (11.6), which are needed for the
> construction of the log-rank test statistic (11.8). Recall the notation in
> Table 11.1.
>
> a. Assume that there is no difference between the survival functions of the
> two groups. Then we can think of $q_{1k}$ as the number of failures if we draw
> $r_{1k} observations, without replacement, from a risk set of $r_k$
> observations that contains a total of $q_k$ failures. Argue that $q_{1k}$
> follows a hypergeometric distribution. Write the parameters of this
> distribution in terms of $r_{1k}$, $r_k$, and $q_k$.
A hypergeometric distributions models sampling without replacement from a finite
pool where each sample is a success or failure. This fits the situation here,
where with have a finite number of samples in the risk set.
The hypergeometric distribution is parameterized as $k$ successes in $n$ draws, without replacement, from a population of size $N$ with $K$ objects with that feature.
Mapping to our situation, $q_{1k}$ is $k$, $r_{1k}$ is $n$, $r_k$ is $N$ and $q_k$ is $K$.
> b. Given your previous answer, and the properties of the hypergeometric
> distribution, what are the mean and variance of $q_{1k}$? Compare your answer
> to (11.5) and (11.6).
With the above parameterization, the mean ($n K/N$) is $r_{1k} q_k/r_K$.
The variance $n K/N (N-K)/N (N-n)/(N-1)$ is
$$
r_{1k} \frac{q_k}{r_k} \frac{r_k-q_k}{r_k} \frac{r_k - r_{1k}}{r_k - 1}
$$
These are equivalent to 11.5 and 11.6.
### Question 8
> Recall that the survival function $S(t)$, the hazard function $h(t)$, and the
> density function $f(t)$ are defined in (11.2), (11.9), and (11.11),
> respectively. Furthermore, define $F(t) = 1 - S(t)$. Show that the following
> relationships hold:
>
> $$
> f(t) = dF(t)/dt \\
> S(t) = \exp\left(-\int_0^t h(u)du\right)
> $$
If $F(t) = 1 - S(t)$, then $F(t)$ is the *cumulative density function* (cdf)
for $t$.
For a continuous distribution, a cdf, e.g. $F(t)$ can be expressed as an
integral (up to some value $x$) of the *probability density function* (pdf),
i.e. $F(t) = \int_{-\infty}^x f(x) dt$. Equivalently, the derivative of the cdf
is its pdf: $f(t) = \frac{d F(t)}{dt}$.
Then,
$h(t) = \frac{f(t)}{S(t)} = \frac{dF(t)/dt}{S(t)} = \frac{-dS(t)/dt}{S(t)}$.
From basic calculus, this can be rewritten as $h(t) = -\frac{d}{dt}\log{S(t)}$.
Integrating and then exponentiating we get the second identity.
### Question 9
> In this exercise, we will explore the consequences of assuming that the
> survival times follow an exponential distribution.
>
> a. Suppose that a survival time follows an $Exp(\lambda)$ distribution, so
> that its density function is $f(t) = \lambda\exp(-\lambda t)$. Using the
> relationships provided in Exercise 8, show that $S(t) = \exp(-\lambda t)$.
The cdf of an exponential distribution is $1 - \exp(-\lambda x)$ and
$S(t)$ is $1 - F(t)$ where $F(t)$ is the cdf.
Hence, $S(t) = \exp(-\lambda t)$.
> b. Now suppose that each of $n$ independent survival times follows an
> $\exp(\lambda)$ distribution. Write out an expression for the likelihood
> function (11.13).
The reference to (11.13) gives us the following formula:
$$
L = \prod_{i=1}^{n} h(y_i)^{\delta_i} S(y_i)
$$
(11.10) also gives us
$$
h(t) = \frac{f(t)}{S(t)}
$$
Plugging in the expressions from part (a), we get
\begin{align*}
h(t) &= \frac{\lambda \exp(- \lambda t)}{\exp(- \lambda t)} \\
&= \lambda
\end{align*}
Using (11.13), we get the following loss expression:
$$
\ell = \prod_i \lambda^{\delta_i} e^{- \lambda y_i}
$$
> c. Show that the maximum likelihood estimator for $\lambda$ is
> $$
> \hat\lambda = \sum_{i=1}^n \delta_i / \sum_{i=1}^n y_i.
> $$
Take the log likelihood.
\begin{align*}
\log \ell &= \sum_i \log \left( \lambda^{\delta_i} e^{- \lambda y_i} \right) \\
&= \sum_i{\delta_i\log\lambda - \lambda y_i \log e} \\
&= \sum_i{\delta_i\log\lambda - \lambda y_i} \\
&= \log\lambda\sum_i{\delta_i} - \lambda\sum_i{y_i}
\end{align*}
Differentiating this expression with respect to $\lambda$ we get:
$$
\frac{d \log \ell}{d \lambda} = \frac{\sum_i{\delta_i}}{\lambda} - \sum_i{y_i}
$$
This function maximises when its gradient is 0. Solving for this gives a MLE of
$\hat\lambda = \sum_{i=1}^n \delta_i / \sum_{i=1}^n y_i$.
> d. Use your answer to (c) to derive an estimator of the mean survival time.
>
> _Hint: For (d), recall that the mean of an $Exp(\lambda)$ random variable is
> $1/\lambda$._
Estimated mean survival would be $1/\lambda$ which given the above would be
$\sum_{i=1}^n y_i / \sum_{i=1}^n \delta_i$, which can be thought of as
the total observation time over the total number of deaths.
## Applied
### Question 10
> This exercise focuses on the brain tumor data, which is included in the
> `ISLR2` `R` library.
>
> a. Plot the Kaplan-Meier survival curve with ±1 standard error bands, using
> the `survfit()` function in the `survival` package.
```{r}
library(ISLR2)
x <- Surv(BrainCancer$time, BrainCancer$status)
plot(survfit(x ~ 1),
xlab = "Months",
ylab = "Estimated Probability of Survival",
col = "steelblue",
conf.int = 0.67
)
```
> b. Draw a bootstrap sample of size $n = 88$ from the pairs ($y_i$,
> $\delta_i$), and compute the resulting Kaplan-Meier survival curve. Repeat
> this process $B = 200$ times. Use the results to obtain an estimate of the
> standard error of the Kaplan-Meier survival curve at each timepoint. Compare
> this to the standard errors obtained in (a).
```{r}
plot(survfit(x ~ 1),
xlab = "Months",
ylab = "Estimated Probability of Survival",
col = "steelblue",
conf.int = 0.67
)
fit <- survfit(x ~ 1)
dat <- tibble(time = c(0, fit$time))
for (i in 1:200) {
y <- survfit(sample(x, 88, replace = TRUE) ~ 1)
y <- tibble(time = c(0, y$time), "s{i}" := c(1, y$surv))
dat <- left_join(dat, y, by = "time")
}
res <- fill(dat, starts_with("s")) |>
rowwise() |>
transmute(sd = sd(c_across(starts_with("s"))))
se <- res$sd[2:nrow(res)]
lines(fit$time, fit$surv - se, lty = 2, col = "red")
lines(fit$time, fit$surv + se, lty = 2, col = "red")
```
> c. Fit a Cox proportional hazards model that uses all of the predictors to
> predict survival. Summarize the main findings.
```{r}
fit <- coxph(Surv(time, status) ~ sex + diagnosis + loc + ki + gtv + stereo, data = BrainCancer)
fit
```
`diagnosisHG` and `ki` are highly significant.
> d. Stratify the data by the value of `ki`. (Since only one observation has
> `ki=40`, you can group that observation together with the observations that
> have `ki=60`.) Plot Kaplan-Meier survival curves for each of the five strata,
> adjusted for the other predictors.
To adjust for other predictors, we fit a model that includes those predictors
and use this model to predict new, artificial, data where we allow `ki` to
take each possible value, but set the other predictors to be the mode or mean
of the other predictors.
```{r}
library(ggfortify)
modaldata <- data.frame(
sex = rep("Female", 5),
diagnosis = rep("Meningioma", 5),
loc = rep("Supratentorial", 5),
ki = c(60, 70, 80, 90, 100),
gtv = rep(mean(BrainCancer$gtv), 5),
stereo = rep("SRT", 5)
)
survplots <- survfit(fit, newdata = modaldata)
plot(survplots, xlab = "Months", ylab = "Survival Probability", col = 2:6)
legend("bottomleft", c("60", "70", "80", "90", "100"), col = 2:6, lty = 1)
```
### Question 11
> This example makes use of the data in Table 11.4.
>
> a. Create two groups of observations. In Group 1, $X < 2$, whereas in Group 2,
> $X \ge 2$. Plot the Kaplan-Meier survival curves corresponding to the two
> groups. Be sure to label the curves so that it is clear which curve
> corresponds to which group. By eye, does there appear to be a difference
> between the two groups' survival curves?
```{r}
x <- split(Surv(table_data$Y, table_data$D), table_data$X < 2)
plot(NULL, xlim = c(0, 100), ylim = c(0, 1), ylab = "Survival Probability")
lines(survfit(x[[1]] ~ 1), conf.int = FALSE, col = 2)
lines(survfit(x[[2]] ~ 1), conf.int = FALSE, col = 3)
legend("bottomleft", c(">= 2", "<2"), col = 2:3, lty = 1)
```
There does not appear to be any difference between the curves.
> b. Fit Cox's proportional hazards model, using the group indicator as a
> covariate. What is the estimated coefficient? Write a sentence providing the
> interpretation of this coefficient, in terms of the hazard or the
> instantaneous probability of the event. Is there evidence that the true
> coefficient value is non-zero?
```{r}
fit <- coxph(Surv(Y, D) ~ X < 2, data = table_data)
fit
```
The coefficient is $0.3401$. This implies a slightly increased hazard when
$X < 2$ but it is not significantly different to zero (P = 0.8).
> c. Recall from Section 11.5.2 that in the case of a single binary covariate,
> the log-rank test statistic should be identical to the score statistic for the
> Cox model. Conduct a log-rank test to determine whether there is a difference
> between the survival curves for the two groups. How does the p-value for the
> log-rank test statistic compare to the $p$-value for the score statistic for
> the Cox model from (b)?
```{r}
summary(fit)$sctest
survdiff(Surv(Y, D) ~ X < 2, data = table_data)$chisq
```
They are identical.