logo
Example processing data into an e-value sequence for ALL-IN-META-BCG-CORONA
library(devtools)
devtools::install_github("AlexanderLyNL/Safestats", ref = "logrank")
library(safestats)
library(survival)
library(knitr)

An example data set

An example data set in the requested format for the ALL-IN-META-BCG-CORONA is shown below.

Please find the details to generate this toy data set in the complete R markdown file here.

intervention dateRand hospital COV19 dateCOV19 COV19hosp dateCOV19hosp dateLastFup
control 2020-05-07 A yes 2020-05-11 yes 2020-05-15 2020-06-23
control 2020-05-04 B yes 2020-05-08 yes 2020-05-12 2020-06-23
BCG 2020-05-08 A yes 2020-05-21 yes 2020-06-01 2020-06-23
control 2020-05-07 B yes 2020-05-25 no NA 2020-06-23
BCG 2020-05-05 A yes 2020-05-24 no NA 2020-06-23
BCG 2020-05-10 B yes 2020-06-03 no NA 2020-06-23
control 2020-05-14 A yes 2020-06-23 no NA 2020-06-23
control 2020-05-10 B no NA no NA 2020-06-23
BCG 2020-05-08 A no NA no NA 2020-06-23
BCG 2020-05-04 B no NA no NA 2020-06-23

Preprocessing

We need to be sure that all date variables are actual dates, such that we can substract them from each other to obtain event and censoring times.

data$dateRand <- as.Date(data$dateRand, format = "%Y-%m-%d")
data$dateLastFup <- as.Date(data$dateLastFup, format = "%Y-%m-%d")
data$dateCOV19 <- as.Date(data$dateCOV19, format = "%Y-%m-%d")
data$dateCOV19hosp <- as.Date(data$dateCOV19hosp, format = "%Y-%m-%d")

If your dates are formatted in a different way (e.g. "%d-%m-%y" or "%y-%m-%d"), make sure to process them correctly.

Also, make sure that the intervention variable is coded the right way around, with the control group as the first and the treatment group as the second level of the factor. That way, the hazard ratio will be defined as:

\[\begin{equation} hr = \frac{\lambda_{BCG}}{\lambda_{control}} \end{equation}\]

with a hazard ratio less than 1 showing evidence for a benefit of BCG and a hazard ratio greater than 1 showing evidence for harm.

data$intervention <- factor(data$intervention, levels = c("control", "BCG"))

Next we assign a time and status variable as is conventional in survival analysis, coding an event time as status = 2 and a censoring time as status = 1, such that we can use Surv() to assign a survival object.

startDate <- as.Date("2020-03-25")

data$timeCOV19 <- ifelse(data$COV19 %in% c("Yes", "yes"),
                         data$dateCOV19 - startDate,
                         data$dateLastFup - startDate)
data$statusCOV19 <- ifelse(data$COV19 %in% c("Yes", "yes"),
                           2, 1)
data$timeCOV19hosp <- ifelse(data$COV19hosp %in% c("Yes", "yes"),
                             data$dateCOV19hosp - startDate,
                             data$dateLastFup - startDate)
data$statusCOV19hosp <- ifelse(data$COV19hosp %in% c("Yes", "yes"),
                               2, 1)

data$survObjCOV19 <- Surv(time = data$dateRand - startDate, 
                          time2 = data$timeCOV19,
                          event = data$statusCOV19,
                          type = "counting")

## Warning: please do not proceed by creating a Surv() object based on status = 1,2
## when your data set does not contain any events. In case of no events, 
## safeLogRankTest based on a Surv() object will misinterpret all censoring times 
## coded as 1 to be events. In case of no events, your e-value is a neutral 1, 
## and you do not need to perform any further analysis.


data$survObjCOV19hosp <- Surv(time = data$dateRand - startDate, 
                              time2 = data$timeCOV19hosp,
                              event = data$statusCOV19hosp,
                              type = "counting")

The start date (time = 0) of our event and censoring times is the day the first participant of the first study (the Dutch study) was enrolled, which is startDate = 2020-03-25. This is the start date for all studies, see our Statistical Analysis Plan: we use time since the first randomized participant instead of time since a participants’ own randomization. This makes the event-times left-truncated, which is visible in the creation of the survival object of the type counting that includes the time of randomization to define when participants enter the risk set. For more details on this formulation, please refer to our Tutorial on left-truncation.

intervention dateRand hospital COV19 dateCOV19 COV19hosp dateCOV19hosp dateLastFup survObjCOV19 survObjCOV19hosp
control 2020-05-07 A yes 2020-05-11 yes 2020-05-15 2020-06-23 (43,47] (43,51]
control 2020-05-04 B yes 2020-05-08 yes 2020-05-12 2020-06-23 (40,44] (40,48]
BCG 2020-05-08 A yes 2020-05-21 yes 2020-06-01 2020-06-23 (44,57] (44,68]
control 2020-05-07 B yes 2020-05-25 no NA 2020-06-23 (43,61] (43,90+]
BCG 2020-05-05 A yes 2020-05-24 no NA 2020-06-23 (41,60] (41,90+]
BCG 2020-05-10 B yes 2020-06-03 no NA 2020-06-23 (46,70] (46,90+]
control 2020-05-14 A yes 2020-06-23 no NA 2020-06-23 (50,90] (50,90+]
control 2020-05-10 B no NA no NA 2020-06-23 (46,90+] (46,90+]
BCG 2020-05-08 A no NA no NA 2020-06-23 (44,90+] (44,90+]
BCG 2020-05-04 B no NA no NA 2020-06-23 (40,90+] (40,90+]

Specifying the design of the Safe logrank test

To agree with the design shown on the left hand side of the ALL-IN-META-BCG-CORONA dashboard, specify your design objects as follows:

designObjCOV19L <- designSafeLogrank(hrMin = 0.8,
                                     alpha = 0.1*0.025,
                                     alternative = "less",  # one-sided test hr < 1
                                     ratio = 1)
designObjCOV19L
## 
##  Safe Logrank Test Design
## 
##             minimal hazard ratio = 0.8
##                      alternative = less
##           parameter: log(thetaS) = -0.2231436
##                            alpha = 0.0025
## decision rule: e-value > 1/alpha = 400
## 
## Timestamp: 2021-06-28 17:41:11 CEST
designObjCOV19G <- designSafeLogrank(hrMin = 1/0.8,
                                     alpha = 0.1*0.025,
                                     alternative = "greater",  # one-sided test hr > 1
                                     ratio = 1)
designObjCOV19G
## 
##  Safe Logrank Test Design
## 
##             minimal hazard ratio = 1.25
##                      alternative = greater
##           parameter: log(thetaS) = 0.2231436
##                            alpha = 0.0025
## decision rule: e-value > 1/alpha = 400
## 
## Timestamp: 2021-06-28 17:41:11 CEST
designObjCOV19hospL <- designSafeLogrank(hrMin = 0.7,
                                         alpha = 0.9*0.025,
                                         alternative = "less",  # one-sided test hr < 1
                                         ratio = 1)
designObjCOV19hospL
## 
##  Safe Logrank Test Design
## 
##             minimal hazard ratio = 0.7
##                      alternative = less
##           parameter: log(thetaS) = -0.3566749
##                            alpha = 0.0225
## decision rule: e-value > 1/alpha = 44.44444
## 
## Timestamp: 2021-06-28 17:41:11 CEST
designObjCOV19hospG <- designSafeLogrank(hrMin = 1/0.7,
                                         alpha = 0.9*0.025,
                                         alternative = "greater",  # one-sided test hr > 1
                                         ratio = 1)
designObjCOV19hospG
## 
##  Safe Logrank Test Design
## 
##             minimal hazard ratio = 1.428571
##                      alternative = greater
##           parameter: log(thetaS) = 0.3566749
##                            alpha = 0.0225
## decision rule: e-value > 1/alpha = 44.44444
## 
## Timestamp: 2021-06-28 17:41:11 CEST

Calculate the e-values for a complete data set

safeLogrankTest(data$survObjCOV19 ~ data$intervention,
                designObj = designObjCOV19L)
## 
##  Safe Logrank Test
## 
## data:  data$survObjCOV19 by data$intervention (control, BCG). nEvents = 7
## 
## test: log(thetaS) = -0.22314
## e-value = 1.1513 > 1/alpha = 400 : FALSE
## alternative hypothesis: true hazard ratio is less than 1 
## 
## design: the test was designed with alpha = 0.0025
## for minimal relevant hazard ratio = 0.8 (less)
safeLogrankTest(data$survObjCOV19 ~ data$intervention,
                designObj = designObjCOV19G)
## 
##  Safe Logrank Test
## 
## data:  data$survObjCOV19 by data$intervention (control, BCG). nEvents = 7
## 
## test: log(thetaS) = 0.22314
## e-value = 0.79843 > 1/alpha = 400 : FALSE
## alternative hypothesis: true hazard ratio is greater than 1 
## 
## design: the test was designed with alpha = 0.0025
## for minimal relevant hazard ratio = 1.25 (greater)
safeLogrankTest(data$survObjCOV19hosp ~ data$intervention,
                designObj = designObjCOV19hospL)
## 
##  Safe Logrank Test
## 
## data:  data$survObjCOV19hosp by data$intervention (control, BCG). nEvents = 3
## 
## test: log(thetaS) = -0.35667
## e-value = 1.2406 > 1/alpha = 44.444 : FALSE
## alternative hypothesis: true hazard ratio is less than 1 
## 
## design: the test was designed with alpha = 0.0225
## for minimal relevant hazard ratio = 0.7 (less)
safeLogrankTest(data$survObjCOV19hosp ~ data$intervention,
                designObj = designObjCOV19hospG)
## 
##  Safe Logrank Test
## 
## data:  data$survObjCOV19hosp by data$intervention (control, BCG). nEvents = 3
## 
## test: log(thetaS) = 0.35667
## e-value = 0.73506 > 1/alpha = 44.444 : FALSE
## alternative hypothesis: true hazard ratio is greater than 1 
## 
## design: the test was designed with alpha = 0.0225
## for minimal relevant hazard ratio = 1.4286 (greater)

Retrospectively processing a data set into an e-value sequence per calendar time

We define the function procEseq(), based on the rationale of processing an e-value sequence per calendar time detailed in the two tutorials on left-truncation and staggered entry. Our analysis uses the left-truncation approach for our data on a calendar time scale:

procEseq <- function(data, eventName, statusVar, timeVar, randDateVar, eventDateVar, interventionVar,
                     designObjL, designObjG, timeScale = "calendar",
                     startDate, endDate) {
  
  eValuesL <- 
    eValuesG <- structure(rep(1,
                              times = endDate - startDate + 1),
                          names = as.character(seq.Date(from = startDate,
                                                        to = endDate, 
                                                        by = "day")))
  # before you observe any event, your e-value is 1
  # return all 1 sequence if there are no events
  if (sum(data[[statusVar]] == 2) == 0) {
    eValues <- structure(list(eValuesL, eValuesG), 
                         names = c(paste0(eventName, "L"), paste0(eventName, "G")))
    return(eValues)
  }
  
  interimCalDates <- as.character(sort(unique(data[[eventDateVar]])))
  
  # loop trough the calendar dates
  for (calDate in as.character(seq.Date(from = as.Date(interimCalDates[1]),
                                        to = endDate,
                                        by = "day"))) {
    # at dates without an event
    if (!(calDate %in% interimCalDates)) {
      # evidence stays the same as the day before
      eValuesL[calDate] <- eValuesL[as.character(as.Date(calDate) - 1)]         
      eValuesG[calDate] <- eValuesG[as.character(as.Date(calDate) - 1)]
    } else {
      # all participants that enter after today, are not yet in my data set so far
      dataSoFar <- data[data[[randDateVar]] < as.Date(calDate), ]
      
      if (timeScale == "calendar") {
        # I don't know how long participants are at risk whose events happen in the future
        # only that they are still at risk today:
        dataSoFar[[timeVar]] <- pmin(dataSoFar[[timeVar]], as.Date(calDate) - startDate)
        # the status of participants with events in the future is censored:
        dataSoFar[[statusVar]][dataSoFar[[eventDateVar]] > as.Date(calDate)] <- 1
        dataSoFar$survObj <- Surv(time = dataSoFar[[randDateVar]] - startDate, 
                                  time2 = dataSoFar[[timeVar]],
                                  event = dataSoFar[[statusVar]],
                                  type = "counting")
      }
      if (timeScale == "participant") {
        # I don't know how long participants are at risk whose events happen in the future
        # only that they are still at risk today:
        dataSoFar[[timeVar]] <- pmin(dataSoFar[[timeVar]], as.Date(calDate) - dataSoFar[[randDateVar]])
        # the status of participants with events in the future is censored:
        dataSoFar[[statusVar]][dataSoFar[[eventDateVar]] > as.Date(calDate)] <- 1
        dataSoFar$survObj <- Surv(dataSoFar[[timeVar]], dataSoFar[[statusVar]])
      }
      
      eValuesL[calDate] <- safeLogrankTest(dataSoFar$survObj ~ dataSoFar[[interventionVar]],
                                           designObj = designObjL
                                           )$eValue
      eValuesG[calDate] <- safeLogrankTest(dataSoFar$survObj ~ dataSoFar[[interventionVar]],
                                           designObj = designObjG
                                           )$eValue
    }
  }
  eValues <- structure(list(eValuesL, eValuesG), 
                       names = c(paste0(eventName, "L"), paste0(eventName, "G")))
  return(eValues)
}

This function returns a list of two e-value sequences. In the example below: one for each one-sided test based on designObjCOV19hospL and designObjCOV19hospG.

eValues <- procEseq(data = data, eventName = "COV19hosp", 
                    statusVar = "statusCOV19hosp", timeVar = "timeCOV19hosp", 
                    randDateVar = "dateRand", eventDateVar = "dateCOV19hosp", 
                    interventionVar = "intervention",
                    designObjL = designObjCOV19hospL, designObjG = designObjCOV19hospG,
                    timeScale = "calendar",
                    startDate = startDate,  
                    endDate = max(data$dateLastFup, na.rm = T))
str(eValues)
## List of 2
##  $ COV19hospL: Named num [1:91] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr [1:91] "2020-03-25" "2020-03-26" "2020-03-27" "2020-03-28" ...
##  $ COV19hospG: Named num [1:91] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr [1:91] "2020-03-25" "2020-03-26" "2020-03-27" "2020-03-28" ...

Sequence of e-value stratified by hospital

If we have more than one hospital in our data set, we should stratify our Safe logrank e-values by hospital and multiply the sequences:

hospitals <- unique(data$hospital)
names(hospitals) <- as.character(hospitals)

eValuesCOV19.str <- lapply(hospitals, function(hospital) 
  procEseq(data = data[data$hospital == hospital, ],
           eventName = "COV19", 
           statusVar = "statusCOV19", timeVar = "timeCOV19", 
           randDateVar = "dateRand", eventDateVar = "dateCOV19", 
           interventionVar = "intervention",
           designObjL = designObjCOV19L, designObjG = designObjCOV19G,
           timeScale = "calendar",
           startDate = as.Date("2020-03-25"),  # startDate defined above  
           endDate = max(data$dateLastFup, na.rm = T)))

eValuesCOV19hosp.str <- lapply(hospitals, function(hospital) 
  procEseq(data = data[data$hospital == hospital, ],
           eventName = "COV19hosp", 
           statusVar = "statusCOV19hosp", timeVar = "timeCOV19hosp", 
           randDateVar = "dateRand", eventDateVar = "dateCOV19hosp", 
           interventionVar = "intervention",
           designObjL = designObjCOV19hospL, designObjG = designObjCOV19hospG,
           timeScale = "calendar",
           startDate = as.Date("2020-03-25"),  # startDate defined above
           endDate = max(data$dateLastFup, na.rm = T)))

The above use of the function lapply() filters the data set based on hospital and then applies the function procEseq() that processes the data of participants belonging to each hospital into a sequence of e-values. The result is a list of lists: for each hospital in the list, a list of two e-value sequences for the two one-sided tests. Here the hospitals have names A, B.

str(eValuesCOV19.str)
## List of 2
##  $ A:List of 2
##   ..$ COV19L: Named num [1:91] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:91] "2020-03-25" "2020-03-26" "2020-03-27" "2020-03-28" ...
##   ..$ COV19G: Named num [1:91] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:91] "2020-03-25" "2020-03-26" "2020-03-27" "2020-03-28" ...
##  $ B:List of 2
##   ..$ COV19L: Named num [1:91] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:91] "2020-03-25" "2020-03-26" "2020-03-27" "2020-03-28" ...
##   ..$ COV19G: Named num [1:91] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:91] "2020-03-25" "2020-03-26" "2020-03-27" "2020-03-28" ...

We collect all one-sided tests for alternative less (L) and all one-sided tests for alternative greater (G) seperately in a matrix using sapply, shown for the event COV19 and the greater (G) test in the example below.

eValuesCOV19L.matrix <- sapply(hospitals, function(hospital) eValuesCOV19.str[[hospital]]$COV19L)

kable(head(eValuesCOV19L.matrix))
A B
2020-03-25 1 1
2020-03-26 1 1
2020-03-27 1 1
2020-03-28 1 1
2020-03-29 1 1
2020-03-30 1 1
kable(head(eValuesCOV19L.matrix[row.names(eValuesCOV19L.matrix) > as.Date("2020-05-18"), ]))
A B
2020-05-19 1.176471 1.071429
2020-05-20 1.176471 1.071429
2020-05-21 1.107266 1.071429
2020-05-22 1.107266 1.071429
2020-05-23 1.107266 1.071429
2020-05-24 1.022092 1.071429

We do so for each combination of the two one-sided tests and the two events COV19 and COV19hosp, and then multiply each row of e-values for the same date using apply() and prod(), to the rows (MARGIN = 1) to obtain our overall e-value sequence:

eValues <- list("COV19L" = apply(sapply(hospitals, 
                                        function(hospital) eValuesCOV19.str[[hospital]]$COV19L),
                                 MARGIN = 1, FUN = prod, na.rm = T),
                "COV19G" = apply(sapply(hospitals, 
                                        function(hospital) eValuesCOV19.str[[hospital]]$COV19G),
                                 MARGIN = 1, FUN = prod, na.rm = T),
                "COV19hospL" = apply(sapply(hospitals, 
                                        function(hospital) eValuesCOV19hosp.str[[hospital]]$COV19hospL),
                                     MARGIN = 1, FUN = prod, na.rm = T),
                "COV19hospG" = apply(sapply(hospitals, 
                                        function(hospital) eValuesCOV19hosp.str[[hospital]]$COV19hospG),
                                     MARGIN = 1, FUN = prod, na.rm = T))

Plot the e-value sequences such as in the dashboard tab for COV19

plot(NULL, xlim = as.Date(c(as.Date("2020-03-23"), as.Date(Sys.Date()))),
     ylim = c(-5, 9), xaxs = "i",
     ylab = "e-value", xlab = "Calendar days", yaxt = "n", xaxt = "n",
     main = "Test: COVID19 hr < 1, benefit")
axis <- mapply(axis, side = 2, at = c(-5:8, log(1/designObjCOV19L$alpha)/log(2), 2^9), cex.axis = 1,
               labels = c(paste0("1/", 2^(5:1)), 2^(0:7), "", 1/designObjCOV19L$alpha, 2^9), las = 1,
               tick = T, tck = 1, col.ticks = "lightgrey", lty = "dotted")
axis.Date(side = 1, at = seq.Date(from = as.Date("2020-03-23"), to = as.Date(Sys.Date()), by = "week"),
          format = "%g/%m/%d", tick = T, tck = 1, col.ticks = "lightgrey", lty = "dotted")
abline(h = log(1/designObjCOV19L$alpha)/log(2), lty = 2)

lines(x = as.Date(names(eValues$COV19L)), y = log(eValues$COV19L)/log(2),
      'l', col = "darkgreen", lwd = 2)

plot(NULL, xlim = as.Date(c(as.Date("2020-03-23"), as.Date(Sys.Date()))),
     ylim = c(-5, 9), xaxs = "i",
     ylab = "e-value", xlab = "Calendar days", yaxt = "n", xaxt = "n",
     main = "Test: COVID19 hr > 1, harm")
axis <- mapply(axis, side = 2, at = c(-5:8, log(1/designObjCOV19G$alpha)/log(2), 2^9), cex.axis = 1,
               labels = c(paste0("1/", 2^(5:1)), 2^(0:7), "", 1/designObjCOV19G$alpha, 2^9), las = 1,
               tick = T, tck = 1, col.ticks = "lightgrey", lty = "dotted")
axis.Date(side = 1, at = seq.Date(from = as.Date("2020-03-23"), to = as.Date(Sys.Date()), by = "week"),
          format = "%g/%m/%d", tick = T, tck = 1, col.ticks = "lightgrey", lty = "dotted")
abline(h = log(1/designObjCOV19G$alpha)/log(2), lty = 2)

lines(x = as.Date(names(eValues$COV19G)), y = log(eValues$COV19G)/log(2),
      'l', col = "darkgreen", lwd = 2)

Plot the e-value sequences such as in the dashboard tab for COV19hosp

plot(NULL, xlim = as.Date(c(as.Date("2020-03-23"), as.Date(Sys.Date()))),
     ylim = c(-5, 6), xaxs = "i",
     ylab = "e-value", xlab = "Calendar days", yaxt = "n", xaxt = "n",
     main = "Test: COVID19 hospitalization hr < 1, benefit")
axis <- mapply(axis, side = 2, at = c(-5:5, log(1/designObjCOV19hospL$alpha)/log(2), 6), 
               cex.axis = 1, las = 1,
               labels = c(paste0("1/", 2^(5:1)), 2^(0:4), "", round(1/designObjCOV19hospL$alpha), ""), 
               tick = T, tck = 1, col.ticks = "lightgrey", lty = "dotted")
axis.Date(side = 1, at = seq.Date(from = as.Date("2020-03-23"), to = as.Date(Sys.Date()), by = "week"),
          format = "%g/%m/%d", tick = T, tck = 1, col.ticks = "lightgrey", lty = "dotted")
abline(h = log(1/designObjCOV19hospL$alpha)/log(2), lty = 2)
        
lines(x = as.Date(names(eValues$COV19hospL)), y = log(eValues$COV19hospL)/log(2),
      'l', col = "darkgreen", lwd = 2)

plot(NULL, xlim = as.Date(c(as.Date("2020-03-23"), as.Date(Sys.Date()))),
     ylim = c(-5, 6), xaxs = "i",
     ylab = "e-value", xlab = "Calendar days", yaxt = "n", xaxt = "n",
     main = "Test: COVID19 hospitalization hr > 1, harm")
axis <- mapply(axis, side = 2, at = c(-5:5, log(1/designObjCOV19hospG$alpha)/log(2), 6), 
               cex.axis = 1, las = 1,
               labels = c(paste0("1/", 2^(5:1)), 2^(0:4), "", round(1/designObjCOV19hospG$alpha), ""), 
               tick = T, tck = 1, col.ticks = "lightgrey", lty = "dotted")
axis.Date(side = 1, at = seq.Date(from = as.Date("2020-03-23"), to = as.Date(Sys.Date()), by = "week"),
          format = "%g/%m/%d", tick = T, tck = 1, col.ticks = "lightgrey", lty = "dotted")
abline(h = log(1/designObjCOV19hospG$alpha)/log(2), lty = 2)

lines(x = as.Date(names(eValues$COV19hospG)), y = log(eValues$COV19hospG)/log(2),
      'l', col = "darkgreen", lwd = 2)