Tasks & Evaluation

There are three tasks this year:

  1. Task 1: Primary tumor (GTVp) and lymph nodes (GTVn) detection and segmentation in PET/CT images.
  2. Task 2: Recurrence-Free Survival (RFS) prediction relying on PET/CT images, available clinical information, and/or radiotherapy planning dose maps.
  3. Task 3: HPV status diagnosis using PET/CT images and/or available clinical information.

Evaluation of Task 1 - Detection and Segmentation

Algorithms producing fully automatic detection and delineation of the test cases’ primary tumors and lymph nodes will be assessed.

  1. Predicted segmentation masks must have the same resolution as the CT image; masks at a different resolution will not be resampled. The expected values are 1 for the predicted GTVp, 2 for GTVn, and 0 for the background.
  2. Teams will be ranked separately on GTVp and on GTVn, and the final ranking is the Borda count of these two rankings:
       (1) The GTVp ranking is based on the mean Dice Similarity Coefficient (DSC) across all patients.
       (2) The GTVn ranking is itself the Borda count of two rankings:
            a) GTVn segmentation ranking: Aggregated DSC on GTVn, similar to HECKTOR 2022, adapted from the Aggregated Jaccard Index in [Kumar et al. 2017]
            b) GTVn detection ranking: Aggregated F1-score on GTVn
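
The nested Borda count described above can be illustrated with a minimal sketch. This is not the organizers' evaluation code: the team names and scores are made up, the helper names `rank` and `borda` are hypothetical, and the handling of ties (tied teams share the average rank) is an assumption that the text above does not specify.

```python
def rank(scores, higher_is_better=True):
    """Return {team: rank}, 1 = best; tied teams share the average rank (assumption)."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of teams tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average = (i + j + 2) / 2  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = average
        i = j + 1
    return ranks


def borda(*rankings):
    """Sum each team's ranks over several rankings (lower total = better)."""
    return {team: sum(r[team] for r in rankings) for team in rankings[0]}


# Made-up scores for three hypothetical teams.
gtvp_mean_dsc = {"A": 0.80, "B": 0.78, "C": 0.75}
gtvn_dsc_agg = {"A": 0.70, "B": 0.74, "C": 0.72}
gtvn_f1_agg = {"A": 0.60, "B": 0.66, "C": 0.64}

# Inner Borda count: combine the two GTVn rankings into one GTVn ranking.
gtvn_points = borda(rank(gtvn_dsc_agg), rank(gtvn_f1_agg))
gtvn_rank = rank(gtvn_points, higher_is_better=False)

# Outer Borda count: combine the GTVp ranking with the GTVn ranking.
gtvp_rank = rank(gtvp_mean_dsc)
final_points = borda(gtvp_rank, gtvn_rank)
winner = min(final_points, key=final_points.get)
print(final_points, winner)
```

In this toy example team A has the best GTVp score but the worst GTVn scores, so team B, which is second on GTVp and first on GTVn, obtains the lowest rank sum.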

For GTVn, we will compute both the aggregated DSC (similar to HECKTOR 2022) and an aggregated F1-score for detection. True positive (TP), false negative (FN), and false positive (FP) lesion detections are counted and accumulated over the entire test set, where a lesion counts as detected when the intersection over union (IoU) between prediction and ground truth exceeds 30%. The aggregated F1-score over the entire test set is then: F1agg = 2TP / (2TP + FP + FN).
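
As an illustration of the two aggregated GTVn metrics, here is a minimal sketch in plain Python. It rests on several assumptions that are not stated above: lesions are represented as sets of voxel indices (already separated into individual lesions), the aggregated DSC is taken as 2 Σ|A ∩ B| / Σ(|A| + |B|) with the sums running over all patients, and each ground-truth lesion can be matched to at most one predicted lesion, chosen greedily. The function names are hypothetical and the official evaluation code may differ.

```python
def aggregated_dsc(cases):
    """cases: list of (pred_voxels, gt_voxels) set pairs, one pair per patient."""
    intersection = sum(len(p & g) for p, g in cases)
    total = sum(len(p) + len(g) for p, g in cases)
    return 2 * intersection / total if total else float("nan")


def iou(a, b):
    """Intersection over union of two voxel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def count_detections(pred_lesions, gt_lesions, threshold=0.3):
    """Return (TP, FP, FN) for one patient using greedy one-to-one matching."""
    unmatched = list(range(len(gt_lesions)))
    tp = 0
    for pred in pred_lesions:
        # Pick the still-unmatched ground-truth lesion with the highest IoU.
        best = max(unmatched, key=lambda i: iou(pred, gt_lesions[i]), default=None)
        if best is not None and iou(pred, gt_lesions[best]) > threshold:
            tp += 1
            unmatched.remove(best)
    return tp, len(pred_lesions) - tp, len(unmatched)


def aggregated_f1(counts):
    """counts: list of per-patient (TP, FP, FN); accumulate first, then compute F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else float("nan")


# Toy test set: patient 1 has two true nodes, patient 2 has none.
patients = [
    {"pred": [{1, 2, 3, 4}, {20}], "gt": [{0, 1, 2, 3}, {10, 11}]},
    {"pred": [{5}], "gt": []},
]
counts = [count_detections(p["pred"], p["gt"]) for p in patients]
cases = [(set().union(*p["pred"]), set().union(*p["gt"])) for p in patients]
f1_agg = aggregated_f1(counts)
dsc_agg = aggregated_dsc(cases)
print(counts, f1_agg, dsc_agg)
```

Because the counts and overlaps are accumulated over the whole test set before the ratio is taken, a patient without any ground-truth lymph node (patient 2 here) still contributes its false positives instead of producing an undefined per-patient score.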

Combining these metrics evaluates both detection accuracy and segmentation quality for each tumor structure.

Evaluation of Task 2 - Prognosis

Algorithms producing fully automated Recurrence-Free Survival (RFS) prediction of the test cases will be assessed.

The ranking will be based on the Concordance index (C-index) value obtained in the test cohort. The participant with the highest C-index wins. A bootstrap will be performed on the test set to evaluate the variance of the C-index of each team and for statistical comparison of algorithm results. This bootstrap approach provides confidence intervals for each team's performance, allowing us to determine whether differences between top-performing teams are statistically significant.
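
The C-index and its bootstrap can be sketched as follows. This is a hedged illustration only: it assumes Harrell's C-index, assumes that a higher predicted risk score means an earlier recurrence, and uses a simple percentile interval with a hypothetical number of resamples (`n_boot=1000`). The data are made up, and the organizers' exact implementation is not given above.

```python
import random


def c_index(times, events, risks):
    """Harrell's C: share of comparable pairs in which the patient with the
    earlier observed event has the higher predicted risk (ties count 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


def bootstrap_ci(times, events, risks, n_boot=1000, seed=0):
    """Percentile 95% interval of the C-index over resampled test sets."""
    rng = random.Random(seed)
    n = len(times)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample patients with replacement
        c = c_index([times[i] for i in idx],
                    [events[i] for i in idx],
                    [risks[i] for i in idx])
        if c == c:  # skip resamples with no comparable pair (NaN)
            values.append(c)
    values.sort()
    low = values[int(0.025 * (len(values) - 1))]
    high = values[int(0.975 * (len(values) - 1))]
    return low, high


# Made-up follow-up times (months), event flags (1 = recurrence), risk scores.
times = [5, 10, 12, 20, 25, 30]
events = [1, 1, 0, 1, 0, 0]
risks = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]

c = c_index(times, events, risks)
low, high = bootstrap_ci(times, events, risks)
print(round(c, 3), (low, high))
```

In this toy cohort 9 of the 11 comparable pairs are ordered correctly, so the C-index is 9/11. To compare two teams, one option is to reuse the same resampled indices for both and examine the distribution of the C-index difference.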

Evaluation of Task 3 - Classification

Algorithms producing fully automated diagnoses of HPV status (positive or negative) of the test cases will be assessed.

The ranking will be based on the balanced accuracy (the average of sensitivity and specificity) obtained in the test cohort. The participant with the highest balanced accuracy wins. In the case of ties, the best specificity wins. Balanced accuracy weights both classes equally, so the ranking remains fair even if the test set is imbalanced.
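
A minimal sketch of this ranking rule, using made-up labels and three hypothetical teams (1 = HPV positive, 0 = HPV negative). The function names are illustrative, and returning 0.0 for a class that is absent from the test set is an assumption made only to keep the sketch runnable.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


def balanced_accuracy(y_true, y_pred):
    sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
    return (sensitivity + specificity) / 2


# Imbalanced toy test set: six HPV-positive and two HPV-negative cases.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
predictions = {
    "A": [1, 1, 1, 1, 1, 1, 1, 1],  # always predicts positive
    "B": [1, 1, 1, 0, 0, 0, 0, 0],
    "C": [1, 1, 1, 1, 1, 1, 0, 1],
}

scores = {}
for team, y_pred in predictions.items():
    specificity = sensitivity_specificity(y_true, y_pred)[1]
    scores[team] = (balanced_accuracy(y_true, y_pred), specificity)

# Sort by balanced accuracy, then by specificity to break ties (both descending).
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Team A reaches a plain accuracy of 0.75 by always predicting the majority class, yet its balanced accuracy is only 0.5. Teams B and C tie at a balanced accuracy of 0.75, and B is ranked first because its specificity is higher.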

REFERENCES

[Kumar et al. 2017] Kumar N, et al. "A dataset and a technique for generalized nuclear segmentation for computational pathology." IEEE Transactions on Medical Imaging, 36(7): 1550-1560 (2017).