Radiomics feature reliability assessed by intraclass correlation coefficient: a systematic review

Cindy Xue; Jing Yuan; Gladys G. Lo; Amy T. Y. Chang; Darren M. C. Poon; Oi Lei Wong; Yihang Zhou; Winnie C. W. Chu

doi:10.21037/qims-21-86

Review Article

Cindy Xue¹,², Jing Yuan¹^, Gladys G. Lo³, Amy T. Y. Chang⁴, Darren M. C. Poon⁴, Oi Lei Wong¹, Yihang Zhou¹, Winnie C. W. Chu²^

¹Medical Physics and Research Department, Hong Kong Sanatorium & Hospital, Happy Valley, Hong Kong, China; ²Department of Imaging and Interventional Radiology, Faculty of Medicine, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, China; ³Department of Diagnostic & Interventional Radiology, Hong Kong Sanatorium & Hospital, Happy Valley, Hong Kong, China; ⁴Comprehensive Oncology Centre, Hong Kong Sanatorium & Hospital, Happy Valley, Hong Kong, China

Contributions: (I) Conception and design: J Yuan, C Xue; (II) Administrative support: J Yuan, C Xue, OL Wong, Y Zhou; (III) Provision of study materials or patients: J Yuan, C Xue, OL Wong; (IV) Collection and assembly of data: J Yuan, C Xue, OL Wong, Y Zhou; (V) Data analysis and interpretation: All authors; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

^ORCID: Jing Yuan, 0000-0001-8112-3608; Winnie C. W. Chu, 0000-0003-4962-4132.

Correspondence to: Jing Yuan, PhD. 8/F, Li Shu Fan Block, Hong Kong Sanatorium & Hospital, 2 Village Road, Happy Valley, Hong Kong, China. Email: jyuanbwh@gmail.com.

Abstract: Radiomics research has grown rapidly in recent years, but concerns about radiomics reliability have grown with it. This review provides an updated overview of the current status of radiomics reliability research in the ever-expanding medical literature from the perspective of a single reliability metric, the intraclass correlation coefficient (ICC). The systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. After literature search and selection, a total of 481 radiomics studies using CT, PET, or MRI, covering a wide range of subject and disease types, were included for review. In these highly heterogeneous studies, feature reliability to image segmentation was investigated much more often than reliability to other factors, such as image acquisition, reconstruction, post-processing, and feature quantification. The reported ICCs also suggested high radiomics feature reliability to image segmentation. Based on the reported ICC values, image acquisition was found to introduce much more feature variability than image segmentation, in particular for MRI. Image post-processing and feature quantification yielded different levels of radiomics reliability and might be used to mitigate image acquisition-induced variability. Some common flaws and pitfalls in ICC use were identified, and suggestions for better ICC use are given. Due to the extremely high study heterogeneity and possible risks of bias, the degree of radiomics feature reliability that has been achieved could not yet be safely synthesized or derived in this review. More research on radiomics reliability is warranted.

Keywords: Radiomics; reliability; intraclass correlation coefficient (ICC); quantitative imaging; oncology

Submitted Jan 22, 2021. Accepted for publication May 17, 2021.

Introduction

Radiomics has become one of the most popular research areas in medical imaging, in particular for clinical oncology, since its first introduction by Lambin et al. in 2012 (1). According to Gillies et al., radiomics is defined as “the conversion of images to higher dimensional data and the subsequent mining of these data for improved decision support” (2). These higher-dimensional data are normally understood as the information contained in a large number of quantitative radiomics features derived from the original or transformed medical images; these features are usually artificially engineered with mathematical definitions and take continuous values. Using models built on selected radiomics features, radiomics promises to improve diagnostic accuracy and precision, prognosis assessment, and therapy response prediction for different clinical applications, bridging medical imaging and personalized medicine (3,4). A tremendous number of papers on radiomics have been published in recent years (5). However, despite the promising results reported, the broad validity and generality of radiomics are still much hindered by concerns about its reliability (6-11). Variability and uncertainty can be introduced at many steps of the complicated radiomics workflow. These steps include, but are not limited to, imaging hardware configuration, patient setup, image acquisition, reconstruction, image post-processing (filtering, segmentation, registration, etc.), radiomics feature quantification (such as feature definition, calculation settings like image discretization, software implementation, calculation result harmonization, etc.), and radiomics modeling.

The term reliability is commonly used alongside other terms like agreement, repeatability, stability, reproducibility, accuracy/precision, and robustness, with varying degrees of consistency in the medical literature. In this study, a general and mathematically expressible definition of reliability (R) is adopted: the extent to which measurements can be replicated. It is expressed as the ratio of the true (error-free) variance (σ_T²) to the sum of the true variance and the error variance (σ_E²), i.e., R=σ_T²/(σ_T² + σ_E²). This definition of reliability is compliant with the classical definition of the intraclass correlation coefficient (ICC), which uses the between-subject variance in the trait of interest to represent the true variance, since the true variance cannot be directly measured in reality.

$\mathrm{ICC} = \frac{\text{between-subject variance}}{\text{between-subject variance} + \text{within-subject measurement variance}}$ [1]
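As a worked illustration of the ratio above, using purely hypothetical variances (not taken from any included study), the same error variance can yield very different reliability depending on how heterogeneous the subjects are:

$R = \frac{9}{9 + 1} = 0.90$ (σ_T² = 9, σ_E² = 1)

$R = \frac{1}{1 + 1} = 0.50$ (σ_T² = 1, σ_E² = 1)

This suggests that an ICC is best read together with the cohort in which it was estimated, since a more homogeneous cohort lowers the ratio even when the measurement error is unchanged.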

ICC is one of the most widely adopted reliability indexes based on the analysis of variance (ANOVA) in the medical literature (12). ICC is applicable to all radiomics features that have continuous values. In addition, ICC is a ratio index ranging from 0 to 1, so it is useful for cross-study reliability comparison. For these reasons, ICC was chosen in this study as the single statistical metric for radiomics feature reliability assessment. We adopted the ICC forms of the McGraw and Wong convention, which comprise three components: model (one-way or two-way, random-effects or mixed-effects), type (single or multiple measurements/raters), and definition (absolute agreement or consistency), following the guideline proposed by Koo et al. (13).
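To make the "definition" component concrete, the following is a minimal sketch (not code from any reviewed study) of the single-measurement two-way ICCs in McGraw and Wong's notation, ICC(C,1) for consistency and ICC(A,1) for absolute agreement. The function name `icc_two_way` and the toy ratings are assumptions introduced for illustration only.

```python
from statistics import mean

def icc_two_way(ratings):
    """Single-measurement two-way ICCs for an n-subjects x k-raters table.

    Returns (icc_consistency, icc_agreement), i.e. ICC(C,1) and ICC(A,1)
    in McGraw and Wong's notation.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Mean squares of the two-way ANOVA without replication.
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)  # subjects
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)  # raters
    sse = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    mse = sse / ((n - 1) * (k - 1))                               # residual

    consistency = (msr - mse) / (msr + (k - 1) * mse)
    agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return consistency, agreement

# Hypothetical feature values: the second rater reads exactly 1 unit higher.
ratings = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
icc_c, icc_a = icc_two_way(ratings)
print(f"ICC(C,1) = {icc_c:.3f}, ICC(A,1) = {icc_a:.3f}")
# ICC(C,1) = 1.000, ICC(A,1) = 0.667
```

In this toy case the systematic offset between the two raters leaves the consistency ICC at 1 but lowers the absolute-agreement ICC, which is why the selected form needs to be stated alongside any reported value.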

The reliability of radiomics has to be carefully and rigorously measured and assessed prior to real clinical deployment, but generic radiomics reliability has not yet been fully explored and is therefore not well known. An excellent systematic review on the repeatability and reproducibility of radiomics features was published in 2018 (14), in which the qualitative synthesis of 41 studies revealed the status of radiomics reliability research up to April 2017. Since then, the status of radiomics reliability research has not been updated in step with the ever faster increasing number of radiomics publications. Little strongly evidenced consensus has been reached and widely acknowledged so far.

Thus, this review attempts to serve several purposes: (I) to provide a timely updated overview of the current status of radiomics reliability research, mainly from the perspective of ICC use in the medical literature; (II) to survey what ICC was used for, and what the major findings on radiomics reliability were as revealed by the reported ICCs; (III) to critically review how ICC was used, reported, and interpreted; (IV) to give some suggestions on ICC use to mitigate the flaws and pitfalls, if applicable, so as to improve radiomics reliability assessment in future studies.

Methods

Systematic search strategy

The major research question for the literature search was: “What are the known radiomics studies that used ICC as a radiomics feature reliability index, and reported the quantitative ICC results (as either major or secondary outcome)?” Accordingly, a comprehensive literature search was conducted by two authors (JY and CX) to identify the relevant published studies in the MEDLINE/PubMed database (National Center for Biotechnology Information, NCBI), from 1 January 2012 to 8 December 2020 (ePub date).

A combination of the following terms and their common variations was used for the literature search: “CT/PET/MRI”, “radiomics/radiomic/texture analysis/quantitative (heterogeneity) feature”, and “ICC/intraclass correlation”. Imaging modalities other than CT, PET, and MRI, such as ultrasound, X-ray, cone-beam CT, and megavoltage CT, were not included in this search due to their relative minority and immaturity in radiomics research. Image analysis based solely on the gray-level histogram, i.e., histogram analysis, does not provide any voxel positional/distributive information on the images, so it was not included.

Study selection

Only full-text journal or conference articles written in English were eligible and included. Conference abstracts, case reports, (systematic) reviews, editorials/commentaries, expert opinion papers, and non-English papers were excluded from selection.

After article type exclusion, all publications in the search results that involved the use of ICC for feature reliability assessment were identified through full-text (and Supplementary materials if needed) examination. If a study mentioned ICC use in the methods but reported no ICC results, it was also excluded.

Three authors (JY, CX, and OW) worked jointly on the study selection procedures as described above. Disagreements were resolved by consensus. Reasons for exclusion were documented.

Data extraction

Four authors (JY, CX, OW, and YZ) jointly performed record extraction. The study information on publication date, imaging modality, study design, study subject (phantom, animal, or human), organ, disease, and radiomics feature type/number was extracted. In terms of ICC use and reporting, the purpose of using ICC, the sample size for ICC calculation, ICC form, ICC reporting format, and major ICC results were extracted. Despite the high heterogeneity of ICC result reporting in different studies, we attempted as much as possible to extract, synthesize, or harmonize the ICC results in the form of the satisfactory feature rate (SFR), i.e., the percentage of the total investigated features that showed satisfactory ICC (as determined by the excellent, good, or other ICC criteria in each individual publication). In this way, cross-study ICC comparison might become feasible to some extent. The radiomics quality score (RQS) of the extracted studies was not individually appraised since RQS might not be applicable to many of these studies because they were not completely clinical application studies (4). The quality of ICC use and reporting was not scored either. QUADAS-2 was not applied for study appraisal either since diagnostic accuracy was not the common purpose of the included radiomics studies (15).
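The SFR described above reduces to a simple proportion. The sketch below is illustrative only: the function name, the feature names, the ICC values, and the 0.75 cut-off are all assumptions, and in the review the cut-off is whatever each individual publication adopted.

```python
def satisfactory_feature_rate(icc_by_feature, threshold=0.75):
    """Percentage of features whose ICC exceeds the study's own threshold."""
    if not icc_by_feature:
        raise ValueError("no ICC values supplied")
    n_ok = sum(1 for icc in icc_by_feature.values() if icc > threshold)
    return 100.0 * n_ok / len(icc_by_feature)

# Hypothetical per-feature ICCs from a single study.
iccs = {
    "shape_volume": 0.95,
    "fo_mean": 0.88,
    "glcm_contrast": 0.71,
    "glrlm_sre": 0.42,
}
print(satisfactory_feature_rate(iccs, threshold=0.75))  # 50.0
```

Because the threshold is study-specific, two studies with identical ICC distributions can report different SFRs, which is one reason the cross-study comparison is feasible only "to some extent".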

Outcomes and prioritizations

The primary outcome of interest in this review was radiomics feature reliability in different aspects as assessed by ICC. Quality of ICC use and reporting was the secondary outcome. Other statistical metrics used in combination with ICC were only noted but not further analyzed. The outcome was not prioritized by specific imaging modality or disease type.

Risk of bias analysis

Two authors (CX and JY) jointly assessed, by consensus, the possible risk of bias in the included studies using the extracted study information, from the following perspectives: (I) study characteristics such as the study design (retrospective or prospective), cohort, sample size, and feature number; (II) appropriateness of methodology, and sufficiency of method description and disclosure, such as the details of image acquisition, post-processing, segmentation, as well as feature definition (standardization) and quantification; (III) the quality of ICC use and reporting, such as the ICC form selection, confidence interval reporting, threshold values, and interpretation.

The TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) Statement (16) was not applied for two major reasons. First, it was not applicable for many studies because the investigation of radiomics feature reliability did not necessarily lead to the final prognosis or diagnosis performance report. Second, the clinical purposes were beyond the scope of this review, and they were also highly heterogeneous, including but not limited to prognosis or diagnosis.

Results

Literature search and selection

The PubMed search yielded 2,596 records. Records were reduced to 2,580 after duplicates were removed. The subsequent article type, title, and abstract screening excluded 182 records. Then the remaining 2,398 records underwent full text (and Supplementary materials) examination, and 1,917 records were further excluded. Finally, 481 studies were included in this systematic review. The selection process is illustrated in Figure 1.
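The counts in this selection flow can be cross-checked with a few lines of arithmetic. The snippet below uses only the numbers stated above; the variable names are introduced here for illustration.

```python
identified = 2596           # records returned by the PubMed search
after_dedup = 2580          # records left after duplicate removal
excluded_screening = 182    # removed by article type, title, abstract
excluded_full_text = 1917   # removed after full-text examination

duplicates_removed = identified - after_dedup
full_text_examined = after_dedup - excluded_screening
included = full_text_examined - excluded_full_text

print(duplicates_removed, full_text_examined, included)  # 16 2398 481
```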

Figure 1 Flowchart of the study selection process.

Statistics of the included publications

Figure 2 shows the increasing number of published radiomics studies in which ICC was used for radiomics reliability assessment from 2012 to 2021. Since 2017, the number of MRI articles has increased much faster than that of CT and PET articles.

Figure 2 Publication number based on imaging modality in recent years.

In terms of publication number, CT, PET and MRI account for 50.73% (n=244), 9.77% (n=47), and 42.20% (n=203) of the total publications (N=481), respectively. Note that some publications involved multi-modality radiomics, so the sum was slightly over 100%.
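The modality shares above, and the reason they sum to more than 100%, can be reproduced from the reported counts. This is a small check using only the numbers in the text; the variable names are illustrative.

```python
total = 481
by_modality = {"CT": 244, "PET": 47, "MRI": 203}

shares = {m: round(100 * n / total, 2) for m, n in by_modality.items()}
extra_counts = sum(by_modality.values()) - total

for modality, share in shares.items():
    print(f"{modality}: {share:.2f}%")
print(f"sum of shares: {sum(shares.values()):.2f}%")
print(f"modality counts beyond one per study: {extra_counts}")
```

The surplus of 13 modality counts reflects the multi-modality studies; it equals the number of such studies only if none of them used all three modalities, which the text does not state.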

Of the included articles, 18.09% (87/481) assessed reliability using ICC as the major study outcome and 81.91% (394/481) as a secondary (partial) outcome. In nature, 5.82% (n=28) of the studies were prospective and 94.18% (n=453) were retrospective (including those studies using imaging data prospectively acquired for clinical purposes other than radiomics).

The numbers of studies based on subject type and anatomical region are illustrated in Figure 3.

Figure 3 Study number based on subject type and anatomical region. The study number is counted repeatedly if multiple subject types or anatomical regions were involved in a study.

Clinical oncology is the most common clinical application in these studies, accounting for 86.69% (417/481) of the total publications. Lung cancer is the most common type, followed by head and neck cancer and neuro-oncology. The predominant cancer type varied with imaging modality: PET and CT studies included a higher proportion of lung and head and neck cancer patients, while MRI studies comprised more neuro-oncology, breast cancer, cervical cancer, and prostate cancer. The publication distribution based on cancer types and imaging modalities is illustrated in Figure 4.

Figure 4 Distribution of oncological patient studies. The study number is counted repeatedly for each type of cancer if a study investigated more than one cancer type.

The characteristics of human radiomics studies

The use of ICC in the included human radiomics studies was categorized by the ICC purpose. The characteristics of the studies were summarized in the following tables. If a study involved more than one ICC purpose, each purpose was separately listed in the corresponding table. SFR was directly extracted or synthesized from the original data from each study as much as possible. However, in many cases, SFR was not available or could not be clearly and reliably extracted, which was thus labeled NA in the tables. One common reason was that ICC was reported in other forms like mean ± SD. Another reason was that ICC was applied for some but not all features, e.g., for those features after correlation assessment or clustering. Meanwhile, for some other studies, very comprehensive ICC results were reported. The SFRs could not be simply extracted and listed. They were labeled as NE.

Radiomics feature reliability due to image acquisition

The use of ICC in the included human radiomics studies to assess the feature reliability due to image acquisition was summarized in Table 1.

Table 1 Summary of human radiomics studies using ICC for image acquisition

Radiomics feature reliability due to image reconstruction

The use of ICC in the included human radiomics studies to assess the feature reliability due to image reconstruction was summarized in Table 2.

Table 2 Summary of human radiomics studies using ICC for image reconstruction

Radiomics feature reliability due to image segmentation

A total of 416 of the 481 studies reported ICC results regarding the feature reliability influenced by image segmentation. Only those studies that substantially investigated radiomics feature reliability to image segmentation as the primary study endpoint are listed in Table 3. Most of the other studies simply mentioned the ICC use for intra-/inter-observer agreement and reported a very brief (usually optimistic) ICC result; SFR could be extracted from 206 of these studies. The SFR distribution is provided by the histogram in Figure S1, and there were no apparent differences between imaging modalities (Figure S2).

Table 3 Summary of human radiomics studies using ICC for image segmentation

Radiomics feature reliability due to image processing

The use of ICC in the included human radiomics studies to assess the feature reliability due to image processing was summarized in Table 4.

Table 4 Summary of human radiomics studies using ICC for image processing

Radiomics feature reliability due to feature quantification

The use of ICC in the included human radiomics studies to assess the feature reliability due to feature quantification was summarized in Table 5.

Table 5 Summary of human radiomics studies using ICC for feature quantification

The characteristics of phantom and animal radiomics studies

The study characteristics in the included phantom and animal radiomics studies were summarized in Table 6.

Table 6 Summary of phantom and animal studies

Quality of ICC use and reporting

Generally speaking, the quality of ICC use and reporting was found to be unsatisfactory in many publications, with various flaws and pitfalls. Only 63 studies (13.10%) explicitly and precisely reported the selected ICC form, among which the ICC definition of absolute agreement rather than consistency on feature values predominated. The rationale for ICC form selection was seldom explained. In the remaining 418 articles, the ICC form was either unavailable, implicit (e.g., giving the general ICC formula not specific to a certain ICC form), or incomplete. The available ICC forms described in the studies, whether complete or not, are summarized in Table S1. Very few studies tested the normality of the data prior to ICC use (as ICC is based on ANOVA). The adopted reliability criteria/levels and the corresponding threshold ICC values could be found in most studies but were heterogeneous. The reliability levels could be binary (low/high, acceptable/unacceptable, stable/unstable, repeatable/unrepeatable, etc.), three-level (e.g., poor/moderate/good), four-level (e.g., poor/moderate/good/excellent), five-level (e.g., poor/fair/moderate/good/excellent), or even six-level. The thresholds of >0.7, >0.75, >0.8, and >0.9 were frequently used to determine the highest reliability level. The thresholds of <0.2, <0.4, and <0.5 were frequently adopted to determine the lowest reliability level. The reported ICC values were normally presented as the mean (± SD), median, range, or interquartile range (IQR). A confidence interval (usually the 95% CI) was reported along with the ICC in only 64 studies. Many studies seemed to interpret reliability levels based on the estimated ICC values without giving or referring to the ICC confidence interval.
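To illustrate why an ICC point estimate alone can mislead, the sketch below attaches an interval to an absolute-agreement ICC. It is a simplified illustration under stated assumptions: the interval is a percentile bootstrap over subjects, swapped in here to keep the example dependency-free, whereas analytic intervals for ICC are usually derived from the F distribution. The data are simulated, and the function names `icc_a1` and `bootstrap_ci` are hypothetical.

```python
import random
from statistics import mean

def icc_a1(ratings):
    """ICC(A,1): two-way, absolute agreement, single measurement."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    rows = [mean(r) for r in ratings]
    cols = [mean(r[j] for r in ratings) for j in range(k)]
    msr = k * sum((m - grand) ** 2 for m in rows) / (n - 1)
    msc = n * sum((m - grand) ** 2 for m in cols) / (k - 1)
    mse = sum(
        (ratings[i][j] - rows[i] - cols[j] + grand) ** 2
        for i in range(n) for j in range(k)
    ) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def bootstrap_ci(ratings, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval obtained by resampling subjects (rows)."""
    rng = random.Random(seed)
    n = len(ratings)
    estimates = sorted(
        icc_a1([ratings[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = estimates[int(alpha / 2 * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Simulated feature: 20 subjects, 2 segmentations each (hypothetical data).
rng = random.Random(42)
truth = [rng.gauss(0.0, 3.0) for _ in range(20)]
ratings = [[t + rng.gauss(0.0, 1.0) for _ in range(2)] for t in truth]

point = icc_a1(ratings)
lower, upper = bootstrap_ci(ratings)
print(f"ICC(A,1) = {point:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

With a small cohort the interval can span more than one reliability level, so a feature labeled by its point estimate alone may not merit that label once the lower bound is considered.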

Notable findings of radiomics feature reliability as revealed by ICC

The reported ICC results were highly heterogeneous, varying by imaging modality, ICC purpose, disease, lesion type, sample size, and feature type. They were also frequently reported with pitfalls. Therefore, it was impractical to conduct quantitative data synthesis and meta-analysis based on the reported ICCs to reliably estimate the achievable absolute radiomics reliability levels for different modalities, purposes, or diseases. Nevertheless, a few notable and consistent findings could still be observed in the reported ICCs, even in the presence of high study heterogeneity.

High satisfactory feature rates were reported for most intra/inter-observer segmentation studies, indicating the high robustness of many radiomics features to intra/inter-observer segmentation variability

In the hundreds of articles using ICC for assessing intra/inter-observer segmentation reliability of radiomics features, only a small number reported relatively negative reliability results. For instance, Jang et al. (39) showed that in the inter-observer segmentation reproducibility study in cardiac patients, only 32.1%, 46.7%, and 35.5% of MRI radiomics features were reproducible with the cine bSSFP, T1 mapping, and T2 mapping, worse than the corresponding 73.1%, 66.8%, and 61.1% reproducible features in the intra-observer segmentation. Liu et al. (84) investigated 109 radiomics features on 436 contrast-enhanced CT images of oropharyngeal cancer patients and found that “most radiomic features in this study varied a lot when the ROIs were not well segmented. For both the representation agreement and predictive agreement, the ICC and CCC were below 0.5 for all the features.” Uthoff et al. (79) reported that “observers had perfect intra-repeatability (ICC =1.0)” but “demonstrated fair inter-reader variability (ICC =0.52)” for 4 observers (2 radiologists, 2 pulmonologists) in 100 cases of non-small cell lung cancer CT scans. Many other studies generally reported high SFRs, implying excellent robustness to intra/inter-observer segmentation disagreement, independent of modalities and diseases, although different ICC thresholds were applied. Meanwhile, radiomics feature robustness to inter-observer segmentation seemed not notably inferior to intra-observer segmentation.

Comparable or better radiomics feature reliability was reported for (semi-)automated segmentation than manual segmentation with much shorter segmentation time

Manual image segmentation in radiomics analysis was labor-intensive for clinicians and time-consuming, and also suffered from intra-/inter-observer segmentation disagreement. This lowered the cost-effectiveness/efficiency of radiomics and greatly hampered its wide application in clinical practice. Thus, considerable effort was devoted to developing (semi-)automated segmentation as a potential alternative in radiomics research. Moreover, in addition to its advantage in segmentation time, (semi-)automated segmentation was frequently reported in the included studies to further reduce the intra-/inter-observer radiomics feature variability induced by manual segmentation, suggesting a future role for (semi-)automated segmentation in more reliable and cost-effective/efficient radiomics analysis (67,69,70,77,80,99,103,141,142).

Acquisition had substantial impacts on radiomics feature values, and its impact on feature reliability was larger than that of intra-/inter-observer segmentation

Based on the reported ICC results, image acquisition had substantial impacts on the radiomics feature values for all imaging modalities and acquisition protocols. Regarding modality dependence, the reported SFRs seemed to be highest in PET and lowest in MRI [excluding the outlier of 100% inter-scanner SFR reported by Zhang J et al. (46)], which might be partially explained by the relatively smooth, low-resolution PET images and the multi-contrast, high-resolution MRI images with considerable anatomical detail acquired by different sequences. Even a simple intra-scanner test-retest could introduce considerable feature value variations (17,18,24,28,32,36,40,44,139). Inter-scanner (or inter-center) acquisition (with similar imaging protocols or imaging parameter changes) induced even more radiomics feature variability than the intra-scanner test-retest (32,35,45,135,137,139). In the studies investigating both acquisition and segmentation, acquisition was consistently reported to have a much larger impact on feature variability (always smaller ICCs) than segmentation (35,39,42,44,48,56,126,132) for all modalities.

Feature reliability and ICCs were heterogeneous for post-processing and feature quantification; optimized post-processing and feature quantification could be used to mitigate acquisition-induced radiomics variability

Image post-processing and feature quantification were usually used to explore the robustness of radiomics features and to improve feature reliability through optimized or standardized approaches. Among these approaches, image intensity discretization and normalization were most frequently investigated. In fact, a tremendous variety of image post-processing and feature quantification methods, algorithms, and tools could be applied to the acquired original images, and these could have remarkable influences on feature values. In many studies, various post-processing and feature quantification approaches were applied and optimized to mitigate the possible radiomics feature variability introduced in the acquisition procedure (28,31,36,41,44,111,114,129). The results suggested that comprehensive image perturbation and quantification might help to improve radiomics reliability, in particular for retrospective radiomics studies in which existing imaging data were used without control over the imaging acquisition protocol. For example, in a study by Zwanenburg et al. (28), image perturbation chains were proposed as an alternative to test-retest imaging for assessing feature robustness. Most of the features that were robust in acquisition test-retest were successfully identified by comprehensive image perturbations. In another study by Suter et al. (64), single-center MRI data were perturbed to simulate unseen multi-center MRI data with greater variability; over 16 million tests of typical perturbations were generated and conducted to identify robust radiomics features for multi-center radiomics studies. In contrast, post-processing and feature quantification were seldom proposed to mitigate or compensate for the radiomics feature variability caused by segmentation.
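As an illustration of why intensity discretization matters for feature values, the sketch below contrasts two commonly described schemes, a fixed number of bins and a fixed bin width. It is a simplified sketch that assumes IBSI-style definitions; the function names and the toy intensities are hypothetical and are not drawn from any reviewed study.

```python
import math

def discretize_fixed_bin_number(values, n_bins):
    """Map intensities to bins 1..n_bins spanning the ROI's own min-max range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1 for _ in values]
    out = []
    for v in values:
        b = math.floor(n_bins * (v - lo) / (hi - lo)) + 1
        out.append(min(b, n_bins))  # the maximum falls into the last bin
    return out

def discretize_fixed_bin_width(values, bin_width, lower_bound=None):
    """Map intensities to bins of constant width starting at lower_bound."""
    lo = min(values) if lower_bound is None else lower_bound
    return [math.floor((v - lo) / bin_width) + 1 for v in values]

# Hypothetical ROI intensities.
roi = [0.0, 10.0, 25.0, 49.0, 100.0]
print(discretize_fixed_bin_number(roi, 4))     # [1, 1, 2, 2, 4]
print(discretize_fixed_bin_width(roi, 25.0))   # [1, 1, 2, 2, 5]
```

A fixed bin number rescales every ROI to the same number of gray levels, while a fixed bin width preserves the intensity scale across ROIs, so texture matrices built on the two outputs can differ even for identical input voxels.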

Shape and first-order (FO) radiomics features were frequently reported to be more robust to various variability factors than texture features in the original image domain

Different types of radiomics features could be subject to different levels of variability under different factors. Among the heterogeneous reported results, it was noticed that shape or first-order (FO, also named histogram) features in the original image domain were often reported to be more robust than texture (also named second-order or higher-order) features to different variability factors of acquisition (18,36,44,57,134), post-processing and quantification (36,115,116,128,130,132), and segmentation (94,99,104,142-145). Among the different types of texture features, GLCM (gray-level co-occurrence matrix) features were observed to be more robust than other texture features in a few studies (18,44,94,116,128,143,145). On the other hand, opposite or deviant results showing low reliability of shape features were also occasionally reported. For instance, Rai et al. reported that none of the shape features exhibited high inter-(MRI) scanner stability (ICC >0.8), the lowest among all feature types (134). Tixier et al. showed that shape features in MRI (ICC =0.74) were among the feature types most affected by the choice of segmentation method, with poorer reliability than first-order and GLCM features (ICC >0.96) (94). Beyond the radiomics features in the original image domain, radiomics features in transformed domains, most frequently the Laplacian of Gaussian (LoG) filtered domain and the wavelet domain, were also investigated in many studies. No uniform conclusion on the robustness of these transformed features relative to the original features could be derived from the included studies.

Other statistical metrics in conjunction with ICC

A variety of statistical metrics were used in conjunction with ICC for different purposes. For segmentation purposes, dice similarity coefficient (DSC) was often reported. Bland-Altman analysis was conducted in some studies involving paired comparison of two observers/acquisitions/measurements. Other types of statistical metrics such as concordance correlation coefficient (CCC), coefficient of variation (CV), Pearson/Spearman correlation coefficient, false discovery rate (FDR), (normalized) dynamic range, Krippendorff’s alpha, percentage difference, and between-class distance (BD) were also used in combination with ICC.

Risk of bias

There were different levels and aspects of potential risks of bias in the included studies for many reasons. Many clinical studies were limited by their retrospective nature and were usually conducted without phantom validation or control over the acquisition protocol. Many technical studies utilized public imaging data, and the heterogeneity in these data might not be well understood or compensated for. Very few studies described the imaging protocol in the level of detail suggested in (4). Similarly, details of the intra-/inter-observer segmentation process were normally insufficiently described. For many clinical studies aiming at radiomics diagnosis or prognosis performance, a possible publication bias toward very high feature reliability to intra/inter-observer segmentation should not be neglected, given that much lower ICCs and SFRs were reported in the studies that substantially assessed feature robustness to segmentation as the primary study endpoint. The statistical power of the calculated ICC might not be strong enough due to the limited sample size and observer/acquisition/measurement number. The investigated radiomics features in many studies might have different definitions even with the same or similar names. They might not have been well standardized due to the different implementations in a variety of software and in-house built programs, in particular in the studies preceding the feature standardization proposed by the image biomarker standardization initiative (IBSI) (146). Besides, risks of bias could also be induced by the flaws and pitfalls of using and reporting ICC as identified in the articles.

Discussion

Radiomics research has been growing explosively in both publication volume and diversity in recent years (5). However, along with the soaring number of radiomics publications, more concerns, questions, and criticisms regarding radiomics reliability have also been raised in the last few years (11,147,148). Some recent systematic reviews (149-153) also showed that many radiomics publications had suboptimal or poor study quality, as revealed by low RQS.

Reliability is closely related to the RQS criteria. For instance, image protocol quality (+1 point), phantom study on all scanners (+1 point), and imaging at multiple time points (+1 point) are all related to acquisition reproducibility/replicability/reliability; multiple segmentations (+1 point) is related to intra-/inter-observer agreement analysis; feature reduction or adjustment for multiple testing (+3 points if implemented or −3 points if not implemented) is related to feature correlation and redundancy analysis. ICC could be used to fulfill these RQS criteria and thereby improve radiomics study quality.

The increase in publication number with time reflects the fact that much more effort has been devoted to investigating radiomics reliability in recent years, in particular for MRI. Clinical oncology is still the major arena of radiomics research as revealed in this review, consistent with a recent bibliometric review (5), while different imaging modalities have substantially different roles in different cancers, as reflected in the proportions of the publications.

Regarding the purpose of ICC use, it is within expectation that ICC was most frequently used for segmentation reliability assessment, particularly for intra-/inter-observer agreement. This could be mainly explained by the large fraction of retrospective studies. It also reflects clinicians' common interests in and concerns about radiomics reliability. It is interesting to note that many intra-/inter-observer agreement studies reported high ICC values or SFRs, which might suggest that many radiomics features are quite robust to (manual) lesion segmentation. In other words, radiomics reliability to lesion segmentation might not be a major concern.

Although segmentation dominated the ICC use, the importance of radiomics reliability in other aspects should not be overlooked or underestimated. Image acquisition and reconstruction are at the very front end of the complex radiomics workflow and greatly impact the quality of the original imaging data for radiomics reliability assessment. Indeed, the reported ICCs strongly suggested that image acquisition and reconstruction impact radiomics reliability much more than segmentation does. However, the influence of image acquisition and reconstruction on radiomics feature reliability remained much less explored than that of segmentation. There are still many unknowns about how image acquisition and reconstruction affect radiomics reliability. Much research work is warranted in the future, in particular for MRI, because of its semi-quantitative image intensity, various image contrasts, and much greater variability in image acquisition and reconstruction compared with CT and PET.

ICC was also frequently used to assess reliability with respect to post-acquisition image processing and feature quantification. The ICCs reported in these studies were heterogeneous and depended strongly on the processing types and their implementations. In theory, an unlimited number of post-processing methods could be applied to the original images, which could potentially lead to even greater radiomics feature variability than acquisition and reconstruction. In practice, however, post-processing and feature quantification are investigated and utilized to mitigate acquisition- and reconstruction-related feature variability by taking advantage of comprehensive and powerful computation capability for robust feature selection, without the need for prior knowledge of image acquisition variability. The evidence is so far not strong enough, and more rigorous validation and evaluation are warranted.

Some common pitfalls in using and reporting ICC were frequently identified in the articles. First, the information on the ICC form and its selection was missing, ambiguous, or incomplete in a large number of articles. Meanwhile, relevant information such as the numbers of scanners/observers/measurements was sometimes not reported, although it is needed to facilitate ICC form selection. When ICC is used as a reliability metric, it is important for researchers to carefully select the most appropriate ICC form. An inappropriately selected ICC form might mathematically yield similar ICC values but could lead to substantially different and even misleading interpretations. No article conducted sample size estimation for ICC calculation, which could be helpful, although not strictly necessary, for ICC precision estimation. In terms of ICC reporting, ICC values were often reported without the confidence interval, so the precision of the ICC could not be known. For instance, a very high ICC value associated with a very wide confidence interval (large uncertainty and low precision) could not guarantee high reliability. Heterogeneous ICC thresholds were used for reliability assessment, which also hampered rigorous data synthesis for cross-study comparison. Occasionally, ICC threshold values were only implicitly indicated or were unavailable. The reliability levels of poor (ICC <0.50), moderate (ICC: 0.50–0.75), good (ICC: 0.75–0.90), and excellent (ICC >0.90) suggested in (13) were frequently adopted. On the other hand, the conditions under which these criteria were suggested were usually neglected, i.e., “As a rule of thumb, researchers should try to obtain at least 30 heterogeneous samples and involve at least 3 raters whenever possible when conducting a reliability study.” (13). The reliability levels also seemed to be inappropriately determined on the basis of the ICC value itself rather than its confidence interval.
ICC was normally interpreted without further quantifying the underlying true variance (σ_T²) and error variance (σ_E²). In fact, a high ICC value might mainly reflect high between-subject heterogeneity (such as in malignant tumors) in the sampled population and does not guarantee the accuracy or precision of radiomics feature quantification. Conversely, a low ICC might result from high homogeneity (such as in normal tissues) among the subjects, even with high measurement accuracy and precision.
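The dependence of ICC on its form and on between-subject variance can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed studies. It assumes the single-measurement forms ICC(3,1) (consistency) and ICC(2,1) (absolute agreement) as defined by Shrout and Fleiss (12), and it uses simulated feature values; the function names, the simulated means and standard deviations, and the 30-unit offset are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def mean_squares(x):
    """Two-way ANOVA mean squares for an n-subjects x k-raters table.

    Returns (BMS, JMS, EMS): between-subjects, between-raters, residual.
    """
    n, k = len(x), len(x[0])
    grand = mean(v for row in x for v in row)
    ss_rows = k * sum((mean(row) - grand) ** 2 for row in x)
    ss_cols = n * sum((mean(row[j] for row in x) - grand) ** 2 for j in range(k))
    ss_total = sum((v - grand) ** 2 for row in x for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    return ss_rows / (n - 1), ss_cols / (k - 1), ss_err / ((n - 1) * (k - 1))


def icc_consistency(x):
    """ICC(3,1): two-way model, single measurement, consistency."""
    k = len(x[0])
    bms, _, ems = mean_squares(x)
    return (bms - ems) / (bms + (k - 1) * ems)


def icc_agreement(x):
    """ICC(2,1): two-way random model, single measurement, absolute agreement."""
    n, k = len(x), len(x[0])
    bms, jms, ems = mean_squares(x)
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


def simulate(n, sd_subject, sd_error, k=2, seed=0):
    """Hypothetical feature values: true subject value plus measurement error."""
    rng = random.Random(seed)
    table = []
    for _ in range(n):
        true_value = rng.gauss(100.0, sd_subject)
        table.append([true_value + rng.gauss(0.0, sd_error) for _ in range(k)])
    return table


if __name__ == "__main__":
    # Same measurement error, different between-subject spread.
    heterogeneous = simulate(200, sd_subject=20.0, sd_error=2.0)
    homogeneous = simulate(200, sd_subject=1.0, sd_error=2.0)
    print("ICC(3,1), heterogeneous subjects:", round(icc_consistency(heterogeneous), 3))
    print("ICC(3,1), homogeneous subjects:  ", round(icc_consistency(homogeneous), 3))

    # A constant offset on the second rater/scanner affects only the agreement form.
    shifted = [[row[0], row[1] + 30.0] for row in heterogeneous]
    print("ICC(3,1) with offset:", round(icc_consistency(shifted), 3))
    print("ICC(2,1) with offset:", round(icc_agreement(shifted), 3))
```

In this sketch the measurement error is identical in both simulated populations, so any difference in ICC between them comes only from the between-subject variance. The offset case shows why the chosen ICC form has to be reported: a systematic bias between raters or scanners leaves the consistency form unchanged but lowers the absolute-agreement form.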

Some suggestions could be given to mitigate the identified pitfalls in future radiomics studies. Overall, if radiomics reliability itself is the major purpose of a radiomics study, the guidelines for reporting reliability and agreement studies (GRRAS) should be helpful in study planning (154). To facilitate ICC form selection, the model, type, and definition of the ICC form should be justified or explained. Relevant information such as the numbers of scanners/observers/measurements needs to be sufficiently disclosed. The guideline proposed by Koo and Li is an excellent reference and is easy to follow (13). It would be very helpful to conduct sample size estimation for ICC calculation to ensure that the study has an adequate chance of achieving the desired ICC precision (155,156). After ICC form selection, the tool used for ICC calculation should be reported with its software name, version, and settings. The ICC results should be reported along with the confidence interval. Meanwhile, the criteria for ICC appraisal should be clearly described. It should also be kept in mind that the estimated confidence interval, not the ICC value itself, forms the basis for evaluating the reliability level. Along with ICC, the joint use of other statistical metrics could strengthen the study quality and statistical power. For instance, if paired observers/acquisitions/measurements are involved, Bland-Altman analysis is beneficial. For segmentation reliability assessment, the Dice similarity coefficient (DSC) is desirable. Last but not least, the acceptability of an ICC should be determined by the requirements of each specific study and clinical application, rather than simply by the values calculated from specific sample populations and pre-defined thresholds.
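The reporting suggestions above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a recommended implementation. The percentile bootstrap over subjects is one possible way to obtain a confidence interval and is our choice for the sketch (analytical F-based intervals are the more common alternative). The ICC form is fixed to ICC(3,1), grading on the lower confidence bound is a conservative convention assumed here, and all function names and example numbers are hypothetical. The thresholds are those quoted above from (13).

```python
import random
from statistics import mean, stdev


def icc_consistency(x):
    """ICC(3,1) from two-way ANOVA mean squares (n subjects x k raters)."""
    n, k = len(x), len(x[0])
    grand = mean(v for row in x for v in row)
    ss_rows = k * sum((mean(row) - grand) ** 2 for row in x)
    ss_cols = n * sum((mean(row[j] for row in x) - grand) ** 2 for j in range(k))
    ss_err = sum((v - grand) ** 2 for row in x for v in row) - ss_rows - ss_cols
    bms, ems = ss_rows / (n - 1), ss_err / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems)


def bootstrap_ci(x, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling subjects (rows) with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        icc_consistency([rng.choice(x) for _ in x]) for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def reliability_level(value):
    """Reliability levels using the thresholds quoted in the text from (13)."""
    if value < 0.50:
        return "poor"
    if value < 0.75:
        return "moderate"
    if value <= 0.90:
        return "good"
    return "excellent"


def bland_altman(a, b):
    """Bias and 95% limits of agreement for paired measurements."""
    diffs = [p - q for p, q in zip(a, b)]
    bias, sd = mean(diffs), stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def dice(mask_a, mask_b):
    """Dice similarity coefficient for two sets of voxel indices."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # two empty masks are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    # Hypothetical feature values from two observers on 30 subjects.
    rng = random.Random(42)
    truth = [rng.gauss(100.0, 15.0) for _ in range(30)]
    table = [[t + rng.gauss(0.0, 5.0), t + rng.gauss(0.0, 5.0)] for t in truth]

    point = icc_consistency(table)
    lo, hi = bootstrap_ci(table)
    print(f"ICC(3,1) = {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
    print("level by point estimate:", reliability_level(point))
    print("level by lower CI bound:", reliability_level(lo))

    bias, lower, upper = bland_altman([r[0] for r in table], [r[1] for r in table])
    print(f"Bland-Altman bias {bias:.2f}, limits of agreement [{lower:.2f}, {upper:.2f}]")
    print("DSC of two toy masks:", round(dice({1, 2, 3, 4}, {3, 4, 5}), 3))
```

Grading the lower confidence bound alongside the point estimate makes it visible when a nominally high ICC is compatible with a lower reliability level, which is the situation the pitfalls discussion warns about.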

There are some limitations to this study. First, the literature search was conducted in a single database, although this limitation was partially compensated by prior knowledge of additional papers. Meanwhile, even in a single database, there are tremendous numbers of publications relevant to radiomics, and it is not uncommon that a variety of other terms are used in place of radiomics, which makes the precise identification of these publications even more difficult. Thus, some potentially eligible studies might still have been missed. The ICC use and its result reporting had to be recognized and extracted through examination of the full text (and even the Supplementary materials) rather than through title and abstract screening. This procedure involved tremendous work and might have slightly affected the inclusion and exclusion of papers. Nonetheless, given the large sample size of 481 studies, these issues should not have considerably weakened or biased the statistics in this review. Second, this review concentrated on a single reliability metric, i.e., ICC, which covers only a very narrow part of the general topic of radiomics reliability. ICC is only applicable to continuous variables, so the radiomics reliability revealed by ICC is usually at the level of radiomics feature values. The role of ICC is relatively minor in the reliability aspects of radiomics feature reduction and modeling, as well as model outcome/performance assessment. It is acknowledged that many other statistical metrics could be applicable or more suitable for radiomics reliability assessment in various scenarios, providing complementary or additional information beyond ICC. Thus, the current status of radiomics reliability could only be partially reflected in the included papers. This study was by no means a systematic review and meta-analysis of the diagnostic accuracy of radiomics, so study quality was not individually assessed in each article by following QUADAS-2 (15), TRIPOD (16), or RQS (4), although the PRISMA statement was followed (157).
Third, there were great difficulties in study quality normalization, data synthesis, and harmonization because of the highly heterogeneous study characteristics and the pitfalls in ICC use and reporting. It was very difficult to conduct a quantitative cross-study analysis of ICC. The use of SFR slightly mitigated this issue, but SFR itself also had pitfalls, such as the different ICC thresholds used. Therefore, a consensus on the degree of radiomics reliability that has been achieved, or could be achieved, in radiomics research could not be safely derived. Fourth, radiomics feature reliability has been suggested to depend on imaging modality, organ, disease, and other factors, which was also noticed in some of the included studies (72,89,91,103,104,158). However, these dependencies could not be further generalized in this review. Our study collected, analyzed, and presented data in a modality-neutral and disease-neutral way. Moreover, we recognized that investigating these dependencies remained an extremely difficult task in the presence of highly heterogeneous study characteristics, even though hundreds of studies had been included. On the other hand, it should be cautioned that presenting modality-neutral or disease-neutral common findings might itself carry a potential risk of bias. These findings might not be valid when applied to some less common diseases or other modalities. Therefore, future research efforts on disease-specific and modality-specific feature reliability are desirable. Fifth, some flaws and pitfalls in selecting, reporting, and interpreting ICC were identified in many radiomics studies, so some suggestions were given. However, we did not intend to propose a standardized form of ICC use for future radiomics studies.
The standardization of QIB metrology (159), the IBSI radiomics feature standardization (146,160), the guidelines for reporting reliability and agreement studies (GRRAS) (154), the general guideline for selecting and reporting ICCs (13), and statistical methods for clinical reliability in different aspects (121,161-164) have been well established in the medical literature. They could act as excellent guidelines or references for radiomics study planning. However, a consensus on standardized radiomics reliability assessment and reporting is yet to be reached by the whole community.

Conclusions

This study attempted to provide an updated overview of the current status of radiomics reliability research from the perspective of using and reporting ICC in the fast-expanding radiomics literature. The 481 eligible CT, PET, and MRI radiomics studies yielded by the literature search partially revealed that much more effort has been devoted to rigorously assessing radiomics reliability for clinical use, in particular in the last two years. ICC was used for assessing different aspects of radiomics feature reliability in these studies, but feature reliability with respect to image segmentation was reported much more often than reliability to other factors such as image acquisition, reconstruction, post-processing, and feature quantification. As indicated by the satisfactory ICCs reported for intra-/inter-observer segmentation agreement, manual segmentation seems to be the least influential factor on radiomics reliability, although the risk of bias should be kept in mind. (Semi-)automated segmentation may further increase segmentation agreement, and hence radiomics feature reliability, with better cost-effectiveness/efficiency in the future. Image acquisition could introduce much more feature variability than image segmentation. More research on radiomics reliability with respect to image acquisition and reconstruction is desired. Comprehensive image post-processing and feature quantification techniques could be applied for radiomics analysis and yield different levels of radiomics reliability. Optimized image post-processing and feature quantification could be used to mitigate image acquisition-induced variability and thus improve reliability. There were some common flaws and pitfalls in ICC use, as identified in many studies. Thus, some suggestions were given to mitigate them and to improve the quality of future radiomics reliability research.
Unfortunately, it was also recognized that the included studies were highly heterogeneous in characteristics and quality, which greatly hampered reliable data synthesis for further meta-analysis. Therefore, no consensus on the degree of radiomics reliability that has been achieved, or could be achieved, in radiomics research could be safely derived from this review. More research is warranted in the future.

Acknowledgments

Funding: This study was supported by hospital research project REC-2019-09. The authors have no relevant conflicts of interest to disclose.

Footnote

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://dx.doi.org/10.21037/qims-21-86). Dr. JY serves as an unpaid Associate Editor of Quantitative Imaging in Medicine and Surgery. The authors have no other conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Lambin P, Rios-Velazquez E, Leijenaar R, Carvalho S, van Stiphout RG, Granton P, Zegers CM, Gillies R, Boellard R, Dekker A, Aerts HJ. Radiomics: extracting more information from medical images using advanced feature analysis. Eur J Cancer 2012;48:441-6. [Crossref] [PubMed]
Gillies RJ, Kinahan PE, Hricak H. Radiomics: Images Are More than Pictures, They Are Data. Radiology 2016;278:563-77. [Crossref] [PubMed]
Avanzo M, Stancanello J, El Naqa I. Beyond imaging: The promise of radiomics. Phys Med 2017;38:122-39. [Crossref] [PubMed]
Lambin P, Leijenaar RTH, Deist TM, Peerlings J, de Jong EEC, van Timmeren J, et al. Radiomics: the bridge between medical imaging and personalized medicine. Nat Rev Clin Oncol 2017;14:749-62. [Crossref] [PubMed]
Song J, Yin Y, Wang H, Chang Z, Liu Z, Cui L. A review of original articles published in the emerging field of radiomics. Eur J Radiol 2020;127:108991 [Crossref] [PubMed]
Reuzé S, Schernberg A, Orlhac F, Sun R, Chargari C, Dercle L, Deutsch E, Buvat I, Robert C. Radiomics in Nuclear Medicine Applied to Radiation Therapy: Methods, Pitfalls, and Challenges. Int J Radiat Oncol Biol Phys 2018;102:1117-42. [Crossref] [PubMed]
Yip SS, Aerts HJ. Applications and limitations of radiomics. Phys Med Biol 2016;61:R150-66. [Crossref] [PubMed]
Miles K. Radiomics for personalised medicine: the long road ahead. Br J Cancer 2020;122:929-30. [Crossref] [PubMed]
Fornacon-Wood I, Faivre-Finn C, O'Connor JPB, Price GJ. Radiomics as a personalized medicine tool in lung cancer: Separating the hope from the hype. Lung Cancer 2020;146:197-208. [Crossref] [PubMed]
Pinto Dos Santos D, Dietzel M, Baessler B. A decade of radiomics research: are images really data or just patterns in the noise? Eur Radiol 2021;31:1-4. [Crossref] [PubMed]
Hatt M, Le Rest CC, Tixier F, Badic B, Schick U, Visvikis D. Radiomics: Data Are Also Images. J Nucl Med 2019;60:38S-44S. [Crossref] [PubMed]
Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86:420-8. [Crossref] [PubMed]
Koo TK, Li MY. A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. J Chiropr Med 2016;15:155-63. [Crossref] [PubMed]
Traverso A, Wee L, Dekker A, Gillies R. Repeatability and Reproducibility of Radiomic Features: A Systematic Review. Int J Radiat Oncol Biol Phys 2018;102:1143-58. [Crossref] [PubMed]
Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, Leeflang MM, Sterne JA, Bossuyt PM; QUADAS-2 Group. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529-36. [Crossref] [PubMed]
Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): the TRIPOD statement. Ann Intern Med 2015;162:55-63. [Crossref] [PubMed]
Aerts HJ, Grossmann P, Tan Y, Oxnard GR, Rizvi N, Schwartz LH, Zhao B. Defining a Radiomic Response Phenotype: A Pilot Study using targeted therapy in NSCLC. Sci Rep 2016;6:33860. [Crossref] [PubMed]
Hu P, Wang J, Zhong H, Zhou Z, Shen L, Hu W, Zhang Z. Reproducibility with repeat CT in radiomics study for rectal cancer. Oncotarget 2016;7:71440-6. [Crossref] [PubMed]
Huynh E, Coroller TP, Narayan V, Agrawal V, Hou Y, Romano J, Franco I, Mak RH, Aerts HJ. CT-based radiomic analysis of stereotactic body radiation therapy patients with lung cancer. Radiother Oncol 2016;120:258-66. [Crossref] [PubMed]
Huynh E, Coroller TP, Narayan V, Agrawal V, Romano J, Franco I, Parmar C, Hou Y, Mak RH, Aerts HJ. Associations of Radiomic Data Extracted from Static and Respiratory-Gated CT Scans with Disease Recurrence in Lung Cancer Patients Treated with SBRT. PLoS One 2017;12:e0169172 [Crossref] [PubMed]
Hosny A, Parmar C, Coroller TP, Grossmann P, Zeleznik R, Kumar A, Bussink J, Gillies RJ, Mak RH, Aerts HJWL. Deep learning for lung cancer prognostication: A retrospective multi-cohort radiomics study. PLoS Med 2018;15:e1002711 [Crossref] [PubMed]
Soufi M, Arimura H, Nagami N. Identification of optimal mother wavelets in survival prediction of lung cancer patients using wavelet decomposition-based radiomic features. Med Phys 2018;45:5116-28. [Crossref] [PubMed]
Dou TH, Coroller TP, van Griethuysen JJM, Mak RH, Aerts H. Peritumoral radiomics features predict distant metastasis in locally advanced NSCLC. PLoS One 2018;13:e0206108 [Crossref] [PubMed]
Huang L, Chen J, Hu W, Xu X, Liu D, Wen J, Lu J, Cao J, Zhang J, Gu Y, Wang J, Fan M. Assessment of a Radiomic Signature Developed in a General NSCLC Cohort for Predicting Overall Survival of ALK-Positive Patients With Different Treatment Types. Clin Lung Cancer 2019;20:e638-51. [Crossref] [PubMed]
Khorrami M, Khunger M, Zagouras A, Patil P, Thawani R, Bera K, Rajiah P, Fu P, Velcheti V, Madabhushi A. Combination of Peri- and Intratumoral Radiomic Features on Baseline CT Scans Predicts Response to Chemotherapy in Lung Adenocarcinoma. Radiol Artif Intell 2019;1:e180012 [Crossref] [PubMed]
Khorrami M, Jain P, Bera K, Alilou M, Thawani R, Patil P, Ahmad U, Murthy S, Stephans K, Fu P, Velcheti V, Madabhushi A. Predicting pathologic response to neoadjuvant chemoradiation in resectable stage III non-small cell lung cancer patients using computed tomography radiomic features. Lung Cancer 2019;135:1-9. [Crossref] [PubMed]
Osman SOS, Leijenaar RTH, Cole AJ, Lyons CA, Hounsell AR, Prise KM, O'Sullivan JM, Lambin P, McGarry CK, Jain S. Computed Tomography-based Radiomics for Risk Stratification in Prostate Cancer. Int J Radiat Oncol Biol Phys 2019;105:448-56. [Crossref] [PubMed]
Zwanenburg A, Leger S, Agolli L, Pilz K, Troost EGC, Richter C, Löck S. Assessing robustness of radiomic features by image perturbation. Sci Rep 2019;9:614. [Crossref] [PubMed]
Kadoya N, Tanaka S, Kajikawa T, Tanabe S, Abe K, Nakajima Y, Yamamoto T, Takahashi N, Takeda K, Dobashi S, Takeda K, Nakane K, Jingu K. Homology-based radiomic features for prediction of the prognosis of lung cancer based on CT-based radiomics. Med Phys 2020;47:2197-205. [Crossref] [PubMed]
Khorrami M, Bera K, Leo P, Vaidya P, Patil P, Thawani R, Velu P, Rajiah P, Alilou M, Choi H, Feldman MD, Gilkeson RC, Linden P, Fu P, Pass H, Velcheti V, Madabhushi A. Stable and discriminating radiomic predictor of recurrence in early stage non-small cell lung cancer: Multi-site study. Lung Cancer 2020;142:90-7. [Crossref] [PubMed]
Ligero M, Jordi-Ollero O, Bernatowicz K, Garcia-Ruiz A, Delgado-Muñoz E, Leiva D, Mast R, Suarez C, Sala-Llonch R, Calvo N, Escobar M, Navarro-Martin A, Villacampa G, Dienstmann R, Perez-Lopez R. Minimizing acquisition-related radiomics variability by image resampling and batch effect correction to allow for large-scale data analysis. Eur Radiol 2021;31:1460-70. [Crossref] [PubMed]
Prayer F, Hofmanninger J, Weber M, Kifjak D, Willenpart A, Pan J, Röhrich S, Langs G, Prosch H. Variability of computed tomography radiomics features of fibrosing interstitial lung disease: A test-retest study. Methods 2021;188:98-104. [Crossref] [PubMed]
Vuong D, Bogowicz M, Denzler S, Oliveira C, Foerster R, Amstutz F, Gabryś HS, Unkelbach J, Hillinger S, Thierstein S, Xyrafas A, Peters S, Pless M, Guckenberger M, Tanadini-Lang S. Comparison of robust to standardized CT radiomics models to predict overall survival for non-small cell lung cancer patients. Med Phys 2020;47:4045-53. [Crossref] [PubMed]
Yamashita R, Perrin T, Chakraborty J, Chou JF, Horvat N, Koszalka MA, Midya A, Gonen M, Allen P, Jarnagin WR, Simpson AL, Do RKG. Radiomic feature reproducibility in contrast-enhanced CT of the pancreas is affected by variabilities in scan parameters and manual segmentation. Eur Radiol 2020;30:195-205. [Crossref] [PubMed]
Fiset S, Welch ML, Weiss J, Pintilie M, Conway JL, Milosevic M, Fyles A, Traverso A, Jaffray D, Metser U, Xie J, Han K. Repeatability and reproducibility of MRI-based radiomic features in cervical cancer. Radiother Oncol 2019;135:107-14. [Crossref] [PubMed]
Li Z, Duan H, Zhao K, Ding Y. Stability of MRI Radiomics Features of Hippocampus: An Integrated Analysis of Test-Retest and Inter-Observer Variability. IEEE Access 2019;7:97106-16.
Zinn PO, Singh SK, Kotrotsou A, Hassan I, Thomas G, Luedi MM, et al. A Coclinical Radiogenomic Validation Study: Conserved Magnetic Resonance Radiomic Appearance of Periostin-Expressing Glioblastoma in Patients and Xenograft Models. Clin Cancer Res 2018;24:6288-99. [Crossref] [PubMed]
Bologna M, Corino V, Tenconi C, Facchinetti N, Calareso G, Iacovelli N, Cavallo A, Alfieri S, Cavalieri S, Fallai C, Valdagni R, Rancati T, Trama A, Licitra L, Orlandi E, Mainardi L. Methodology and technology for the development of a prognostic MRI-based radiomic model for the outcome of head and neck cancer patients. Annu Int Conf IEEE Eng Med Biol Soc 2020;2020:1152-5. [PubMed]
Jang J, Ngo LH, Mancio J, Kucukseymen S, Rodriguez J, Pierce P, Goddu B, Nezafat R. Reproducibility of Segmentation-based Myocardial Radiomic Features with Cardiac MRI. Radiol Cardiothorac Imaging 2020;2:e190216 [Crossref] [PubMed]
Merisaari H, Taimen P, Shiradkar R, Ettala O, Pesola M, Saunavaara J, Boström PJ, Madabhushi A, Aronen HJ, Jambor I. Repeatability of radiomics and machine learning for DWI: Short-term repeatability study of 112 patients with prostate cancer. Magn Reson Med 2020;83:2293-309. [Crossref] [PubMed]
Pandey U, Saini J, Kumar M, Gupta R, Ingalhalikar M. Normative Baseline for Radiomics in Brain MRI: Evaluating the Robustness, Regional Variations, and Reproducibility on FLAIR Images. J Magn Reson Imaging 2021;53:394-407. [Crossref] [PubMed]
Gutmann DAP, Rospleszcz S, Rathmann W, Schlett CL, Peters A, Wachinger C, Gatidis S, Bamberg F. MRI-Derived Radiomics Features of Hepatic Fat Predict Metabolic States in Individuals without Cardiovascular Disease. Acad Radiol 2020; Epub ahead of print. [Crossref] [PubMed]
Scalco E, Belfatto A, Mastropietro A, Rancati T, Avuzzi B, Messina A, Valdagni R, Rizzo G. T2w-MRI signal normalization affects radiomics features reproducibility. Med Phys 2020;47:1680-91. [Crossref] [PubMed]
Shiri I, Hajianfar G, Sohrabi A, Abdollahi H, P, Shayesteh S, Geramifar P, Zaidi H, Oveisi M, Rahmim A. Repeatability of radiomic features in magnetic resonance imaging of glioblastoma: Test-retest and image registration analyses. Med Phys 2020;47:4265-80. [Crossref] [PubMed]
Ta D, Khan M, Ishaque A, Seres P, Eurich D, Yang YH, Kalra S. Reliability of 3D texture analysis: A multicenter MRI study of the brain. J Magn Reson Imaging 2020;51:1200-9. [Crossref] [PubMed]
Zhang J, Yao K, Liu P, Liu Z, Han T, Zhao Z, Cao Y, Zhang G, Zhang J, Tian J, Zhou J. A radiomics model for preoperative prediction of brain invasion in meningioma non-invasively based on MRI: A multicentre study. EBioMedicine 2020;58:102933 [Crossref] [PubMed]
Han Y, Yang Y, Shi ZS, Zhang AD, Yan LF, Hu YC, Feng LL, Ma J, Wang W, Cui GB. Distinguishing brain inflammation from grade II glioma in population without contrast enhancement: a radiomics analysis based on conventional MRI. Eur J Radiol 2021;134:109467 [Crossref] [PubMed]
Leijenaar RT, Carvalho S, Velazquez ER, van Elmpt WJ, Parmar C, Hoekstra OS, Hoekstra CJ, Boellaard R, Dekker AL, Gillies RJ, Aerts HJ, Lambin P. Stability of FDG-PET Radiomics features: an integrated analysis of test-retest and inter-observer variability. Acta Oncol 2013;52:1391-7. [Crossref] [PubMed]
Willaime JM, Turkheimer FE, Kenny LM, Aboagye EO. Quantification of intra-tumour cell proliferation heterogeneity using imaging descriptors of 18F fluorothymidine-positron emission tomography. Phys Med Biol 2013;58:187-203. [Crossref] [PubMed]
van Velden FH, Nissen IA, Jongsma F, Velasquez LM, Hayes W, Lammertsma AA, Hoekstra OS, Boellaard R. Test-retest variability of various quantitative measures to characterize tracer uptake and/or tracer uptake heterogeneity in metastasized liver for patients with colorectal carcinoma. Mol Imaging Biol 2014;16:13-8. [Crossref] [PubMed]
Cheng NM, Fang YH, Tsan DL, Hsu CH, Yen TC. Respiration-Averaged CT for Attenuation Correction of PET Images - Impact on PET Texture Features in Non-Small Cell Lung Cancer Patients. PLoS One 2016;11:e0150509 [Crossref] [PubMed]
van Rossum PS, Fried DV, Zhang L, Hofstetter WL, van Vulpen M, Meijer GJ, Court LE, Lin SH. The Incremental Value of Subjective and Quantitative Assessment of 18F-FDG PET for the Prediction of Pathologic Complete Response to Preoperative Chemoradiotherapy in Esophageal Cancer. J Nucl Med 2016;57:691-700. [Crossref] [PubMed]
Carvalho S, Leijenaar RTH, Troost EGC, van Timmeren JE, Oberije C, van Elmpt W, de Geus-Oei LF, Bussink J, Lambin P. 18F-fluorodeoxyglucose positron-emission tomography (FDG-PET)-Radiomics of metastatic lymph nodes and primary tumor in non-small cell lung cancer (NSCLC) - A prospective externally validated study. PLoS One 2018;13:e0192859 [Crossref] [PubMed]
Jiang Y, Yuan Q, Lv W, Xi S, Huang W, Sun Z, Chen H, Zhao L, Liu W, Hu Y, Lu L, Ma J, Li T, Yu J, Wang Q, Li G. Radiomic signature of (18)F fluorodeoxyglucose PET/CT for prediction of gastric cancer survival and chemotherapeutic benefits. Theranostics 2018;8:5915-28. [Crossref] [PubMed]
Lin C, Harmon S, Bradshaw T, Eickhoff J, Perlman S, Liu G, Jeraj R. Response-to-repeatability of quantitative imaging features for longitudinal response assessment. Phys Med Biol 2019;64:025019 [Crossref] [PubMed]
Manabe O, Ohira H, Hirata K, Hayashi S, Naya M, Tsujino I, Aikawa T, Koyanagawa K, Oyama-Manabe N, Tomiyama Y, Magota K, Yoshinaga K, Tamaki N. Use of (18)F-FDG PET/CT texture analysis to diagnose cardiac sarcoidosis. Eur J Nucl Med Mol Imaging 2019;46:1240-7. [Crossref] [PubMed]
Vuong D, Tanadini-Lang S, Huellner MW, Veit-Haibach P, Unkelbach J, Andratschke N, Kraft J, Guckenberger M, Bogowicz M. Interchangeability of radiomic features between [18F]-FDG PET/CT and [18F]-FDG PET/MR. Med Phys 2019;46:1677-85. [Crossref] [PubMed]
Desseroit MC, Tixier F, Weber WA, Siegel BA, Cheze Le Rest C, Visvikis D, Hatt M. Reliability of PET/CT Shape and Heterogeneity Features in Functional and Morphologic Components of Non-Small Cell Lung Cancer Tumors: A Repeatability Analysis in a Prospective Multicenter Cohort. J Nucl Med 2017;58:406-11. [Crossref] [PubMed]
Jiang J, Zhou H, Duan H, Liu X, Zuo C, Huang Z, Yu Z, Yan Z; Alzheimer's Disease Neuroimaging Initiative. A novel individual-level morphological brain networks constructing method and its evaluation in PET and MR images. Heliyon 2017;3:e00475 [Crossref] [PubMed]
Ahn SJ, Kim JH, Lee SM, Park SJ, Han JK. CT reconstruction algorithms affect histogram and texture analysis: evidence for liver parenchyma, focal solid liver lesions, and renal cysts. Eur Radiol 2019;29:4008-15. [Crossref] [PubMed]
Kolossváry M, Szilveszter B, Karády J, Drobni ZD, Merkely B, Maurovich-Horvat P. Effect of image reconstruction algorithms on volumetric and radiomic parameters of coronary plaques. J Cardiovasc Comput Tomogr 2019;13:325-30. [Crossref] [PubMed]
Koo HJ, Sung YS, Shim WH, Xu H, Choi CM, Kim HR, Lee JB, Kim MY. Quantitative Computed Tomography Features for Predicting Tumor Recurrence in Patients with Surgically Resected Adenocarcinoma of the Lung. PLoS One 2017;12:e0167955 [Crossref] [PubMed]
Lee SH, Cho HH, Lee HY, Park H. Clinical impact of variability on CT radiomics and suggestions for suitable feature selection: a focus on lung cancer. Cancer Imaging 2019;19:54. [Crossref] [PubMed]
Suter Y, Knecht U, Alão M, Valenzuela W, Hewer E, Schucht P, Wiest R, Reyes M. Radiomics for glioblastoma survival analysis in pre-operative MRI: exploring feature robustness, class boundaries, and machine learning techniques. Cancer Imaging 2020;20:55. [Crossref] [PubMed]
Altazi BA, Zhang GG, Fernandez DC, Montejo ME, Hunt D, Werner J, Biagioli MC, Moros EG. Reproducibility of F18-FDG PET radiomic features for different cervical tumor segmentation methods, gray-level discretization, and reconstruction algorithms. J Appl Clin Med Phys 2017;18:32-48. [Crossref] [PubMed]
van Velden FH, Kramer GM, Frings V, Nissen IA, Mulder ER, de Langen AJ, Hoekstra OS, Smit EF, Boellaard R. Repeatability of Radiomic Features in Non-Small-Cell Lung Cancer [(18)F]FDG-PET/CT Studies: Impact of Reconstruction and Delineation. Mol Imaging Biol 2016;18:788-95. [Crossref] [PubMed]
Parmar C, Rios Velazquez E, Leijenaar R, Jermoumi M, Carvalho S, Mak RH, Mitra S, Shankar BU, Kikinis R, Haibe-Kains B, Lambin P, Aerts HJ. Robust Radiomics feature quantification using semiautomatic volumetric segmentation. PLoS One 2014;9:e102107 [Crossref] [PubMed]
Echegaray S, Gevaert O, Shah R, Kamaya A, Louie J, Kothary N, Napel S. Core samples for radiomics features that are insensitive to tumor segmentation: method and pilot study using CT images of hepatocellular carcinoma. J Med Imaging (Bellingham) 2015;2:041011 [Crossref] [PubMed]
Echegaray S, Nair V, Kadoch M, Leung A, Rubin D, Gevaert O, Napel S. A Rapid Segmentation-Insensitive "Digital Biopsy" Method for Radiomic Feature Extraction: Method and Pilot Study Using CT Images of Non-Small Cell Lung Cancer. Tomography 2016;2:283-94. [Crossref] [PubMed]
Qiu Q, Duan J, Gong G, Lu Y, Li D, Lu J. Reproducibility of radiomic features with GrowCut and GraphCut semiautomatic tumor segmentation in hepatocellular carcinoma. Transl Cancer Res 2017;6:940-8. [Crossref]
Owens CA, Peterson CB, Tang C, Koay EJ, Yu W, Mackin DS, Li J, Salehpour MR, Fuentes DT, Court LE, Yang J. Lung tumor segmentation methods: Impact on the uncertainty of radiomics features for non-small cell lung cancer. PLoS One 2018;13:e0205003 [Crossref] [PubMed]
Pavic M, Bogowicz M, Würms X, Glatz S, Finazzi T, Riesterer O, Roesch J, Rudofsky L, Friess M, Veit-Haibach P, Huellner M, Opitz I, Weder W, Frauenfelder T, Guckenberger M, Tanadini-Lang S. Influence of inter-observer delineation variability on radiomics stability in different tumor sites. Acta Oncol 2018;57:1070-4. [Crossref] [PubMed]
Kocak B, Ates E, Durmaz ES, Ulusan MB, Kilickesmez O. Influence of segmentation margin on machine learning-based high-dimensional quantitative CT texture analysis: a reproducibility study on renal clear cell carcinomas. Eur Radiol 2019;29:4765-75. [Crossref] [PubMed]
Kocak B, Durmaz ES, Kaya OK, Ates E, Kilickesmez O. Reliability of Single-Slice-Based 2D CT Texture Analysis of Renal Masses: Influence of Intra- and Interobserver Manual Segmentation Variability on Radiomic Feature Reproducibility. AJR Am J Roentgenol 2019;213:377-83. [Crossref] [PubMed]
Moltz JH (editor). Stability of radiomic features of liver lesions from manual delineation in CT scans. Medical Imaging 2019: Computer-Aided Diagnosis. Bellingham, Washington: International Society for Optics and Photonics, 2019.
Mori M, Benedetti G, Partelli S, Sini C, Andreasi V, Broggi S, Barbera M, Cattaneo GM, Muffatti F, Panzeri M, Falconi M, Fiorino C, De Cobelli F. CT radiomic features of pancreatic neuroendocrine neoplasms (panNEN) are robust against delineation uncertainty. Phys Med 2019;57:41-6. [Crossref] [PubMed]
Qiu Q, Duan J, Duan Z, Meng X, Ma C, Zhu J, Lu J, Liu T, Yin Y. Reproducibility and non-redundancy of radiomic features extracted from arterial phase CT scans in hepatocellular carcinoma patients: impact of tumor segmentation variability. Quant Imaging Med Surg 2019;9:453-64. [Crossref] [PubMed]
Sung P, Lee JM, Joo I, Lee S, Kim TH, Ganeshan B. Evaluation of the Impact of Iterative Reconstruction Algorithms on Computed Tomography Texture Features of the Liver Parenchyma Using the Filtration-Histogram Method. Korean J Radiol 2019;20:558-68. [Crossref] [PubMed]
Uthoff J, Nagpal P, Sanchez R, Gross TJ, Lee C, Sieren JC. Differentiation of non-small cell lung cancer and histoplasmosis pulmonary nodules: insights from radiomics model performance compared with clinician observers. Transl Lung Cancer Res 2019;8:979-88. [Crossref] [PubMed]
Caballo M, Pangallo DR, Mann RM, Sechopoulos I. Deep learning-based segmentation of breast masses in dedicated breast CT imaging: Radiomic feature stability between radiologists and artificial intelligence. Comput Biol Med 2020;118:103629 [Crossref] [PubMed]
Haarburger C, Muller-Franzes G, Weninger L, Kuhl C, Truhn D, Merhof D. Radiomics feature reproducibility under inter-rater variability in segmentations of CT images. Sci Rep 2020;10:12688. [Crossref] [PubMed]
Kakino R, Nakamura M, Mitsuyoshi T, Shintani T, Hirashima H, Matsuo Y, Mizowaki T. Comparison of radiomic features in diagnostic CT images with and without contrast enhancement in the delayed phase for NSCLC patients. Phys Med 2020;69:176-82. [Crossref] [PubMed]
Kulkarni A, Carrion-Martinez I, Dhindsa K, Alaref AA, Rozenberg R, van der Pol CB. Pancreas adenocarcinoma CT texture analysis: comparison of 3D and 2D tumor segmentation techniques. Abdom Radiol (NY) 2021;46:1027-33. [Crossref] [PubMed]
Liu R, Elhalawani H, Radwan Mohamed AS, Elgohari B, Court L, Zhu H, Fuller CD. Stability analysis of CT radiomic features with respect to segmentation variation in oropharyngeal cancer. Clin Transl Radiat Oncol 2019;21:11-8. [Crossref] [PubMed]
Nguyen K, Schieda N, James N, McInnes MDF, Wu M, Thornhill RE. Effect of phase of enhancement on texture analysis in renal masses evaluated with non-contrast-enhanced, corticomedullary, and nephrographic phase-enhanced CT images. Eur Radiol 2021;31:1676-86. [Crossref] [PubMed]
Ren J, Yuan Y, Qi M, Tao X. Machine learning-based CT texture analysis to predict HPV status in oropharyngeal squamous cell carcinoma: comparison of 2D and 3D segmentation. Eur Radiol 2020;30:6858-66. [Crossref] [PubMed]
Adduru VR, Michael AM, Helguera M, Baum SA, Moore GJ. Leveraging Clinical Imaging Archives for Radiomics: Reliability of Automated Methods for Brain Volume Measurement. Radiology 2017;284:862-9. [Crossref] [PubMed]
Lee M, Woo B, Kuo MD, Jamshidi N, Kim JH. Quality of Radiomic Features in Glioblastoma Multiforme: Impact of Semi-Automated Tumor Segmentation Software. Korean J Radiol 2017;18:498-509. [Crossref] [PubMed]
Bologna M, Corino VDA, Montin E, Messina A, Calareso G, Greco FG, Sdao S, Mainardi LT. Assessment of Stability and Discrimination Capacity of Radiomic Features on Apparent Diffusion Coefficient Images. J Digit Imaging 2018;31:879-94. [Crossref] [PubMed]
Saha A, Harowicz MR, Mazurowski MA. Breast cancer MRI radiomics: An overview of algorithmic features and impact of inter-reader variability in annotating tumors. Med Phys 2018;45:3076-85. [Crossref] [PubMed]
Duron L, Balvay D, Vande Perre S, Bouchouicha A, Savatovsky J, Sadik JC, Thomassin-Naggara I, Fournier L, Lecler A. Gray-level discretization impacts reproducible MRI radiomics texture features. PLoS One 2019;14:e0213459 [Crossref] [PubMed]
Koçak B. Reliability of 2D Magnetic Resonance Imaging Texture Analysis in Cerebral Gliomas: Influence of Slice Selection Bias on Reproducibility of Radiomic Features. Istanbul Med J 2019;20: [Crossref]
Lecler A, Duron L, Balvay D, Savatovsky J, Bergès O, Zmuda M, Farah E, Galatoire O, Bouchouicha A, Fournier LS. Combining Multiple Magnetic Resonance Imaging Sequences Provides Independent Reproducible Radiomics Features. Sci Rep 2019;9:2068. [Crossref] [PubMed]
Tixier F, Um H, Young RJ, Veeraraghavan H. Reliability of tumor segmentation in glioblastoma: Impact on the robustness of MRI-radiomic features. Med Phys 2019;46:3582-91. [Crossref] [PubMed]
Traverso A, Kazmierski M, Welch ML, Weiss J, Fiset S, Foltz WD, Gladwish A, Dekker A, Jaffray D, Wee L, Han K. Sensitivity of radiomic features to inter-observer variability and image pre-processing in Apparent Diffusion Coefficient (ADC) maps of cervix cancer patients. Radiother Oncol 2020;143:88-94. [Crossref] [PubMed]
Alis D, Yergin M, Asmakutlu O, Topel C, Karaarslan E. The influence of cardiac motion on radiomics features: radiomics features of non-enhanced CMR cine images greatly vary through the cardiac cycle. Eur Radiol 2021;31:2706-15. [Crossref] [PubMed]
Chen H, He Y, Zhao C, Zheng L, Pan N, Qiu J, Zhang Z, Niu X, Yuan Z. Reproducibility of radiomics features derived from intravoxel incoherent motion diffusion-weighted MRI of cervical cancer. Acta Radiol 2021;62:679-86. [Crossref] [PubMed]
Granzier RWY, Verbakel NMH, Ibrahim A, van Timmeren JE, van Nijnatten TJA, Leijenaar RTH, Lobbes MBI, Smidt ML, Woodruff HC. MRI-based radiomics in breast cancer: feature robustness with respect to inter-observer segmentation variability. Sci Rep 2020;10:14163. [Crossref] [PubMed]
Lin YC, Lin CH, Lu HY, Chiang HJ, Wang HK, Huang YT, Ng SH, Hong JH, Yen TC, Lai CH, Lin G. Deep learning for fully automated tumor segmentation and extraction of magnetic resonance radiomics features in cervical cancer. Eur Radiol 2020;30:1297-305. [Crossref] [PubMed]
Pati S, Verma R, Akbari H, Bilello M, Hill VB, Sako C, Correa R, Beig N, Venet L, Thakur S, Serai P, Ha SM, Blake GD, Shinohara RT, Tiwari P, Bakas S. Reproducibility analysis of multi-institutional paired expert annotations and radiomic features of the Ivy Glioblastoma Atlas Project (Ivy GAP) dataset. Med Phys 2020;47:6039-52. [Crossref] [PubMed]
Lu L, Lv W, Jiang J, Ma J, Feng Q, Rahmim A, Chen W. Robustness of Radiomic Features in [(11)C]Choline and [(18)F]FDG PET/CT Imaging of Nasopharyngeal Carcinoma: Impact of Segmentation and Discretization. Mol Imaging Biol 2016;18:935-45. [Crossref] [PubMed]
Bashir U, Azad G, Siddique MM, Dhillon S, Patel N, Bassett P, Landau D, Goh V, Cook G. The effects of segmentation algorithms on the measurement of (18)F-FDG PET texture parameters in non-small cell lung cancer. EJNMMI Res 2017;7:60. [Crossref] [PubMed]
Belli ML, Mori M, Broggi S, Cattaneo GM, Bettinardi V, Dell'Oca I, Fallanca F, Passoni P, Vanoli EG, Calandrino R, Di Muzio N, Picchio M, Fiorino C. Quantifying the robustness of [(18)F]FDG-PET/CT radiomic features with respect to tumor delineation in head and neck and pancreatic cancer patients. Phys Med 2018;49:105-11. [Crossref] [PubMed]
Yang P, Xu L, Cao Z, Wan Y, Xue Y, Jiang Y, Yen E, Luo C, Wang J, Rong Y, Niu T. Extracting and Selecting Robust Radiomic Features from PET/MR Images in Nasopharyngeal Carcinoma. Mol Imaging Biol 2020;22:1581-91. [Crossref] [PubMed]
Bogowicz M, Riesterer O, Bundschuh RA, Veit-Haibach P, Hüllner M, Studer G, Stieb S, Glatz S, Pruschy M, Guckenberger M, Tanadini-Lang S. Stability of radiomic features in CT perfusion maps. Phys Med Biol 2016;61:8736-49. [Crossref] [PubMed]
Ger RB, Zhou S, Chi PM, Lee HJ, Layman RR, Jones AK, Goff DL, Fuller CD, Howell RM, Li H, Stafford RJ, Court LE, Mackin DS. Comprehensive Investigation on Controlling for CT Imaging Variabilities in Radiomics Studies. Sci Rep 2018;8:13047. [Crossref] [PubMed]
Shafiq-Ul-Hassan M, Latifi K, Zhang G, Ullah G, Gillies R, Moros E. Voxel size and gray level normalization of CT radiomic features in lung cancer. Sci Rep 2018;8:10545. [Crossref] [PubMed]
Defeudis A, De Mattia C, Rizzetto F, Calderoni F, Mazzetti S, Torresin A, Vanzulli A, Regge D, Giannini V. Standardization of CT radiomics features for multi-center analysis: impact of software settings and parameters. Phys Med Biol 2020;65:195012 [Crossref] [PubMed]
Park BW, Kim JK, Heo C, Park KJ. Reliability of CT radiomic features reflecting tumour heterogeneity according to image quality and image processing parameters. Sci Rep 2020;10:3852. [Crossref] [PubMed]
Kim D, Wang N, Ravikumar V, Raghuram DR, Li J, Patel A, Wendt RE 3rd, Rao G, Rao A. Prediction of 1p/19q Codeletion in Diffuse Glioma Patients Using Pre-operative Multiparametric Magnetic Resonance Imaging. Front Comput Neurosci 2019;13:52. [Crossref] [PubMed]
Schwier M, van Griethuysen J, Vangel MG, Pieper S, Peled S, Tempany C, Aerts HJWL, Kikinis R, Fennessy FM, Fedorov A. Repeatability of Multiparametric Prostate MRI Radiomics Features. Sci Rep 2019;9:9441. [Crossref] [PubMed]
Fan M, Liu Z, Xu M, Wang S, Zeng T, Gao X, Li L. Generative adversarial network-based super-resolution of diffusion-weighted imaging: Application to tumour radiomics in breast cancer. NMR Biomed 2020;33:e4345 [Crossref] [PubMed]
Moradmand H, Aghamiri SMR, Ghaderi R. Impact of image preprocessing methods on reproducibility of radiomic features in multimodal magnetic resonance imaging in glioblastoma. J Appl Clin Med Phys 2020;21:179-90. [Crossref] [PubMed]
Branchini M, Zorz A, Zucchetta P, Bettinelli A, De Monte F, Cecchin D, Paiusco M. Impact of acquisition count statistics reduction and SUV discretization on PET radiomic features in pediatric 18F-FDG-PET/MRI examinations. Phys Med 2019;59:117-26. [Crossref] [PubMed]
Whybra P, Parkinson C, Foley K, Staffurth J, Spezi E. Assessing radiomic feature robustness to interpolation in (18)F-FDG PET imaging. Sci Rep 2019;9:9649. [Crossref] [PubMed]
Foy JJ, Robinson KR, Li H, Giger ML, Al-Hallaq H, Armato SG 3rd. Variation in algorithm implementation across radiomics software. J Med Imaging (Bellingham) 2018;5:044505 [Crossref] [PubMed]
Tixier F, Hatt M, Le Rest CC, Le Pogam A, Corcos L, Visvikis D. Reproducibility of tumor uptake heterogeneity characterization through textural feature analysis in 18F-FDG PET. J Nucl Med 2012;53:693-700. [Crossref] [PubMed]
Leijenaar RT, Nalbantov G, Carvalho S, van Elmpt WJ, Troost EG, Boellaard R, Aerts HJ, Gillies RJ, Lambin P. The effect of SUV discretization in quantitative FDG-PET Radiomics: the need for standardized methodology in tumor texture analysis. Sci Rep 2015;5:11075. [Crossref] [PubMed]
Bogowicz M, Leijenaar RTH, Tanadini-Lang S, Riesterer O, Pruschy M, Studer G, Unkelbach J, Guckenberger M, Konukoglu E, Lambin P. Post-radiochemotherapy PET radiomics in head and neck cancer - The influence of radiomics implementation on the reproducibility of local control tumor models. Radiother Oncol 2017;125:385-91. [Crossref] [PubMed]
Lv W, Yuan Q, Wang Q, Ma J, Jiang J, Yang W, Feng Q, Chen W, Rahmim A, Lu L. Robustness versus disease differentiation when varying parameter settings in radiomics features: application to nasopharyngeal PET/CT. Eur Radiol 2018;28:3245-54. [Crossref] [PubMed]
Raunig DL, McShane LM, Pennello G, Gatsonis C, Carson PL, Voyvodic JT, Wahl RL, Kurland BF, Schwarz AJ, Gönen M, Zahlmann G, Kondratovich MV, O'Donnell K, Petrick N, Cole PE, Garra B, Sullivan DC; QIBA Technical Performance Working Group. Quantitative imaging biomarkers: a review of statistical methods for technical performance assessment. Stat Methods Med Res 2015;24:27-67. [Crossref] [PubMed]
Berenguer R, Pastor-Juan MDR, Canales-Vázquez J, Castro-García M, Villas MV, Mansilla Legorburo F, Sabater S. Radiomics of CT Features May Be Nonreproducible and Redundant: Influence of CT Acquisition Parameters. Radiology 2018;288:407-15. [Crossref] [PubMed]
Mannil M, von Spiczak J, Hermanns T, Alkadhi H, Fankhauser CD. Prediction of successful shock wave lithotripsy with CT: a phantom study using texture analysis. Abdom Radiol (NY) 2018;43:1432-8. [Crossref] [PubMed]
Li Y, Tan G, Vangel M, Hall J, Cai W. Influence of feature calculating parameters on the reproducibility of CT radiomic features: a thoracic phantom study. Quant Imaging Med Surg 2020;10:1775-85. [Crossref] [PubMed]
Nardone V, Reginelli A, Guida C, Belfiore MP, Biondi M, Mormile M, Banci Buonamici F, Di Giorgio E, Spadafora M, Tini P, Grassi R, Pirtoli L, Correale P, Cappabianca S, Grassi R. Delta-radiomics increases multicentre reproducibility: a phantom study. Med Oncol 2020;37:38. [Crossref] [PubMed]
Song YS, Park CM, Lee SM, Park SJ, Cho HR, Choi SH, Lee JM, Kiefer B, Goo JM. Reproducibility of histogram and texture parameters derived from intravoxel incoherent motion diffusion-weighted MRI of FN13762 rat breast Carcinomas. Anticancer Res 2014;34:2135-44. [PubMed]
Baeßler B, Weiss K, Pinto Dos Santos D. Robustness and Reproducibility of Radiomics in Magnetic Resonance Imaging: A Phantom Study. Invest Radiol 2019;54:221-8. [Crossref] [PubMed]
Bologna M, Corino VDA, Mainardi LT. Assessment of the effect of intensity standardization on the reliability of T1-weighted MRI radiomic features: experiment on a virtual phantom. Annu Int Conf IEEE Eng Med Biol Soc 2019;2019:413-6. [Crossref] [PubMed]
Bologna M, Corino V, Mainardi L. Technical Note: Virtual phantom analyses for preprocessing evaluation and detection of a robust feature set for MRI-radiomics of the brain. Med Phys 2019;46:5116-23. [Crossref] [PubMed]
Cattell R, Chen S, Huang C. Robustness of radiomic features in magnetic resonance imaging: review and a phantom study. Vis Comput Ind Biomed Art 2019;2:19. [Crossref] [PubMed]
Bianchini L, Botta F, Origgi D, Rizzo S, Mariani M, Summers P, García-Polo P, Cremonesi M, Lascialfari A. PETER PHAN: An MRI phantom for the optimisation of radiomic studies of the female pelvis. Phys Med 2020;71:71-81. [Crossref] [PubMed]
Dreher C, Kuder TA, König F, Mlynarska-Bujny A, Tenconi C, Paech D, Schlemmer HP, Ladd ME, Bickelhaupt S. Radiomics in diffusion data: a test-retest, inter- and intra-reader DWI phantom study. Clin Radiol 2020;75:798.e13-22.
Eresen A, Yang J, Shangguan J, Benson AB, Yaghmai V, Zhang Z. Detection of Immunotherapeutic Response in a Transgenic Mouse Model of Pancreatic Ductal Adenocarcinoma Using Multiparametric MRI Radiomics: A Preliminary Investigation. Acad Radiol 2021;28:e147-54. [Crossref] [PubMed]
Rai R, Holloway LC, Brink C, Field M, Christiansen RL, Sun Y, Barton MB, Liney GP. Multicenter evaluation of MRI-based radiomic features: A phantom study. Med Phys 2020;47:3054-63. [Crossref] [PubMed]
Bianchini L, Santinha J, Loução N, Figueiredo M, Botta F, Origgi D, Cremonesi M, Cassano E, Papanikolaou N, Lascialfari A. A multicenter study on radiomic features from T2-weighted images of a customized MR pelvic phantom setting the basis for robust radiomic models in clinics. Magn Reson Med 2021;85:1713-26. [Crossref] [PubMed]
Gallivanone F, Interlenghi M, D'Ambrosio D, Trifiro G, Castiglioni I. Parameters Influencing PET Imaging Features: A Phantom Study with Irregular and Heterogeneous Synthetic Lesions. Contrast Media Mol Imaging 2018;2018:5324517 [Crossref] [PubMed]
Ger RB, Meier JG, Pahlka RB, Gay S, Mumme R, Fuller CD, Li H, Howell RM, Layman RR, Stafford RJ, Zhou S, Mawlawi O, Court LE. Effects of alterations in positron emission tomography imaging parameters on radiomics features. PLoS One 2019;14:e0221877 [Crossref] [PubMed]
Pfaehler E, Beukinga RJ, de Jong JR, Slart RHJA, Slump CH, Dierckx RAJO, Boellaard R. Repeatability of (18)F-FDG PET radiomic features: A phantom study to explore sensitivity to image reconstruction settings, noise, and delineation method. Med Phys 2019;46:665-78. [Crossref] [PubMed]
Pfaehler E, van Sluis J, Merema BBJ, van Ooijen P, Berendsen RCM, van Velden FHP, Boellaard R. Experimental Multicenter and Multivendor Evaluation of the Performance of PET Radiomic Features Using 3-Dimensionally Printed Phantom Inserts. J Nucl Med 2020;61:469-76. [Crossref] [PubMed]
Yang F, Simpson G, Young L, Ford J, Dogan N, Wang L. Impact of contouring variability on oncological PET radiomics features in the lung. Sci Rep 2020;10:369. [Crossref] [PubMed]
Qiao M, Li C, Suo S, Cheng F, Hua J, Xue D, Guo Y, Xu J, Wang Y. Breast DCE-MRI radiomics: a robust computer-aided system based on reproducible BI-RADS features across the influence of datasets bias and segmentation methods. Int J Comput Assist Radiol Surg 2020;15:921-30. [Crossref] [PubMed]
Qiao M, Li C, Suo S, Cheng F, Hua J, Xue D, Guo Y, Xu J, Wang Y. Diffusion and perfusion MRI radiomics obtained from deep learning segmentation provides reproducible and comparable diagnostic model to human in post-treatment glioblastoma. Eur Radiol 2020;15:921-30.
Kunimatsu A, Kunimatsu N, Kamiya K, Watadani T, Mori H, Abe O. Comparison between Glioblastoma and Primary Central Nervous System Lymphoma Using MR Image-based Texture Analysis. Magn Reson Med Sci 2018;17:50-7. [Crossref] [PubMed]
Sutton EJ, Huang EP, Drukker K, Burnside ES, Li H, Net JM, Rao A, Whitman GJ, Zuley M, Ganott M, Bonaccio E, Giger ML, Morris EA; TCGA group. Breast MRI radiomics: comparison of computer-and human-extracted imaging phenotypes. Eur Radiol Exp 2017;1:22. [Crossref] [PubMed]
Hinzpeter R, Wagner MW, Wurnig MC, Seifert B, Manka R, Alkadhi H. Texture analysis of acute myocardial infarction with CT: First experience study. PLoS One 2017;12:e0186876 [Crossref] [PubMed]
Zwanenburg A, Vallieres M, Abdalah MA, Aerts H, Andrearczyk V, Apte A, et al. The Image Biomarker Standardization Initiative: Standardized Quantitative Radiomics for High-Throughput Image-based Phenotyping. Radiology 2020;295:328-38. [Crossref] [PubMed]
Welch ML, McIntosh C, Haibe-Kains B, Milosevic MF, Wee L, Dekker A, Huang SH, Purdie TG, O'Sullivan B, Aerts HJWL, Jaffray DA. Vulnerabilities of radiomic signature development: The need for safeguards. Radiother Oncol 2019;130:2-9. [Crossref] [PubMed]
Summers RM. Are we at a crossroads or a plateau? Radiomics and machine learning in abdominal oncology imaging. Abdom Radiol (NY) 2019;44:1985-9. [Crossref] [PubMed]
Stanzione A, Gambardella M, Cuocolo R, Ponsiglione A, Romeo V, Imbriaco M. Prostate MRI radiomics: A systematic review and radiomic quality score assessment. Eur J Radiol 2020;129:109095 [Crossref] [PubMed]
Park JE, Kim D, Kim HS, Park SY, Kim JY, Cho SJ, Shin JH, Kim JH. Quality of science and reporting of radiomics in oncologic studies: room for improvement according to radiomics quality score and TRIPOD statement. Eur Radiol 2020;30:523-36. [Crossref] [PubMed]
Sanduleanu S, Woodruff HC, de Jong EEC, van Timmeren JE, Jochems A, Dubois L, Lambin P. Tracking tumor biology with radiomics: A systematic review utilizing a radiomics quality score. Radiother Oncol 2018;127:349-60. [Crossref] [PubMed]
Wakabayashi T, Ouhmich F, Gonzalez-Cabrera C, Felli E, Saviano A, Agnus V, Savadjiev P, Baumert TF, Pessaux P, Marescaux J, Gallix B. Radiomics in hepatocellular carcinoma: a quantitative review. Hepatol Int 2019;13:546-59. [Crossref] [PubMed]
Wang H, Zhou Y, Li L, Hou W, Ma X, Tian R. Current status and quality of radiomics studies in lymphoma: a systematic review. Eur Radiol 2020;30:6228-40. [Crossref] [PubMed]
Kottner J, Audige L, Brorson S, Donner A, Gajewski BJ, Hróbjartsson A, Roberts C, Shoukri M, Streiner DL. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. Int J Nurs Stud 2011;48:661-71. [Crossref] [PubMed]
Zou GY. Sample size formulas for estimating intraclass correlation coefficients with precision and assurance. Stat Med 2012;31:3972-81. [Crossref] [PubMed]
McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods 1996;1:30. [Crossref]
Moher D, Liberati A, Tetzlaff J, Altman DG; PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med 2009;6:e1000097 [Crossref] [PubMed]
van Timmeren JE, Leijenaar RTH, van Elmpt W, Wang J, Zhang Z, Dekker A, Lambin P. Test-Retest Data for Radiomics Feature Stability Analysis: Generalizable or Study-Specific? Tomography 2016;2:361-5. [Crossref] [PubMed]
Sullivan DC, Obuchowski NA, Kessler LG, Raunig DL, Gatsonis C, Huang EP, Kondratovich M, McShane LM, Reeves AP, Barboriak DP, Guimaraes AR, Wahl RL; RSNA-QIBA Metrology Working Group. Metrology Standards for Quantitative Imaging Biomarkers. Radiology 2015;277:813-25. [Crossref] [PubMed]
Zwanenburg A, Leger S, Vallières M, Löck S. Image biomarker standardisation initiative. Available online: https://arxiv.org/abs/1612.07003
Kopp-Schneider A, Hielscher T. How to evaluate agreement between quantitative measurements. Radiother Oncol 2019;141:321-6. [Crossref] [PubMed]
Gisev N, Bell JS, Chen TF. Interrater agreement and interrater reliability: key concepts, approaches, and applications. Res Social Adm Pharm 2013;9:330-8. [Crossref] [PubMed]
Zaki R, Bulgiba A, Ismail R, Ismail NA. Statistical methods used to test for agreement of medical instruments measuring continuous variables in method comparison studies: a systematic review. PLoS One 2012;7:e37908 [Crossref] [PubMed]
Bartlett JW, Frost C. Reliability, repeatability and reproducibility: analysis of measurement errors in continuous variables. Ultrasound Obstet Gynecol 2008;31:466-75. [Crossref] [PubMed]

Cite this article as: Xue C, Yuan J, Lo GG, Chang ATY, Poon DMC, Wong OL, Zhou Y, Chu WCW. Radiomics feature reliability assessed by intraclass correlation coefficient: a systematic review. Quant Imaging Med Surg 2021;11(10):4431-4460. doi: 10.21037/qims-21-86
