(Meta)Data Infrastructure for Trustworthy AI

Mar 30, 2022 | Newsletter #2

ProCAncer-I aspires to create the largest interoperable, high-quality multi-parametric Magnetic Resonance Imaging (mpMRI) dataset worldwide, comprising more than 11,000 retrospective and more than 6,000 prospective mpMRI examinations, including clinical data, for the study of prostate cancer (PCa). Based on these datasets, the project will develop advanced, trustworthy artificial intelligence models to address unmet clinical needs: diagnosis, metastases detection, and prediction of response to treatment.

However, trustworthy AI requires traceability and transparency in every phase of the model development process. In essence, traceability refers to the mandate to document the whole development process and to track the functioning of an AI model or an AI-based system used to support analysis and interpretation. To this end, a metadata catalogue is established within the ProCAncer-I project to enable both data and model transparency. As the outcomes of AI/ML systems depend directly on the data used for training, transparency in data collection, utilisation, and storage is an area of significant concern. In addition, given the rising complexity of modelling, end-to-end tracking of provenance information across the machine learning lifecycle and the evaluation of models for performance and trust are crucial. The metadata catalogue is therefore used to store appropriate metadata for the available data, for the curation process followed to transform and clean the data, and for the development of the AI models and tools together with their evaluation metrics.
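
As a rough illustration of the idea, the sketch below shows how a catalogue could link a model back to the datasets it was trained on and to the curation steps those datasets went through. It is not the ProCAncer-I implementation: the class names (`CurationStep`, `DatasetRecord`, `ModelRecord`, `MetadataCatalogue`), the identifiers, and the metric value are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CurationStep:
    """One transformation or cleaning step applied to a dataset."""
    name: str
    description: str


@dataclass
class DatasetRecord:
    """Catalogue entry for a dataset and the curation it went through."""
    dataset_id: str
    description: str
    curation_steps: List[CurationStep] = field(default_factory=list)


@dataclass
class ModelRecord:
    """Catalogue entry for a model, its training data, and its metrics."""
    model_id: str
    trained_on: List[str]  # dataset_id values registered in the catalogue
    metrics: Dict[str, float] = field(default_factory=dict)


class MetadataCatalogue:
    """Hypothetical in-memory catalogue, for illustration only."""

    def __init__(self) -> None:
        self.datasets: Dict[str, DatasetRecord] = {}
        self.models: Dict[str, ModelRecord] = {}

    def register_dataset(self, record: DatasetRecord) -> None:
        self.datasets[record.dataset_id] = record

    def register_model(self, record: ModelRecord) -> None:
        # Refuse models whose training data is not documented.
        missing = [d for d in record.trained_on if d not in self.datasets]
        if missing:
            raise KeyError(f"unknown dataset(s): {missing}")
        self.models[record.model_id] = record

    def trace(self, model_id: str) -> Dict[str, List[str]]:
        """Map each training dataset of a model to its curation step names."""
        model = self.models[model_id]
        return {
            dataset_id: [s.name for s in self.datasets[dataset_id].curation_steps]
            for dataset_id in model.trained_on
        }


# Example usage with made-up identifiers and a placeholder metric value.
catalogue = MetadataCatalogue()
catalogue.register_dataset(
    DatasetRecord(
        dataset_id="mpmri-retrospective",
        description="placeholder dataset entry",
        curation_steps=[
            CurationStep("anonymisation", "remove identifying information"),
            CurationStep("quality-check", "discard unusable series"),
        ],
    )
)
catalogue.register_model(
    ModelRecord(
        model_id="pca-detector-v0",
        trained_on=["mpmri-retrospective"],
        metrics={"auc": 0.0},  # placeholder, not a project result
    )
)
lineage = catalogue.trace("pca-detector-v0")
```

Rejecting a model whose training datasets are not registered is one simple way to keep the provenance chain complete; a real catalogue would also need versioning and persistent storage.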

Further, essential to building those models is a common data model that will be used for data storage and retrieval. ProCAncer-I adopts the OMOP-CDM, one of the most widely used common data models for the analysis of observational health data, to support the generation of reliable scientific evidence about disease history, the effects of medical interventions, and health care interventions and outcomes. Besides the standard CDM, OMOP-CDM extensions are used, such as the Oncology CDM extension, which represents cancer data at the levels of granularity and abstraction required to support cancer research. Although radiology exams can currently be registered using the OMOP-CDM, the model does not enable storage of the subsequent curation process. ProCAncer-I therefore aspires to introduce a radiology extension and is currently working on it in collaboration with the OHDSI Medical Imaging Working Group, focusing on the annotation, segmentation, curation, and radiomics feature data that also need to be stored.
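
To make the gap more concrete, the following sketch shows one way curation information could hang off an exam record in a relational model. It assumes, for illustration only, that an exam is registered as a row in the standard OMOP `procedure_occurrence` table (reduced here to a few of its columns). The `imaging_curation` table and all of its columns are hypothetical and do not represent the extension under development with the OHDSI Medical Imaging Working Group; the inserted rows are made-up placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Subset of the standard OMOP procedure_occurrence table.
    CREATE TABLE procedure_occurrence (
        procedure_occurrence_id INTEGER PRIMARY KEY,
        person_id               INTEGER NOT NULL,
        procedure_concept_id    INTEGER NOT NULL,
        procedure_date          TEXT    NOT NULL
    );

    -- Hypothetical extension table: what was done to an exam after acquisition.
    CREATE TABLE imaging_curation (
        imaging_curation_id     INTEGER PRIMARY KEY,
        procedure_occurrence_id INTEGER NOT NULL
            REFERENCES procedure_occurrence (procedure_occurrence_id),
        curation_type           TEXT NOT NULL
            CHECK (curation_type IN ('annotation', 'segmentation', 'curation')),
        description             TEXT
    );
    """
)

# Placeholder exam; concept id 0 stands in for an unmapped concept.
conn.execute(
    "INSERT INTO procedure_occurrence VALUES (?, ?, ?, ?)",
    (1, 100, 0, "2022-01-01"),
)
conn.executemany(
    "INSERT INTO imaging_curation VALUES (?, ?, ?, ?)",
    [
        (1, 1, "annotation", "placeholder lesion annotation"),
        (2, 1, "segmentation", "placeholder gland segmentation"),
    ],
)

# Retrieve every curation step recorded for each person's exams.
rows = conn.execute(
    """
    SELECT p.person_id, c.curation_type
    FROM procedure_occurrence AS p
    JOIN imaging_curation AS c
      ON c.procedure_occurrence_id = p.procedure_occurrence_id
    ORDER BY c.imaging_curation_id
    """
).fetchall()
```

Keeping the curation records in a separate table keyed to the exam leaves the standard CDM tables untouched, which is the usual motivation for modelling such additions as an extension.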