Application of Privacy Preserving Federated Learning in Biomedical Applications – Lessons Learned from the PALISADE-X project

Contributors
Ravi Madduri
Institution/ Affiliation
Data Science and Learning Division
Argonne National Laboratory
Presentation Details (date, conference, etc.)

PALISADE-X resources, APPFL — APPFL documentation

 

The presentation below, given on November 16, 2022, describes the project funded in response to this DOE-NIH Bridge2AI collaboration:

See Bridge2AI wiki pages

PALISADE-X presentation

Responses from the DOE PALISADE-X project on Privacy Preserving Federated Learning and its application in achieving Bridge2AI program goals

1. Can you further describe/define the data shift problem? You talked about AI models not performing well when training on datasets in different locations. Is it because the curation processes are not harmonized between datasets?
The issue is fundamentally about differences between the distribution of the data used for training and the distribution of data encountered in the real world. It has less to do with harmonization processes and more to do with a lack of variety in the datasets used for training. Harmonization may help make datasets from different, disparate sources available for training. Privacy-preserving federated learning, coupled with harmonization, is a potential solution to this challenge. With this method, we send models to where the data is located (ideally already harmonized to a common format), train locally at each site, and aggregate the resulting model updates rather than the data. Another method we are working on is to generate synthetic data using GANs trained on the data at each location. We are currently studying how model performance differs when training on real data versus synthetic data generated by a GAN trained on that real data, especially in biomedicine.
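As a rough illustration of the "send the model to the data" idea in the answer above, the following Python sketch runs federated averaging (FedAvg) over three simulated sites. Everything in it is an assumption made for illustration: the linear model, the synthetic site data, and the function names (local_update, federated_average, make_site, run_federated_training) are hypothetical and are not taken from PALISADE-X or the APPFL API. No privacy mechanism is applied here; the sketch only shows that raw data stays at each site while model parameters are exchanged and aggregated.

```python
import random


def local_update(weights, data, lr=0.1, epochs=5):
    """Run a few gradient-descent steps on one site's private data.

    Hypothetical model: y = w * x + b with mean squared error loss.
    Only the updated (w, b) leaves the site, never the data itself.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(updates, sizes):
    """Aggregate site models, weighting each by its number of samples."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)


def make_site(rng, n, x_low, x_high):
    """Synthetic site data: same underlying relation, different x range.

    The differing x ranges mimic sites whose populations differ.
    """
    xs = [rng.uniform(x_low, x_high) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.01)) for x in xs]


def run_federated_training(rounds=200, seed=0):
    rng = random.Random(seed)
    sites = [
        make_site(rng, 40, 0.0, 1.0),
        make_site(rng, 60, 1.0, 2.0),
        make_site(rng, 20, -1.0, 0.0),
    ]
    sizes = [len(site) for site in sites]
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        # Each site starts from the current global model and trains locally.
        updates = [local_update(global_weights, site) for site in sites]
        # The server sees only model parameters, not the sites' records.
        global_weights = federated_average(updates, sizes)
    return global_weights


if __name__ == "__main__":
    w, b = run_federated_training()
    print(f"learned w={w:.3f}, b={b:.3f} (data generated with w=2, b=1)")
```

Each simulated site covers a different range of x, so no single site sees the full variety of the data, yet the aggregated model is fit against all of them. A real deployment would add secure transport, authentication, and a privacy mechanism on the exchanged updates.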


2. The purpose of Bridge2AI is to have harmonized, ethically-sourced, diverse data prepared, and stored in different locations. Would this present a huge problem with regard to data shift? Would it not be possible for future AI/ML modelers to train on Bridge2AI datasets?
I think that if the data is harmonized and stored in different locations, you will need a federated learning approach to generate insights from it. The Bridge2AI approach solves an important problem by standardizing the pre-processing and harmonization of biomedical data, so that it is easier to build models that are generalizable, fair, and trustworthy. Non-standard data preprocessing practices are another big hurdle in AI and generally hurt the reproducibility and reusability of AI models. I think Bridge2AI can provide standard datasets akin to ImageNet, which helped bootstrap computer vision.


3. Would your Differential Privacy Algorithms help correct the problems in #1 and #2 above? Would your algorithms be able to harmonize the various ethical issues from the disparate Bridge2AI grand challenges such that privacy is preserved in a manner that is “customized” to the associated ethical contexts of each grand challenge?
The differential privacy algorithms help preserve the privacy of sensitive data in the context of federated learning. We are working on harmonizing privacy requirements across different datasets by developing recommended privacy budgets. This could be a good area for collaboration.
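To make the notion of a privacy budget concrete, here is a minimal Python sketch of the textbook Laplace mechanism applied to a mean. It is a generic illustration and is not claimed to be the algorithm used in PALISADE-X or APPFL. The function names (laplace_noise, private_mean), the clipping bounds, and the example values are hypothetical. It also assumes the number of records is public, so that changing one record moves the clipped mean by at most (upper - lower) / n. A smaller budget epsilon means more noise and stronger privacy.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Release the mean of `values` under an epsilon privacy budget.

    Values are clipped to [lower, upper] so that a single record can
    change the mean by at most (upper - lower) / n (the sensitivity).
    """
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical measurements held at one site.
    values = [62.0, 71.5, 58.0, 80.0, 66.5, 74.0, 69.0, 77.5]
    for epsilon in (0.1, 1.0, 10.0):
        released = private_mean(values, 40.0, 120.0, epsilon, rng)
        print(f"epsilon={epsilon}: released mean = {released:.2f}")
```

A recommended privacy budget would then amount to agreeing, per dataset, on values such as epsilon and the clipping bounds, trading accuracy of the released result against the strength of the privacy guarantee.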


4. You mentioned that only users who are authenticated can use your network, is this correct? What is the process and decisions used to authenticate a user? Can anyone in the scientific research community be authenticated?
We have used a standard method for identity and access management (IAM) to authenticate users, which will make it easy to extend to NIH identity providers (such as eRA Commons). A Bridge2AI user could therefore easily log in and authenticate to Bridge2AI PPFL (privacy-preserving federated learning) as a service using their NIH identity.


5. As you know Bridge2AI is about open scientific discovery, which works against locked, pristinely preserved, private data. How would your algorithms help promote open scientific discovery while preserving privacy?
The PPFL approach helps create generalizable AI models while preserving the privacy of the training data, without the need to aggregate all the data centrally. This helps unlock the full potential of applying AI to biomedical data without having to collect sensitive data in a central location.