Standards Q&A


 

Q: Do we need data standards (or do they have to be FAIR-enabled) to extract the data for AI?

A: Yes, for some fields it makes sense to have standards: they help researchers who try to replicate results and rule out potential errors. That said, standards are not strictly necessary in every circumstance.

The primary utility of data standards is that they ease the extension and scaling of downstream analysis. Consider the simple problem of analyzing subject ages across several thousand independent data providers. Before we can do any analysis (or even a simple plot), we must first ensure that the data is in a consistent format. The problem is that different data providers might store subject ages explicitly (in years) or implicitly (as dates of birth). If they store the age explicitly, it could be an integer, a floating point number, or a character string. If the age is stored as a string, it could follow a variety of assumed formats: ‘29’, ‘twenty-nine’, ‘one hundred’, etc. Dates of birth could also take a variety of forms, including a datetime, a string, or even a set of three integer columns that denote birth year, month, and day respectively. If the birth dates are strings, they could be stored in a variety of formats: ‘YYYY/MM/DD’, ‘DD/MM/YY’, ‘MM-DD-YY’, ‘YYYYMMDD’, etc. Bringing all these formats together (even for a simple variable such as age) requires quite a bit of work when a standard is not in place.
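To make that cost concrete, here is a minimal sketch in Python of the normalization code an analyst would have to write in the absence of a standard. The accepted formats and the word lookup below are illustrative assumptions, not part of any real standard:

    from datetime import date, datetime

    # Toy lookup for spelled-out ages; a real system would need a full parser.
    WORDS = {"twenty-nine": 29, "one hundred": 100}

    def normalize_age(value, today=date(2024, 1, 1)):
        """Coerce one provider's age field into an integer age in years."""
        if isinstance(value, int):
            return value                          # explicit integer age
        if isinstance(value, float):
            return int(value)                     # explicit floating-point age
        if isinstance(value, (date, datetime)):   # implicit: a date of birth
            dob = value.date() if isinstance(value, datetime) else value
            had_birthday = (today.month, today.day) >= (dob.month, dob.day)
            return today.year - dob.year - (0 if had_birthday else 1)
        if isinstance(value, str):
            s = value.strip()
            if s.isdigit() and len(s) <= 3:       # '29', '100'
                return int(s)
            if s.lower() in WORDS:                # 'twenty-nine'
                return WORDS[s.lower()]
            # Assumed birth-date string formats from the paragraph above.
            for fmt in ("%Y/%m/%d", "%d/%m/%y", "%m-%d-%y", "%Y%m%d"):
                try:
                    return normalize_age(datetime.strptime(s, fmt), today)
                except ValueError:
                    pass
        raise ValueError(f"Unrecognized age encoding: {value!r}")

Even a sketch like this cannot resolve genuinely ambiguous records (is ‘04/05/12’ DD/MM/YY or MM-DD-YY?); a standard removes the ambiguity at the source rather than at analysis time.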

 

Q: Do the current commercial and hospital platforms have a standard for common measures such as age? 

A: I’m not sure whether there is a formal standard. When dealing with hospital platforms, data will be pulled from Electronic Health Record (EHR) systems. There are hundreds of EHR vendors, so extracting and consolidating information (including age) across all of them might be challenging. Importantly, the four largest EHR vendors (Epic, Cerner, Meditech, and CPSI) cover ~75% of the market, so achieving interoperability among this subset would solve most of the problem with only a fraction of the effort. The extent to which records from these systems are interoperable today is questionable, but a standard might emerge as a natural consequence of an interoperability effort or mandate [see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2871223/]. Most often, data standardization involves consolidating the ~20% of formats that cover 80% of the data (by volume).

 

Q: How do the standards in life science domains compare to some of the AI industry standards for text, imaging data, etc.?

A: Even though there has been an incredible amount of progress in the text and image processing communities, there is no de facto standard for how raw data is formatted and shared. Note that communities which collaborate (or compete) on benchmark datasets tend to organically develop more uniform data standards than those that work in (relative) isolation. I imagine that this is also true of the life science domain.

 

Q: What are some of the examples of industry standards supported by the popular machine learning/deep learning platforms?

A:   

 

Q: How do we make a mapping from the current life science data standards to the existing ML platforms?

A: “Toolkits” that can perform these types of mappings are considered among the most important components of a data standards effort. For instance, there are very good tools for converting medical imaging in the DICOM format into the array and tensor formats that PyTorch or TensorFlow expect; pydicom is one such library: https://pydicom.github.io/pydicom/dev/tutorials/index.html.
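As a minimal sketch of that DICOM-to-tensor path (the file name below is hypothetical, and a real pipeline would also handle rescale slope/intercept, windowing, and multi-slice series):

    import numpy as np
    import pydicom  # the library linked above
    import torch

    # Read one DICOM file and decode its pixel data (assumes an uncompressed
    # transfer syntax; compressed files need an extra pixel-data handler).
    ds = pydicom.dcmread("scan.dcm")
    pixels = ds.pixel_array.astype(np.float32)

    # Hand the NumPy array to PyTorch and add a channel dimension: (1, H, W).
    tensor = torch.from_numpy(pixels).unsqueeze(0)
    print(tensor.shape, tensor.dtype)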

More generally, genomics data can mostly be converted to follow the ML community’s data conventions for text; medical imaging data, its conventions for images; and medical time series data, its conventions for audio and other time series.

 

Q: How do we move away from de facto standards and popularity contests that may not be the most appropriate for the context of use?

A: That’s a great question with multiple possible answers! One option is to support data curators and the developers of data curation methods, tools, standards, techniques, etc. Another option is to provide incentives (something like a tax credit?) to entities that structure, document, and distribute their data so that it is easy to use.

 

Q: What does it take for whole communities to use the same data attributes – a seminal publication with benchmarked standards?

A: There are at least four factors that influence the uptake of data, and associated data standards, by a community:

  1. Completeness: Does the data contain the necessary attributes and sufficient samples to answer key questions of interest to the community?
  2. Reliability: Is the data accurate, free of errors, etc.?
  3. Accessibility: Is the data easy to use and access, well documented, etc.?
  4. Awareness: Are people aware that the data exists? A seminal publication can help with this.

If there isn’t a mature data set or data standard along these four axes, then an entity with a large presence usually sets the trends.

 

Q: The Standards Warehouse file provides detailed information on existing standards. Can a DATA scholar easily identify the standards from the file for the types of data that s/he uses in research?

A: Using the example of EEG, ECG, and other biomedical signal data, one can locate “NEMO” and “ECG” through a document search (Ctrl+F). The modality information (e.g., EEG or EKG) may be further specified in the future (if applicable).

 

Q: How can the standards information NIH prepared help a researcher to acquire new data?

A: Knowledge of guidelines, terminologies, and data formats will help a researcher format data (and metadata) before sharing it with other investigators in the research community. For example, many researchers in the NLP community are fond of the JSONL format for sharing their text data.
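As a minimal illustration of that convention, JSONL stores one JSON object per line; the record fields below are made up for the example, not a prescribed schema:

    import json

    records = [
        {"id": 1, "text": "Patient reports mild headache.", "label": "symptom"},
        {"id": 2, "text": "No known drug allergies.", "label": "history"},
    ]

    # Write one JSON object per line ('corpus.jsonl' is a hypothetical path).
    with open("corpus.jsonl", "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

    # Each line parses independently, so files can be streamed or split freely,
    # which is what makes the format convenient for large text corpora.
    with open("corpus.jsonl") as f:
        loaded = [json.loads(line) for line in f]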
