WG-10 - Discussion on Standards and Sharing

04Aug08: Steve Stevenson (steve@cs.clemson.edu)

A Tan Paper on Standards and Sharing [A “tan” paper is a not-so-clean white paper :-)] 

Analysis: It seems to me that these two issues need to be de-convolved.

Standards are useful when ideas are not changing rapidly – the standards process is 'way too slow to react to the research environment. From the supercomputing side, here are two examples:

1. LAPACK. The supercomputing community has synchronized on LAPACK, the successor to LINPACK. The motivating idea is that LAPACK is centrally maintained and uniformly extended.

2. Language Standards. Language standards always lag usage. For example, by the time Fortran really had support for vectors, the community had moved on to distributed computing.

Sharing, on the other hand, requires agreement. That agreement does not have to have the force of standards, only that you and I agree as to how I should interpret what you're telling me. The Semantic Web folks want this to happen through ontologies and meta-models. Those still take agreement. It is worth considering the three issues in front of us:

1. Machine-Machine Interchange. Remember ASCII? And the ASCII/EBCDIC wars? That was over flat files; database systems have had the same sort of history (CODASYL versus IMS versus SQL). Databases arose from the realization that data is structured. The ?ML world allows us to transmit structured data in a flat file. What ?ML's don't do is send meaning!

2. Human-Machine Interchange. I think (hope) that the majority of HCI happens through more humane processes than typing pseudo-mathematics. But this form of input is also dangerous in that the transformations become obscure just when we need them to be transparent. On the output side, the graphical/movie outputs seem to be the most useful. However, we again have an issue of correctness.

3. Human-Human Interchange. HHI would seem to be the most stable. But we have all sat in meetings wherein the bone of contention is a “standard definition”.

HHI is ameliorated by education. However, this is not so easy because of the continuum of knowledge required. Using Bob Panoff's model, there are three issues, denoted A³:

1. Application. There is domain knowledge that must be shared, especially in the interdisciplinary world. Sharing among fellow disciplinarians is easy (well, relatively). But my background in biochemistry, say, is 'way out of date – you would have to educate me.

2. Algorithms. There are two parts to algorithms: the mathematics independent of the architecture, and the versions modified to take advantage of the architecture.

3. Architectures. Architecture also has two parts: hardware and software.

But this is 'way too simple: the decision involves at least 3A + (3 choose 2)A perspectives: the three listed above, plus combinations like “applrithms”, “algotectures” and “appltectures”. [I proposed those terms long ago; they never caught on :-)] Not every combination of A's is going to work well on every problem.
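For concreteness, the count under that reading: the three single-A perspectives plus the (3 choose 2) = 3 pairwise blends give at least 3 + 3 = 6 distinct viewpoints, with a seventh if all three A's are taken together.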

Suggestions

1. Consider standardizing those things that can be standardized. Sync on LAPACK, EISPACK, ...?

2. One possible approach is to copy the ideas of the computer algebra implementers. They use a standard “kernel” - the interface is separate and therefore more flexible (see the first sketch after this list). The vast majority of researchers need modeling support and really don't care what's under the hood. A great place for Open Source. This also dovetails with Dan Cook's ideas.

3. Consider education issues. The question isn't so much the standard education of researchers (experts); it's a question of what the undergrads, grads, and PhD students should learn. In my experience, the students tend to lead the faculty, but osmosis probably isn't enough. How do we get interdisciplinary education moving and get those programmers out of their little holes? Same with math and engineering. Too many people living in stovepipes.

4. One possible approach, coupled with the “kernel” idea, is to explore ways in which the data storage (?ML's) can be manipulated. For example, I might come up with a new version of a standard (like CellML). In that case, I should be required to provide a translation (an injection) of the old format into the new, and the specs (if not the code) for the kernels to read the new format (see the second sketch after this list).

5. We teach and learn by way of models and modeling. Modeling is all about problem solving in the discipline, and it is a subject unto itself. I think the use of the term “modeling and simulation” should be banned. If the simulation is the model, we're in a lot of trouble when it comes to understanding what the model actually is.

6. Modeling is about knowledge applied to a problem. While GUI input systems are a big help, they eventually break down. Developing a new GUI should take months, not years. The learning curve can't be ignored – moving from an older system to a newer one may well be harder than learning only the newer one. We need to be able to build new GUIs faster.
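To make the “kernel” idea in suggestion 2 concrete, here is a minimal sketch (in Python, with entirely made-up names, not any existing tool's API) of a small, stable compute kernel sitting behind a thin, replaceable front end:

```python
# Illustrative sketch only: a "kernel" with a small, stable API and a thin,
# replaceable front end. All names here are hypothetical.

def simulate(rhs, y0, t0, t1, dt):
    """Kernel: integrate dy/dt = rhs(t, y) with fixed-step Euler."""
    t, y = t0, list(y0)
    trajectory = [(t, tuple(y))]
    while t < t1:
        dy = rhs(t, y)
        y = [yi + dt * di for yi, di in zip(y, dy)]
        t += dt
        trajectory.append((t, tuple(y)))
    return trajectory

def command_line_front_end():
    """Front end: owns presentation only; the kernel never changes for its sake."""
    decay = lambda t, y: [-0.5 * y[0]]        # a toy model: dA/dt = -k*A
    for t, (a,) in simulate(decay, [1.0], 0.0, 5.0, 0.5):
        print(f"t={t:4.1f}  A={a:.4f}")

if __name__ == "__main__":
    command_line_front_end()
```

The point of the separation is that a new GUI, script, or web interface only ever talks to `simulate`; the numerics stay in one centrally maintained place.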
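And a minimal sketch of the translation requirement in suggestion 4, again with hypothetical format names rather than real CellML versions: each new version of a format registers an injection from the previous one, so a kernel only ever has to read the newest version:

```python
# Hypothetical format names and fields throughout; the point is the mechanism,
# not the formats themselves.

TRANSLATORS = {}          # (old_version, new_version) -> translation function

def translator(old, new):
    def register(fn):
        TRANSLATORS[(old, new)] = fn
        return fn
    return register

@translator("toyML-1.0", "toyML-1.1")
def v10_to_v11(doc):
    # Injection of the old format into the new: every 1.0 document has a
    # well-defined 1.1 image; nothing is lost, new fields get defaults.
    doc = dict(doc, version="toyML-1.1")
    doc.setdefault("units", "dimensionless")
    return doc

def upgrade(doc, target):
    """Chain registered translators until the document reaches `target`."""
    while doc["version"] != target:
        step = next(((o, n) for (o, n) in TRANSLATORS if o == doc["version"]), None)
        if step is None:
            raise ValueError(f"no translation path from {doc['version']}")
        doc = TRANSLATORS[step](doc)
    return doc

old_model = {"version": "toyML-1.0", "equations": ["dA/dt = -k*A"]}
print(upgrade(old_model, "toyML-1.1"))
```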

One possible home might be the NSF Cyberinfrastructure program with its desire to build virtual organizations (VO). It seems to me that the standards issue is perfect for VO's.


01Aug08: Daniel L. Cook (dcook@u.washington.edu) introduces himself and colleagues to MSM working group

Dear group,

This is Dan Cook, colleague of JBB in Seattle, and a former collaborator with several of you on the (now defunct) DARPA Virtual Soldier Project. From the ashes of the VSP emerged a collaboration between myself (a biophysicist and modeler) and a team of bioinformaticists/ontologists led by my colleague, John Gennari. I will be presenting our work at the upcoming MSM MiniSymposium (slides are on the Wiki for the session), but JBB invited me to review and offer comments on the discussion on this Wiki. It is clear from our own experience and the testimony on this Wiki that there are considerable challenges to creating the social environment, institutional incentives, technical platforms and standards by which models (and modules) can be reused and integrated into multiscale models. I am looking forward to discussions of these topics.

For our part we are focusing on the “Group 2” issues in JBB’s “Checklist of model standards”: “Model Structure and Content”. As we read and discuss the Wiki contents we come across appeals for “automation”, “modules”, “automated construction of models from component modules archived in standard form”, etc. These are precisely the goals we have been working toward, with some success to date -- but with a long way to go. As I will discuss at the SIAM meeting, we are taking a distinctly informatics approach in which we re-represent models as ontologies, rather than merely using ontologies as structured vocabularies according to which model elements may be annotated.

We believe that to facilitate the creation and reuse of modules, we need to capture *what* is being modeled as well as the mathematics being used. We do this via ontologies -- by creating a declarative knowledge structure (i.e., an ontology) of the specific model, and by referring elements of this knowledge structure to reference ontologies that capture shared, fundamental truths about biology and physics. Thus, a “SemSim” (semantic simulation) model captures both: 1) the biological structure and biophysics of the biological system being modeled, and 2) the structure of the model’s mathematics. These structured representations are machine-readable, machine-searchable and sufficiently structured to allow (semi?)automated code generation in any of several languages. We envision that such SemSim ontological models will have the advantages attributed to Java -- “curate once, run anywhere”. The SemSim approach is based on a simple biostructural/biophysical schema that is multiscale in terms of structure (i.e., molecules...bodies) and multidomain in terms of system dynamics (fluid flow, chemical kinetics, electrophysiology...). Furthermore, it lends itself readily to defining and extracting modules from existing models and reassembling them into integrated new models for different problems and purposes. As our approach is heavily driven by use-case challenges, I am looking forward to broad discussions in Montreal on what the real-world requirements will be for our success in this endeavor.
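To make the idea concrete (this is only an illustrative sketch, not the SemSim format itself; the ontology terms shown are placeholders, not checked accessions), a model element would carry both its biological/physical reference and its mathematics:

```python
# Illustrative only: a model variable that records *what* is modeled as well as
# the math used. The reference-term identifiers are placeholders.

from dataclasses import dataclass

@dataclass
class ModelVariable:
    name: str                 # symbol used in the equations
    physical_property: str    # what kind of quantity this is (placeholder term)
    physical_entity: str      # what biological thing it belongs to (placeholder term)
    units: str
    equation: str             # the mathematics, kept separate from the biology

ca_i = ModelVariable(
    name="Ca_i",
    physical_property="opb:concentration_of_chemical",    # placeholder reference term
    physical_entity="fma:cytosol_of_cardiac_myocyte",      # placeholder reference term
    units="micromolar",
    equation="d(Ca_i)/dt = J_release - J_uptake",
)

# A tool that understands the reference terms can search, compare, or merge
# models by meaning, not just by variable name.
print(ca_i.physical_entity, "->", ca_i.equation)
```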

Daniel L. Cook, MD, PhD

Research Professor Depts. of Physiology/Biophysics School of Medicine University of Washington Seattle, WA 98195 dcook@u.washington.edu UW office: 206 543-7118 Home office: 206 297-7794


29jul08: James Lawson (j.lawson@auckland.ac.nz) writes to JBB on plans for CellML:

Dear group,

I thought I'd give comment on some of the ideas that I have been considering for implementation in the next generation of the CellML Model Repository, and what implications these ideas might have for model sharing.

Models are currently considered to be primarily static entities. Someone begins to develop a model, and over time it passes through a series of iterations which hone it to better produce the desired output. The author then decides at some point that the model is worthy of release via the traditional publication media, and takes a 'snapshot' of the model, with its equations and attendant parameter sets. As we are well aware, most models have their warts; the official way to rectify a known error or omission in a publication that describes a model is to request that errata be published.

If these models are to be used as building blocks in large, multi-scale constructions, then any fundamental issues with them must be ironed out.

This is often not an absolute process - in the CellML repository we often iterate through numerous versions of a model during the curation process, fixing errors as we come across them and as we receive new information. These fixes take a variety of forms and may in fact alter the model substantially; the model in effect takes on a life of its own, post-publication. This suggests to me that the model as published may not necessarily represent the definitive work.

The author may wish to make adjustments and tweaks to the model that improve it but do not warrant a further publication. Alternatively, users of a model may develop it further. Therefore, to fit our requirements, models may have to be treated as dynamic entities rather than static ones. If a community that is mutually interested in a model's use, development and curation grows up around it then, to use a software development paradigm, we are likely to see many revisions of the model, and many variations as a result of branching. It should be the role of a repository to provide a 'place' where such a community can congregate around a piece of code, and to archive discussions and revision histories pertaining to the model.

I am currently considering workflows and frameworks that we can implement in the CellML Model Repository to support the extensive collaboration that we are going to need to leverage in order to build up complex, multiscale networks of models. The 'user-generated content' paradigm is quite pervasive on the internet, and has been for some time now. If we can work out how to moderate input to the repository, we will be able to collate a large amount of information and expertise on individual models and sets of models that will vastly improve their reusability. This potential extends far beyond that of models which are simply presented once in a traditional publication.

One issue I see with this approach is that to build the large conglomerate networks of models that we will need to comprehensively describe any multiscale biological system, researchers will need to be able to treat subsystems as black boxes. However, if foundational models cannot be relied upon to be the same from one day to the next, this presents a serious dependency problem. Thus, at some point, models will have to be frozen in order for them to be reused. However, I see the investment in freezing a model in this manner as much smaller than the current requirement that a paper be written about it. If systems such as regular, automated unit-testing of models are implemented (that is, checking that the system works, part by part, so that any changes that break other parts of the model are identified), this problem should be manageable.
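A minimal sketch of what such automated regression testing of a frozen model release could look like, with a toy model and made-up reference values:

```python
# Illustrative only: re-run the model and compare against reference output
# archived with the frozen release, within a tolerance. Model, reference
# values, and tolerance are all invented for this sketch.

import math

def frozen_model(t, k=0.5, a0=1.0):
    """Toy model: exponential decay A(t) = A0 * exp(-k*t)."""
    return a0 * math.exp(-k * t)

REFERENCE = {0.0: 1.0, 1.0: 0.6065306597, 2.0: 0.3678794412}   # archived with the release
TOLERANCE = 1e-6

def test_frozen_model_unchanged():
    for t, expected in REFERENCE.items():
        assert abs(frozen_model(t) - expected) < TOLERANCE, f"model drifted at t={t}"

if __name__ == "__main__":
    test_frozen_model_unchanged()
    print("frozen model reproduces its archived reference output")
```

Any later edit that changes the model's behaviour fails the test, which is exactly the signal a dependent supermodel needs.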

Kind regards, James Lawson


27 jul08: Steve Stevenson on Peter Hunter's comments.

I think that Peter's comments have been very good. And his observation about wheel reinvention is right, I believe. At least in computer science, we are in the "new computer system" business that rewards "new and unused" over "old and works" :-)

At the risk of being philosophical :-) ...

Just what is a model in this context? At least in principle we should be able to store all the information to reproduce an instance of the model for any simulation system. The various balances and type (unit) checking are meta-properties that should be enforced in the knowledge base.

<opinion> The question of model sharing is a question of sharing knowledge bases, or, in the Semantic Web world, ontologies and metamodels. What we're fighting here is the learning curve, which can be extensive even for a focused simulation system.

Modeling and sharing are primarily psychological issues, requiring education. Ontologies by themselves are not sufficient because they are meant to address the vocabulary issue. But science is more than vocabulary and requires understanding of the overlying metamodels as logical structures. The database has to be laid out to deal with the connections, rules, and vocabulary. </opinion>

26jul08: Peter Hunter writes to JBB: Dear Jim et al,

Following the very thoughtful letters from Andrew and Steve, I thought I'd add a comment from our Auckland/Oxford experiences with CellML model curation.

Re the question 'why is model sharing so hard and what can we do to make it easier', in common with others I think that the root of the problem is that our scientific processes do not provide rewards or incentives for model sharing. Typically when someone writes a modeling paper there is little thought given to how someone else might use the model -- the goal is to get a publication. Not too many universities give rewards for collegiality, nor do granting agencies (although there are signs in both the US and Europe that this is changing -- e.g. the MSM grants and European VPH grants).

The following does not address the sociological issues but is, I think, one - not the only! - way of achieving model sharing goals and represents the strategy we are currently pursuing with the CellML project.

1. We code up the model in CellML from a peer-reviewed publication or an about-to-be-submitted publication. This takes 1-2 days for someone familiar with CellML but not the particular model - and is usually done by a CellML curator working jointly with the author. It would take a few hours for the author of a model if that person was also familiar with CellML. This produces a model that works, has consistent units and is consistent with the publication (or at least the deficiencies in the publication are revealed!), and can be run with the CellML simulation software such as PCEnv, COR & JSim (we normally check with PCEnv & COR). But it is not yet capable of being combined with another model. Also, while the model may, hopefully, have been validated against the biology it is describing (at least to some extent), as part of the peer-reviewed publication process, it will almost certainly not have been tested under a wider range of conditions and is quite likely to violate some of the biophysical constraints that Jim has highlighted, such as conservation of mass, conservation of charge or thermodynamic feasibility.
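As an illustration of the kind of unit consistency that step 1 enforces (a toy dimension-vector check, not the actual CellML units machinery):

```python
# Illustrative only: check both sides of a rate equation with simple dimension
# vectors. Real CellML tools derive this from the model's declared units.

UNITS = {                       # exponents: (mass, length, time, amount)
    "micromolar":       (0, -3,  0, 1),   # amount per volume
    "per_second":       (0,  0, -1, 0),
    "micromolar_per_s": (0, -3, -1, 1),
}

def multiply(u, v):
    return tuple(a + b for a, b in zip(UNITS[u], UNITS[v]))

# Check d[Ca]/dt = k*[Ca]: the left side is concentration/time, the right side
# is (1/time) * concentration -- the dimension vectors must match.
lhs = UNITS["micromolar_per_s"]
rhs = multiply("per_second", "micromolar")
assert lhs == rhs, "unit mismatch between the two sides of the rate equation"
print("rate equation is dimensionally consistent")
```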

2. The next step is to annotate the model using terms from the standard bio-ontologies. This annotation is stored as metadata -- it gives biological meaning to the mathematical terms in the model and provides a unique reference for each component of the model. Doing this annotation requires knowledge of the biology described by the model and is best performed by the author -- but can also be done by the CellML curation team. The key to getting the authors to do this is for us to improve the tools (e.g. coupling PCEnv to Protege).
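An illustrative sketch of what step 2 produces, stripped to its essentials: machine-readable links from model variables to reference-ontology terms (the identifiers below are placeholders, not verified accessions):

```python
# Illustrative only: annotation metadata as simple triples. The term IDs are
# placeholders; a real curation pass would use actual GO/FMA/ChEBI/OPB accessions.

annotations = [
    # (model component,     qualifier,  reference term)
    ("membrane_model.V",    "is",       "opb:membrane_potential_placeholder"),
    ("calcium_model.Ca_i",  "is",       "chebi:calcium_ion_placeholder"),
    ("calcium_model.Ca_i",  "isPartOf", "fma:cytosol_placeholder"),
]

def terms_for(component):
    """Everything the metadata says about one model variable."""
    return [(q, term) for c, q, term in annotations if c == component]

print(terms_for("calcium_model.Ca_i"))
```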

3. Now import two models that have been individually CellML-ised, curated and annotated into one new supermodel. It is imperative to have good visualization software to render the components of both models. We do now have good SVG-based visualizations, but not yet automatically created from the ontological terms in the model's metadata -- that is still probably at least 6 months away. Currently a time-consuming process is undertaken to find the common components between the two submodels and also to define coupling where needed. It is not yet clear whether this can be completely automated -- I suspect 90% of the work can be automated but that there will always be some user intervention required.
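A toy sketch of the matching part of step 3: components of two curated models are identified as candidates for merging when their metadata points at the same reference term (the annotation IDs are illustrative only):

```python
# Illustrative only: find variables in two annotated models that share a
# reference-ontology term and so presumably describe the same quantity.

model_a = {"V_m": "opb:membrane_potential", "Ca_i": "chebi:calcium_ion"}
model_b = {"Vm":  "opb:membrane_potential", "I_Ca": "opb:ionic_current"}

def shared_components(a, b):
    """Pairs of variables that the metadata says refer to the same thing."""
    by_term = {}
    for name, term in a.items():
        by_term.setdefault(term, []).append(("A", name))
    for name, term in b.items():
        by_term.setdefault(term, []).append(("B", name))
    return {term: names for term, names in by_term.items() if len(names) > 1}

print(shared_components(model_a, model_b))
# -> {'opb:membrane_potential': [('A', 'V_m'), ('B', 'Vm')]}
# The remaining ~10% is the human step: deciding how the non-shared variables couple.
```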

4. Finally the combined model is run, subjected to automated unit checking and checked against available experimental data that tests the combined model. Of course we would all really like to have automated tests applied to check the various applicable biophysical constraints. I see no reason why this will not be possible in future releases of the CellML tools.
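One example of the kind of automated biophysical check envisaged in step 4: a mass-conservation test over a made-up reaction network:

```python
# Illustrative only: if every reaction's net mass change is zero, total mass in
# a closed network cannot change. Species masses and stoichiometries are invented.

species_mass = {"A": 100.0, "B": 50.0, "AB": 150.0}     # arbitrary units

reactions = {
    "binding":   {"A": -1, "B": -1, "AB": +1},           # A + B -> AB
    "unbinding": {"A": +1, "B": +1, "AB": -1},            # AB -> A + B
}

for name, stoich in reactions.items():
    net_mass = sum(coeff * species_mass[s] for s, coeff in stoich.items())
    assert abs(net_mass) < 1e-9, f"reaction '{name}' creates or destroys mass"

print("all reactions conserve mass")
```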

It will be important in the future to add biophysical constraint checking into the tools in a way that allows these constraints to be applied as the models are being authored in the first place. Dan Beard's idea, that I fully concur with, is that generating CellML models from a database of reaction parameters (with their ion species dependencies etc) is the right way to generate metabolic (and possibly signal transduction) models. It then becomes much easier to impose the biophysical constraints.
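A minimal sketch of the database-to-model idea, with an invented reaction table: the rate equations are emitted mechanically from the stoichiometry and rate constants, which is what makes the constraints easy to impose at authoring time:

```python
# Illustrative only: build mass-action rate equations from a tiny reaction table.
# The table entries and rate law are made up; real databases carry much more
# (ion/species dependencies, thermodynamic data, etc.).

reactions = [
    # name,         substrates,     products,       forward k, reverse k
    ("hexokinase", ["GLC", "ATP"], ["G6P", "ADP"],  1.0e5,     1.0e-1),
]

def rate_equations(table):
    eqs = {}
    for name, subs, prods, kf, kr in table:
        forward = f"{kf}*" + "*".join(subs)
        reverse = f"{kr}*" + "*".join(prods)
        flux = f"({forward} - {reverse})"        # net mass-action flux
        for s in subs:
            eqs.setdefault(s, []).append(f"-{flux}")
        for p in prods:
            eqs.setdefault(p, []).append(f"+{flux}")
    return {sp: f"d[{sp}]/dt = " + " ".join(terms) for sp, terms in eqs.items()}

for line in rate_equations(reactions).values():
    print(line)
# Every flux that consumes a species reappears, with opposite sign, for the
# products -- so mass balance is built in by construction.
```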

Another issue is how we should be encouraging collaborative efforts on code development. The software "wheel" continually gets reinvented -- which satisfies a deep-rooted desire that engineers and computer scientists seem to have to write their own code, but means that we are well behind where we should be in sophisticated multi-scale biomedical modeling software - if only we could figure out how to incentivise people to collaborate on software!

Cheers, Peter


25jul08: DL (Steve) Stevenson writes to JBB:

I am involved in the National Science Digital Library's CSERD (an archive of models and simulations for education) Pathway project that is trying to enforce V&V standards on our materials. We too suffer from a lack of input from the community. And we don't see how to fix it, either. We have actually tried to recruit folks who would be "editors" (I assume your "curators") in such a process; it's been really hard to get people. We believe much of the problem is that correctness is not what drives the train. My (cynical) view is "Speed is the opiate of the programming class."

I think one of the ongoing projects should be developing metadata and tagging the projects so that people can find what they need. This will become crucial when we want to construct "full biology" models of organisms from validated parts (federations).

> I am also concerned that the feedback re the standards is really nil.
> How can we collaborate without them? Or maybe we can, and they are not
> needed. Or would it be OK to go laissez faire on this and leave it to others
> to define standards.

What you have in the spreadsheet looks reasonable, but the proof is in trying to apply it. My feeling is that if it can't be enforced (type checks, unit checks, ...) by software, it won't get done.

I think history shows that standards arise after a crisis or a demonstrated need. Engineering standards are part of the community they serve, and generally work if the community buys in.

I can't lay my hands on the title right now, but I read a book that discussed how the concept and development of standards in the traditional engineering sense is not what is happening today. Standards are centrally controlled, but model and system development is pretty much autonomous. This means that information is developed and passed around informally through networks of workers.

As my industry friends ask, "What is the value proposition?" What do I gain for my investment of resources into standards? One value proposition might be a real archive with query support. My experience with archives like NetLib is that even if you can find something relevant, there's no guarantee that you can actually use the code or modify it for your problem. A lot of the problem seems to be the "reproducibility" issue.

Greg Wilson of Toronto has started talking about VV&R: verification, validation, and reproducibility. I think he's thought it through and is right. The reuse crowd misses the boat IMHO because in modeling the question being investigated dominates decisions. While it would be nice to plug-and-play, I don't think we know enough yet about system development. Truly reusable code probably is not possible under the current programming paradigms: we're too close to the machine.

So ... collaboration should be pursued, but maybe it's not the traditional activity. Numerical folks would be happy to help with the numerical parts but may not have the background to help in the domain science *directly*. Ditto computer science; etc.

> MIRIAM does not suffice. Our tests of the BioModels database show that many
> are inadequate, and yet they are working hard at it. SEE MY TIRADE ON THE
> IMAG WIKI "ModuleRules" under WG10, Monthly reports.

The NSDL project is focused on metadata tagging, and our Pathway has seen that it does not work in the interdisciplinary setting. Our view is that you need to treat metadata as ontologies, but to process the information you must have logical rules. I've just (literally in the last two months) started looking at this. Alan Bundy at Edinburgh has worked a lot on this question of using ontologies, meta-models and queries. "Inadequate" is a matter of degree.

Steve


24jul08: From Andrew McCulloch to JBB:

A couple of comments: From a review of the discussions it is evident that progress has been made over the past year, especially in working on the use of physical constraints to validate shared model descriptions. The value of this is tremendous and can't be overstated.

I think the lack of greater participation here (and I count myself among the under-participants) does not reflect a lack of support or interest so much as the fact that the MSM is primarily devoted to research on new multi-scale modeling strategies and problems, whereas CellML and SBML (and standards in general) are codifying representations of established model classes, which happen to be systems models that do not include the structural details that are the hallmark of most multi-scale models (other than those that are primarily multi-scale in *time* rather than space). That is not to say that such systems models are not important to structurally integrated multi-scale modeling, but sharing and representing them is no longer the rate-limiting step in multi-scale modeling. For example, sharing a complex three-dimensional anatomic model of a cell or organ in a platform- and software-independent way tends to be much more time-consuming for us. That certainly does not argue against standards. On the contrary, it argues for more standards, such as tissueML and anatML.

As I mentioned in Pullman last week, I think there are also some psycho/social aspects working against the broader adoption of these standards. One is the common impression (whether right or not) that since such model representations, or the models themselves, are rarely flawless, they are unreliable. That assumes an alternative source is reliable, which is unlikely, but again it only argues for greater community participation in curating and validating models. The second is probably that the effort of such curation is not generally recognized or rewarded as productive academic activity. Even if it were, it would get no more credit than does service as a peer-reviewer, so we need to look to the scientific benefits rather than the professional incentives. I think the tipping point of cost versus benefit is close, and we should now be prioritizing new efforts in standards development, specifically those addressing unmet needs in multi-scale modeling regarding the representation of complex biological structures at subcellular and multi-cellular mesoscales. Well, that's my five cents worth.

....from Andrew McCulloch


22jul08: Please see file "ModuleRules" under Monthly Reports (above) for discussion at August IMAG mtg. ...Jim B

8july08: from Jim B, Please see the new, totally reorganized Spreadsheet "Standards.list.jul08". Comments on this before the IMAG Montreal meeting next month would be greatly appreciated.

march 08: Jim B. The discussion below is very useful; we need more of it. The gathering at Experimental Biology

27oct07: Excel spreadsheet list for modeling standards Media:Standards.list.xls is provided here for download to check individual models. For Discussion: Is this list useful to modelers who want their models to be available to others? Could it be used as a set of expectations for publishing models? Does it help to know which of the standards are being met? Is this list an aid to reviewing models for publication?

4 Sep 07. While doing research for a book, I found an interesting discussion on "Models in Science" on the Stanford Encyclopedia of Philosophy, http://plato.stanford.edu/entries/models-science. Extensive bibliography.

9 Aug 07. Roy Kerckhoffs. Jim asked me to comment on the Model Standards. The requirements for a Class 4 model are really strict! Especially for multi-scale models it might be too strict, so that hardly any would ever end up in that class. For example, if only a small component of a multi-scale model is empirical, that would render the whole multi-scale model not of Class 4, even when all the other components are biophysically based. Also, the mass, charge, thermodynamic, etc. balances, combined with the validity, verification, and documentation requirements, make it a strict classification. Maybe those four parts (1-5, 6-9, 10-11, 12-16) should be evaluated/scored separately for each model, where A means it obeys the rules, B that it does not, and C that they don't apply? Just a thought, more or less in line with the paper of Smith et al (J Exp Biol 210:1576, 2007). So a biophysically detailed model that has been demonstrated valid and is verified, but not well documented, would look like AAAB?

I like the small reference to Platt's principles. A model is a simplified representation of reality to understand reality. But if a model approaches reality (in a manner of speaking...) then we need another simplified model to understand that original complex model!

I'm missing conservation of momentum and moment of momentum under "biophysically based models should have:", but I guess those fall under conservation of energy.

Initial conditions. I'm not sure whether initial conditions should be consistent with a steady state. I guess that means that initial conditions should be part of a steady-state solution. A steady state can still be reached from a non-steady state: I think it's more important to state that the model is provided with initial conditions and parameter values that lead to a stable model, but maybe that's what is meant.

27 Jul 07. Response to Jim Bassingthwaighte. I will certainly try it. I had been led to believe that SBML was in for a complete makeover. Shouldn't be that hard to write filters to reformat.

In a more constructive vein, some thoughts came as I read the standards.

  • 1. On page 1, last paragraph, "multiple sets of experimental data." Is there a standard by which that data is massaged (not necessarily in J-Sim)? Response-surface generation, for example?
  • 2. Is there a "class 5" model classification? It would seem that we would want to have consistent meta-data tags etc. We've run into this problem on other groups I've worked on. I'm kind of a "general systems" maven, at least as a metatheory and vocabulary. I believe "systems biology" is really just an outgrowth of GST.
  • 3. I like your sentence on page 3, "Models are working hypotheses summarizing the integrated concept of framework for a body of observations." But I don't think it goes far enough. My own V&V view is that consistency and coherence are the hallmarks of usable models. In the verification and validation (V&V) community in the engineering and DoD areas, the conclusion has been reached that models are really for decisions or judgments. That's sort of an implication of your Huxley quote further down. For dynamical systems we tend to gloss over these issues, but it seems to me that with biological systems we have to be really careful. A good addition to Huxley's quote is Box's: "All models are wrong, some models are useful."
  • 4. V&V. The national labs, DoD, and two engineering professional organizations - ASME and AIAA - have standards that we can look at. I'm assuming that biological modeling is no easier than CFD, which is the focus of the ASME and AIAA standards.

23Jul07: Jim Bassingthwaighte. Response to Stevenson:

  • JSim
JSim is a simulation interface supporting primarily the modeling analysis of data: from model development, through verification and validation and comparisons among models, to parameterization of experimental data and routine clinical data analysis. Coding can be done directly in JSim's MML (Mathematical Modeling Language). It provides a variety of numerical methods for ODEs and PDEs, sensitivity analysis, optimization of model fits to data (several optimizers), and repetitive operation (loops) with automated parameter changes for model exploration.

Model code in JSim's MML can be automatically generated from SBML and CellML. In addition to parsing MML into Java for computation, JSim is also serving as a front end for models in FORTRAN, C, and Matlab. A course using it will be given September 8th to 15th at U. Washington. Free download of models and JSim at:

https://www.physiome.org/jsim

The JSim system will be demonstrated to the IMAG/MSM group using BREEZE, probably on 23aug07. Try it out, with or without downloading it, at www.physiome.org.

23Jul07. Hi, I'm Steve Stevenson from Clemson, and I am new to this group. I put my comments in a separate file. Please let me know if I'm out of line .... File:23 July 2007.doc

23July. From Herbert Sauro: Just to add to Dan's comment on a kinetic database. I think this would be a very nice idea, particularly with the rise of synthetic biology, where the need for a parts database and spec is even more pressing. I would assume that your database will have a programmatic interface (e.g., Web Services) to allow software to access the data? So many biological databases are closed to software (an oxymoron!), and developers have to resort to so-called HTML scraping to extract the data from the database.
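For illustration, programmatic access might look like the following from the client side; the endpoint URL and JSON fields here are hypothetical, the point being only that software gets structured data rather than a web page to scrape:

```python
# Illustrative only: a client fetching one record from a hypothetical kinetics
# database web service as JSON, with no HTML scraping involved.

import json
import urllib.request

BASE_URL = "https://example.org/kineticdb/api"      # hypothetical web service

def get_mechanism(mechanism_id):
    """Fetch one mechanism record as structured data (JSON), not a web page."""
    with urllib.request.urlopen(f"{BASE_URL}/mechanisms/{mechanism_id}") as resp:
        return json.load(resp)

# Example usage (hypothetical record ID and fields):
# record = get_mechanism("hexokinase_1")
# print(record["rate_law"], record["parameters"])
```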

23July2007: I have suggested developing a kinetic mechanism database. This would be a database of functional units of biochemical and electrophysiology systems models. A sample entry and a set of proposed criteria for the database can be found at [[1]]. -Dan Beard

22jul07: Peter Hunter: CellML Description and Discussion Hunter21jul07

20jul07: Discussion 18-20apr07 on Standards and Sharing StandDisc20jul07

20jul07: Garfinkel1969. Sent by Dan Beard.

Report of Discussion 10april07 on Model Sharing Media:Sharing07.doc from James Glazier

19jul07: A revised set of standards Media:Standards19jul07.doc. Revisions by Andrew Miller, Anushka Michailova, and Jim Bassingthwaighte. In this revision we have reduced the "personal" U.Washington overtone, but it is still there. The "standards" as set forth are not nicely graded, and therefore don't lend themselves to grading models against a standard. Should this be a goal? Can Classes 1, 2, and 3 be better defined? Are there missing requirements? Herbert Sauro has suggested that archival systems like SBML and CellML might be more useful if the models were defined more clearly in terms of the biology, not just the mathematics. Comment? ......JBB

3jul07: Please review the Modeling Standards: We'd like to have an agreed upon working draft at the end of July...jbb SEE Media:Standards19jul07.doc

June 07: Standards and computational software (Herbert Sauro) Software and Standards

June 07: Peter Hunter's document on model sharing Media:IMAG_model_sharing_proposal_PJH.DOC
