BioPAXValidatorRFC1
2009
A "flexible" application (both console and GUI) for defining and checking rules that reports syntax and semantic errors and warnings or, better, fixes them automatically. The following components may be integrated: Paxtools, Ontology Manager (helps with controlled vocabularies), the MIRIAM database (for external database names and ID patterns), biological ontologies (GO, MI, SO, etc.), and other OWL or generic model validation frameworks or tools.
The validator is to be implemented in Java, in a configurable and extensible manner. Validation rules are to be derived from the BioPAX documentation and designed as "independent" generic Java classes that use Paxtools' in-memory BioPAX model, so that more rules can easily be added later. Rules will check across several entities, overlap in their subjects, or call other rules if required. Both fail-fast and post-model validation modes are to be considered. In most cases (e.g., when importing and checking a single OWL file) fail-fast is barely required, but it may become relevant in future versions that allow interactive model assembly and merging. Not all rules are applicable to fail-fast checking.
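The idea of rules as independent classes that run in either fail-fast or post-model mode can be sketched as follows. This is a minimal illustration under stated assumptions, not the validator's design: `Rule`, `Mode`, `Element`, and `ConversionDirectionRule` are hypothetical names, and `Element` is a plain stand-in for the Paxtools in-memory model, whose API is not used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RuleSketch {

    /** When a rule may run: while the model is being built, or after it is complete. */
    enum Mode { FAIL_FAST, POST_MODEL }

    /** Hypothetical stand-in for a BioPAX object (not a Paxtools class). */
    static final class Element {
        final String rdfId;
        final String type;
        final Map<String, String> properties = new HashMap<>();

        Element(String rdfId, String type) {
            this.rdfId = rdfId;
            this.type = type;
        }
    }

    /** A self-contained rule; new rules are added by implementing this interface. */
    interface Rule {
        boolean supports(Mode mode);
        boolean canCheck(Element e);
        /** Returns one message per problem found; an empty list means the element passed. */
        List<String> check(Element e);
    }

    /** Example best-practice rule: a conversion's direction should not be empty. */
    static final class ConversionDirectionRule implements Rule {
        public boolean supports(Mode mode) {
            // Assumption for this sketch: the property may be set after the object
            // is created, so the rule only makes sense once the model is complete.
            return mode == Mode.POST_MODEL;
        }

        public boolean canCheck(Element e) {
            return "Conversion".equals(e.type);
        }

        public List<String> check(Element e) {
            String direction = e.properties.get("conversionDirection");
            if (direction == null || direction.isEmpty()) {
                return Collections.singletonList("Conversion " + e.rdfId + " has no direction");
            }
            return Collections.emptyList();
        }
    }

    /** Runs every rule that supports the given mode over all elements it can check. */
    static List<String> validate(List<Element> model, List<Rule> rules, Mode mode) {
        List<String> messages = new ArrayList<>();
        for (Element e : model) {
            for (Rule r : rules) {
                if (r.supports(mode) && r.canCheck(e)) {
                    messages.addAll(r.check(e));
                }
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        Element ok = new Element("conv1", "Conversion");
        ok.properties.put("conversionDirection", "LEFT-TO-RIGHT");
        Element bad = new Element("conv2", "Conversion");
        List<Rule> rules = Collections.singletonList(new ConversionDirectionRule());
        System.out.println(validate(Arrays.asList(ok, bad), rules, Mode.POST_MODEL));
    }
}
```

Keeping the mode check inside each rule lets a rule opt out of fail-fast checking on its own, which matches the note that not all rules are applicable in that mode.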
- Introduce new rules easily; rules can be interwoven or changed. (1)
- Provide intelligible error messages that help the user trace an error back to its cause. (1)
- Interactively change the rules' error levels and behavior (e.g. auto-fix, fail-fast or post-construct). (1)
- Handle RDF, dangling RDF IDs (turn on the closed-world assumption) and XML "well-formedness" using the same configuration file (1)
- Normalization: some rules can normalize the model, i.e. change it to conform to best practices (e.g. fix duplicated PEPs).
- Provide multiple reporting options (e.g. command line and web service) and message aggregation. (2)
- Fail-fast validation, as the model is being built (1)
  - Does not check all closed-world cardinality constraints
  - Domain, range, cardinality (e.g. kPrime has a single float value, with cardinality=1)
  - Checks specific rules, e.g. duplicate PEPs; maintains sub-properties automatically
  - Checks XML and RDF (e.g. ensures there are no duplicate or dangling RDF IDs)
- Post model construction validation routine (will check across entities) (1).
- Support BioPAX Level 3 only (rationale: Levels 1 and 2 are more work to validate, and this will encourage users to switch to Level 3)
- Validation via API (Paxtools) and a user-friendly web server. (3)
- Should not impact loading times significantly. (2)
- Ensure immutability - rules should not be able to change the model. (3)
- Rules should not be dependent on other rules.
- Rules will be BioPAX level specific.
- Error reporting:
  - Passes a human-readable message describing the error, together with the RDF IDs of the objects involved.
  - User-configurable: the user chooses which rules to report and which to auto-fix (e.g. create an error manager class, then classes of errors and logs that the user wants to configure).
  - Default verbosity levels should be implemented.
  - If there are too many errors, the validator should exit with a message once a threshold is reached. (3)
- Validation rules:
  - Defined as classes that can be contributed (set up for community contribution). The SIF rule system provides an example of how this could be done.
  - Two types: best practices, normalization
  - Best practices described in the BioPAX documentation, e.g. stoichiometry missing; conversion/control direction shouldn't be empty
  - Normalization: verify standard names and controlled vocabularies (CV); may require consulting external databases. Could use MIRIAM for normalization of IDs and database names.
  - Technical semantics: wrong unification xref, e.g. GO IDs shouldn't be in unification xrefs of physical entities; mixed use of BioPAX Levels
  - Semantics: conflicting cellular locations; unnecessarily generic conversions, e.g. complex assemblies represented as conversions; entity conservation, e.g. many entities, such as proteins, should have the same entity on both sides of a conversion, just in different states; checking that post-translational modifications match the sequence.
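
The error-reporting requirements above (a readable message plus RDF IDs, user-configurable levels per rule, and an exit once a threshold is reached) could be combined in an error manager along the following lines. This is only a sketch: `ErrorManager`, `Level`, `Report`, `TooManyErrorsException`, and the rule names used in `main` are hypothetical and do not come from Paxtools or any existing validator code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ErrorManagerSketch {

    /** How a rule's findings are treated; the user configures this per rule. */
    enum Level { IGNORE, WARNING, ERROR }

    /** One reported problem: a readable message plus the RDF IDs of the objects involved. */
    static final class Report {
        final String rule;
        final Level level;
        final String message;
        final List<String> rdfIds;

        Report(String rule, Level level, String message, List<String> rdfIds) {
            this.rule = rule;
            this.level = level;
            this.message = message;
            this.rdfIds = rdfIds;
        }
    }

    /** Thrown when the number of errors exceeds the configured threshold. */
    static final class TooManyErrorsException extends RuntimeException {
        TooManyErrorsException(String message) {
            super(message);
        }
    }

    /** Collects reports, applies per-rule levels, and stops validation after too many errors. */
    static final class ErrorManager {
        private final Map<String, Level> levels = new HashMap<>();
        private final List<Report> reports = new ArrayList<>();
        private final int maxErrors;
        private int errorCount = 0;

        ErrorManager(int maxErrors) {
            this.maxErrors = maxErrors;
        }

        void setLevel(String rule, Level level) {
            levels.put(rule, level);
        }

        void report(String rule, String message, String... rdfIds) {
            // Rules without an explicit setting default to ERROR in this sketch.
            Level level = levels.getOrDefault(rule, Level.ERROR);
            if (level == Level.IGNORE) {
                return;
            }
            reports.add(new Report(rule, level, message, Arrays.asList(rdfIds)));
            if (level == Level.ERROR && ++errorCount > maxErrors) {
                throw new TooManyErrorsException(
                        "Stopped after " + errorCount + " errors (threshold " + maxErrors + ")");
            }
        }

        List<Report> reports() {
            return reports;
        }
    }

    public static void main(String[] args) {
        ErrorManager manager = new ErrorManager(100);
        manager.setLevel("stoichiometry", Level.WARNING);
        manager.report("stoichiometry", "Stoichiometry missing", "conv1");
        manager.report("unification-xref", "GO ID used in a unification xref", "xref7", "protein3");
        for (Report r : manager.reports()) {
            System.out.println(r.level + " [" + r.rule + "] " + r.message + " " + r.rdfIds);
        }
    }
}
```

Only findings at the `ERROR` level count toward the threshold here; whether warnings should count as well is a design choice the requirements leave open.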