But HTML5 is easier to parse than any previous version of HTML, because there is now an actual specification that details the parsing process exactly, rather than leaving it to each browser's interpretation. It can't be that difficult, given the number of working HTML5 parsers and validators out there.
Plus, HTML5 can already be written using XML syntax, aka XHTML5. And searching for xhtml5.xsd or xhtml5.rng gave me plenty of links to schemas for validating XML-syntax HTML5.
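To illustrate why the XML syntax helps: an XHTML5 document can be fed to any ordinary XML parser. Here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The helper name `is_well_formed` and the two sample documents are made up for illustration. Note that this only checks well-formedness; validating against an XSD or RELAX NG schema would need a separate validator, which the standard library does not provide.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def is_well_formed(doc: str) -> bool:
    """Return True if doc parses as XML (hypothetical helper name)."""
    try:
        ET.fromstring(doc)
    except ET.ParseError:
        return False
    return True


# Sample XML-syntax HTML5: every element is closed, including <br/>.
good = (
    f'<html xmlns="{XHTML_NS}"><head><title>Demo</title></head>'
    "<body><p>Hello<br/>world</p></body></html>"
)

# The same markup with an unclosed <br>: fine in HTML syntax, not in XML.
bad = (
    f'<html xmlns="{XHTML_NS}"><head><title>Demo</title></head>'
    "<body><p>Hello<br>world</p></body></html>"
)

print(is_well_formed(good))
print(is_well_formed(bad))
```

The second document is rejected because an XML parser does no error recovery, which is exactly what makes the XML syntax convenient for strict validation pipelines.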
If you need to store validated documents, then you shouldn't be storing them in HTML format! Store them as XML documents with well-defined schemas (RELAX NG, of course!), and then use XSLT or possibly XQuery to turn them into HTML fragments for display.
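As a rough sketch of that pipeline: the stored XML is the source of truth, and the HTML fragment is generated from it on demand. Python's standard library has no XSLT processor, so the example below swaps in plain `xml.etree.ElementTree` code to do the transformation; a real setup would more likely use an XSLT stylesheet as described above. The `<note>`/`<title>`/`<para>` vocabulary and the function name `note_to_html` are invented for illustration, and the schema validation step is assumed to have happened before storage.

```python
import xml.etree.ElementTree as ET


def note_to_html(xml_text: str) -> str:
    """Turn a stored <note> document into an HTML fragment.

    Stand-in for an XSLT transform; the <note> vocabulary is hypothetical.
    """
    note = ET.fromstring(xml_text)

    article = ET.Element("article")
    heading = ET.SubElement(article, "h2")
    heading.text = note.findtext("title", default="")

    # Each <para> in the stored document becomes a <p> in the fragment.
    for para in note.findall("para"):
        p = ET.SubElement(article, "p")
        p.text = para.text

    # method="html" serializes with HTML rules rather than XML rules.
    return ET.tostring(article, encoding="unicode", method="html")


stored = (
    "<note><title>Schemas &amp; HTML</title>"
    "<para>Stored as XML.</para>"
    "<para>Displayed as HTML.</para></note>"
)

fragment = note_to_html(stored)
print(fragment)
```

Because the fragment is always generated from the XML, the display markup can change freely without touching the validated documents.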