What do medical practitioners and canoeists have in common, and how can they both be helped by data scientists?
The answer is they can both benefit from ontology!
But what on earth is "ontology" and what possible use could it have?
Picture this: you have a fiendish sweet tooth and, being British (or Canadian), you walk into a Walmart in New York and ask for Smarties. What you get is not Smarties as you know them but what you would call Fizzers (or Rockets), because "Smarties" are in fact something different in the USA.
Or imagine you're going to see a "football" game and they're playing something that looks like rugby with body armour.
And Canadian canoes? Well, they're just canoes, aren't they? How is this a "Canadian" canoe when it's made by Nelo, a Portuguese company, and is an International C4 rather than a Canadian C4?
Really annoying, isn't it? You discover that you're not actually talking about the same thing, despite using recognisably English terminology that has a clear meaning to you but a different meaning to someone else. This can get even more complicated in other languages. In Danish, for example, turtles and tortoises are not distinguished (they're both skildpadder).
The problem is that the same phrase can mean different things to different audiences, while a single concept can be described in several different ways.
This is tricky when we want to collect statistics: how do we know what any of this means? To compile the statistics properly, you have to understand what the terminology means and how different sources might use it. Take the haematological malignancy multiple myeloma. It can be called multiple myeloma, plasma cell dyscrasia, Kahler's disease, or plasma cell leukaemia. However, it is a different disease to myeloid leukaemia. How do you know that without being a nerd who's interested in multiple myeloma? We'd need a nerd for each and every concept to compile this data.
This is where ontology saves the day. Using an ontology means that instead of calling something by a name that we decide, we refer to it by a code. In the case of multiple myeloma, it's given the OMIM code 254500, the ICD-10 code C90.0, the DiseasesDB code 8628, and the MeSH code D009101. People who log statistics for hospital visits (known as clinical coders) use these codes to log the conditions instead of a name for the disease, and scientists who write publications on these conditions refer specifically to these codes to avoid ambiguity.
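To make that concrete, here is a minimal Python sketch of counting by code rather than by name. It is only an illustration: the lookup table, the function names and the sample visit list are all made up for this example, and only two of the names mentioned above are mapped to the ICD-10 code C90.0. Real clinical coding is done by trained coders, not by a dictionary lookup.

```python
from collections import Counter

# Hypothetical lookup table: free-text names -> one ICD-10 code.
# The code C90.0 comes from the text; the table itself is an assumption.
NAME_TO_ICD10 = {
    "multiple myeloma": "C90.0",
    "kahler's disease": "C90.0",
}

def to_icd10(name):
    """Return the ICD-10 code for a free-text name, or None if unknown."""
    return NAME_TO_ICD10.get(name.strip().lower())

def count_by_code(raw_names):
    """Tally records by code, so synonyms land in the same bucket."""
    return Counter(to_icd10(name) or "UNCODED" for name in raw_names)

# Made-up sample of how the same disease might be written down.
visits = ["Multiple myeloma", "Kahler's disease", "multiple myeloma ", "Myeloid leukaemia"]
counts = count_by_code(visits)
print(counts["C90.0"])    # the three myeloma records are counted together
print(counts["UNCODED"])  # myeloid leukaemia is a different disease, so it is not lumped in
```

Once everything is a code, nobody has to remember that Kahler's disease and multiple myeloma are the same thing; the table remembers it for them.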
This is what ontology is. There are some challenges with good ontologies, owing to the fact that computers and humans process information differently, but these ontologies help collect medical statistics, which is useful for epidemiological research, and public health.
The Sprint Racing Results Service also uses an ontology. Have you noticed that in the page URLs (Figure 1) you can occasionally see flags saying "JSV=J" and similar?
Figure 1: Codes used in the Sprint Racing Results Service to define the race types. This type of ontological classification allows races to be logged in the database and searched for in an easier manner.
These are there because the race classes are stored in an ontological system. Races are not logged by name; they're logged as codes.
Take Boys D Kayak for example. I could record it as Boys D Kayak and search for "Boys D kayak" in the various database queries, but I'd have problems with all the misspellings of "Boys" that exist in the raw data, which I can't be bothered to tease out. So I don't record it like that. I record everything that looks like "Boys D Kayak" as JSV=J, MW=M, CK=K, Abil=D. JSV=J and MW=M together mean boys (junior men), CK=K means kayak, and Abil=D means the D ability band. From these flags I can rebuild the class name.
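Rebuilding a name from the flags might look something like the Python sketch below. This is not the actual script behind the results service. The flag values J, M, K and D are the ones described above, but the `&`-separated query string format, the extra table entries and the function name are assumptions made for illustration.

```python
from urllib.parse import parse_qsl

# Lookup tables from flag values to display words.
# ("J", "M") -> "Boys" and "K" -> "Kayak" follow the text;
# the other entries are assumed for the sake of the example.
GROUPS = {("J", "M"): "Boys", ("S", "M"): "Men"}
BOATS = {"K": "Kayak", "C": "Canoe"}

def class_name(flags):
    """Rebuild a human-readable class name from its coded flags."""
    group = GROUPS[(flags["JSV"], flags["MW"])]
    boat = BOATS[flags["CK"]]
    return f"{group} {flags['Abil']} {boat}"

# Assumed URL-style encoding of the flags.
flags = dict(parse_qsl("JSV=J&MW=M&CK=K&Abil=D"))
print(class_name(flags))  # Boys D Kayak
```

However the class was spelt in the raw data, anything stored with these four flags comes back out with the same name.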
As well as saving space, I can create class names consistently. All instances of Boys D Kayak will display as such, not as other, weirder things, provided I have a robust algorithm for converting the mixture of codes into the correct race name. In future I can also change the name of a class and update all instances in the database automatically by amending the algorithm, without changing any raw data. If it's ever decided that "Boys" is patronising and should be "Junior Men" instead, I just update the conversion script. If Denmark ever invaded, I could change it to "Drenge" so that the database conforms to the new Danish language codes, og jeg vil konvertere koderne til Dansk på et senere tidspunkt for at undgå at gå i fængslet (and I will convert the codes to Danish at a later date to avoid going to prison).
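A rough Python sketch of that renaming trick: the stored rows hold only flags, and the labels live in a separate table that can be swapped. The row layout, the label tables and the function name here are hypothetical, not the service's real schema or script.

```python
# Stored rows hold only flags; these never change when a label does.
stored = [
    {"JSV": "J", "MW": "M", "CK": "K", "Abil": "D"},
    {"JSV": "J", "MW": "M", "CK": "K", "Abil": "D"},
]

def display_name(flags, group_labels):
    """Build the display name using whichever label table is current."""
    group = group_labels[(flags["JSV"], flags["MW"])]
    boat = {"K": "Kayak"}[flags["CK"]]
    return f"{group} {flags['Abil']} {boat}"

old_labels = {("J", "M"): "Boys"}
new_labels = {("J", "M"): "Junior Men"}

before = [display_name(row, old_labels) for row in stored]
after = [display_name(row, new_labels) for row in stored]
print(before)  # every row shows as Boys D Kayak
print(after)   # same rows, now shown as Junior Men D Kayak
```

The only thing that changed between the two printouts is the label table; the stored rows were never touched.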
I can also search simply for broader classes: all senior men would be JSV=S, MW=M. If one race has two classes, both can have codes and both show up in searches. There are none of the annoying problems that come with searching human-readable text, like how do I get "men's" races without also getting "women's" races, but without excluding men's races where they're mixed with women's races.
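As a sketch of how such a search could work, here is a small Python example. The in-memory list of races, the race ids, the "W" value for women and the function names are all invented for illustration; the real service presumably does this with database queries.

```python
# Made-up races; each can carry more than one coded class.
races = [
    {"id": 1, "classes": [{"JSV": "S", "MW": "M", "CK": "K", "Abil": "A"}]},
    {"id": 2, "classes": [{"JSV": "S", "MW": "W", "CK": "K", "Abil": "A"}]},
    {"id": 3, "classes": [  # a mixed race: senior men and senior women together
        {"JSV": "S", "MW": "M", "CK": "K", "Abil": "B"},
        {"JSV": "S", "MW": "W", "CK": "K", "Abil": "B"},
    ]},
    {"id": 4, "classes": [{"JSV": "J", "MW": "M", "CK": "K", "Abil": "D"}]},
]

def matches(race_class, **wanted):
    """True if the class has every requested flag value."""
    return all(race_class.get(flag) == value for flag, value in wanted.items())

def find_races(all_races, **wanted):
    """Return ids of races where at least one class matches the flags."""
    return [
        race["id"]
        for race in all_races
        if any(matches(c, **wanted) for c in race["classes"])
    ]

senior_men = find_races(races, JSV="S", MW="M")
print(senior_men)  # [1, 3]
```

The women-only race is left out, the mixed race is kept, and nobody had to worry about "men's" being hidden inside "women's".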
This is why ontology is good. Behold the excellence of ontology and tremble before its glorious usefulness.