Genetic Data Sharing - Richard Crooks's Website

How Data Sharing Helps Clinical Genetics

It is said that "the product of a laboratory is data". Laboratories (of all forms) take samples and analyse them in order to produce data of interest to service users. Whether this is patient samples to help diagnose disease, manufacturing samples to help monitor manufacturing processes, or research samples to advance scientific knowledge, all of this produces a product, which is data.

As genetic medicine becomes more widely used, the amount of data being produced is increasing dramatically. However, genetic medicine is perhaps unique among medical disciplines in that the data it produces is not purely diagnostic, but crosses into research.

In a traditional laboratory medicine discipline such as biochemistry, a result of a test describes what is going on in a particular patient. The reference ranges of such tests, and the clinical interpretations of those tests are derived from a strong evidence base of what happens in different physiological conditions. The results are limited to one patient.

In clinical genetics, on the other hand, the analysis crosses from pure diagnostics into research. A common finding in clinical genetics is a variant of undetermined significance (VUS), where the evidence is insufficient to interpret it. Furthermore, even variants whose clinical significance can be determined are classified with reference to the available evidence, which means that a scientist's time is taken up gathering evidence of variant pathogenicity.

In simple monogenic conditions, which are caused by one (or a few) variants acting in isolation, the scientists who analyse these results are conducting research to determine whether the variants they see are pathogenic: sometimes by reference to the available literature, sometimes by testing relatives to see if the variants segregate, sometimes through in silico prediction tools and sometimes through relevant functional studies. The crucial point is that if a variant is pathogenic, it should be pathogenic in every patient in whom it is seen; clinical geneticists are thus building the "reference range" of normal and pathogenic variation.

Since this is research that builds a reference range that all clinical geneticists should be able to use, there is a great advantage in sharing data, and even a need to do so, in order to improve the efficiency and quality of clinical geneticists' work. Let us describe a couple of scenarios that would benefit from data sharing.

Differently Incomplete Data

Different scientists can encounter and compile different pieces of evidence for interpreting the same variants, depending on the resources available in their hospitals and the range of patients they see. Variant X could be investigated by two scientists (A and B) working in different labs as follows.

Scientist A works in a large lab that has a research contract with a large Russell Group university, and so can access a wide variety of literature. They see a single patient who has the variant. The family history is limited, as the patient, a child, has unknown parentage. However, as the scientist has access to some limited literature describing functional studies, they are able to apply this evidence to the interpretation.

Scientist B sees the same variant in an unrelated patient in another hospital. Scientist B works in a smaller hospital that does not have a research contract with a university, and so does not have access to a large body of literature in which to search for functional studies. However, the patient's parents are both known, and it can be confirmed that the variant is de novo, with maternity and paternity confirmed.

Neither of these scientists, working alone with their own piece of evidence, can assign a variant classification other than VUS (class 3) according to the ACMG guidelines (Richards et al. 2015), since they each have only one piece of strong evidence. However, if they were to collaborate and share their data, they would both have two strong pieces of evidence, which is sufficient to classify the variant as pathogenic. Similar situations can arise with other mixtures of evidence, such as a variant seen in an individual whose condition has another known cause, or a variant seen in an unaffected individual.
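The effect of pooling evidence can be sketched in a few lines of Python. This is a simplified reading of the combining rules in Richards et al. (2015) that only counts pathogenic criteria and ignores benign evidence entirely, so it is an illustration rather than a classification tool. The mapping of the scenario onto criterion codes is an assumption made for the example: PS3 stands for Scientist A's functional studies and PS2 for Scientist B's confirmed de novo finding.

```python
from collections import Counter

# Evidence strength is encoded in the prefix of each pathogenic criterion code.
STRENGTH_BY_PREFIX = {
    "PVS": "very_strong",
    "PS": "strong",
    "PM": "moderate",
    "PP": "supporting",
}


def strength(code):
    """Return the evidence strength for a criterion code such as 'PS3'."""
    return STRENGTH_BY_PREFIX[code.rstrip("0123456789")]


def classify(codes):
    """Classify from pathogenic criteria only (benign criteria are ignored)."""
    # A set ensures a criterion reported by both labs is counted only once.
    counts = Counter(strength(code) for code in set(codes))
    vs, s, m, p = (
        counts[k] for k in ("very_strong", "strong", "moderate", "supporting")
    )

    if (
        (vs >= 1 and (s >= 1 or m >= 2 or (m >= 1 and p >= 1) or p >= 2))
        or s >= 2
        or (s == 1 and (m >= 3 or (m == 2 and p >= 2) or (m == 1 and p >= 4)))
    ):
        return "Pathogenic"
    if (
        (vs >= 1 and m == 1)
        or (s == 1 and (1 <= m <= 2 or p >= 2))
        or m >= 3
        or (m == 2 and p >= 2)
        or (m == 1 and p >= 4)
    ):
        return "Likely pathogenic"
    return "VUS"


lab_a = {"PS3"}  # Scientist A: functional studies found in the literature
lab_b = {"PS2"}  # Scientist B: de novo, maternity and paternity confirmed

print(classify(lab_a))          # Scientist A alone
print(classify(lab_b))          # Scientist B alone
print(classify(lab_a | lab_b))  # evidence pooled between the two labs
```

Holding each lab's evidence as a set makes the pooling step a simple union, and means that the same criterion contributed by both labs is not double counted.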

Incomplete Pedigrees

When a patient is seen by a clinical genetics service and the genetic testing subsequently discovers a variant, it is likely that close relatives will be tested (Figure 1). This is to discover if any parents or siblings also have the variant, and if they exhibit any clinical phenotype, and therefore to see if the variant segregates with the condition. This is important for determining the pathogenicity of a variant.

Figure 1: Family 1 (left) has an affected son and mother, who are both confirmed to have the variant. The father is confirmed not to be a carrier, as is the grandmother; however, the grandfather is not alive. Family 2 (right) has an affected daughter and an affected father, both confirmed to have the variant. The mother and grandfather are both confirmed not to have the variant, and are not affected. The grandmother is no longer alive and there is no clear clinical phenotype available for her. Neither family is aware of the other.

Often a patient’s available family may be limited. Older generations may not be alive, and as a result there may not be a clear picture of older relatives’ clinical features. Furthermore, a family may not be in contact with its extended family, and may be unaware that those relatives could benefit from testing. If cousins in the extended family are being tested separately for the variant at a different hospital, the hospital at which the first family is being tested may not be aware of the relationship between the families.

Figure 2: After the pedigrees were shared between the labs and the families’ histories were investigated, it was discovered that the families are related to one another, which greatly strengthened the evidence that the variant segregates with the condition.

If a relationship between families with a variant can be discovered, it may strengthen the segregation evidence for the variant (Figure 2). However, it is important to consider the patient privacy implications of making such a connection: members of an extended family may not be in contact with one another and may not want their medical history shared. This may be an issue where patient privacy comes into direct conflict with the best interests of diagnosing the patients. It may be necessary to discuss with such families whether connections should be made between their cases and those detected in other laboratories for these types of analyses.

Enabling Data Sharing

Although the advantages of data sharing are clear, navigating it is potentially difficult. The general utility of sharing data is recognised within clinical genetics, as well as in other medical specialties (Callahan et al., 2017). However, data protection law, specifically the European Union’s General Data Protection Regulation (GDPR), means that there is concern about the legal status of sharing data in healthcare (Neame, 2014), in genomics (Thorogood, 2018) and in scientific research more broadly (Chassang, 2017). This has led to concern that the regulatory window between sharing data for patient benefit and protecting patient privacy has narrowed (Phillips, 2018).

A proposed framework for data sharing is the FAIR principles (Wilkinson et al., 2016), which state that scientific data should be Findable, Accessible, Interoperable and Reusable. These were originally developed for online repositories of research data, such as UniProt or wwPDB. They have also attracted interest from clinical genetics (Corpas et al., 2018) and other medical specialties (Callahan et al., 2017), as well as from unrelated industries (Rychlik et al., 2018). This doesn’t mean that the data is shared directly; rather, others who may have a use for the data can become aware that it exists (findable) and request access to it (accessible). This is particularly important for patient-sensitive data: although details about a variant may be interesting to other scientists, and may help the diagnosis of other patients, they are still patient results and should be protected as such. Findable and Accessible doesn’t mean that variants, and the evidence used to classify them, should be freely available for anyone to browse; rather, the data should exist, and there should be systems in place that allow people to request access to it where there is a clinical need.
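The distinction between findable and accessible can be illustrated with a small in-memory registry. This is a hypothetical sketch, not a description of any existing system: the class names, method names and the placeholder variant `NM_000000.0:c.100A>G` are all invented for the example, and a real service would also need authentication, audit and consent handling.

```python
from dataclasses import dataclass, field


@dataclass
class VariantRecord:
    variant: str       # variant description
    holding_lab: str   # lab that holds the evidence
    evidence: list = field(default_factory=list)  # protected patient-derived evidence


class VariantRegistry:
    def __init__(self):
        self._records = {}      # (variant, lab) -> VariantRecord
        self._approved = set()  # (requester, variant, lab) grants

    def deposit(self, record):
        self._records[(record.variant, record.holding_lab)] = record

    def find(self, variant):
        """Findable: report which labs hold data, but not the data itself."""
        return sorted(lab for (v, lab) in self._records if v == variant)

    def approve(self, requester, variant, lab):
        """Record that the holding lab has agreed to a request."""
        self._approved.add((requester, variant, lab))

    def get_evidence(self, requester, variant, lab):
        """Accessible: release evidence only to an approved requester."""
        if (requester, variant, lab) not in self._approved:
            raise PermissionError("access has not been approved for this requester")
        return list(self._records[(variant, lab)].evidence)


VARIANT = "NM_000000.0:c.100A>G"  # made-up placeholder, not a real variant

registry = VariantRegistry()
registry.deposit(VariantRecord(VARIANT, "Lab A", ["functional study"]))

print(registry.find(VARIANT))  # any user can learn that Lab A holds data
registry.approve("Lab B", VARIANT, "Lab A")
print(registry.get_evidence("Lab B", VARIANT, "Lab A"))
```

Searching reveals only that a lab holds data on a variant; the evidence itself is released only after the holding lab has approved a specific request.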

Furthermore, the data should be in a format that can be used with multiple systems (interoperable) to aid its use by other clinical genetics services (reusable). There are a number of common data formats that can be used to share data between different systems. JSON and XML are two widely used standards in web-based APIs, which return data in a structured format. These data structures can be readily read and processed by built-in libraries in programming languages such as Python and PHP.
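As a brief illustration, the same record can be read from either format using only Python's standard library (`json` and `xml.etree.ElementTree`). The field names and the placeholder variant are assumptions made for this sketch and do not follow any particular published schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical payloads describing the same record in two formats.
json_payload = (
    '{"variant": "NM_000000.0:c.100A>G", '
    '"classification": "VUS", "evidence": ["PS3"]}'
)
xml_payload = (
    "<record>"
    "<variant>NM_000000.0:c.100A&gt;G</variant>"
    "<classification>VUS</classification>"
    "<evidence>PS3</evidence>"
    "</record>"
)


def from_json(text):
    """Parse a JSON record into a plain dictionary."""
    return json.loads(text)


def from_xml(text):
    """Parse an XML record into the same dictionary shape as from_json."""
    root = ET.fromstring(text)
    return {
        "variant": root.findtext("variant"),
        "classification": root.findtext("classification"),
        "evidence": [element.text for element in root.findall("evidence")],
    }


record = from_json(json_payload)
print(record["variant"], record["classification"], record["evidence"])
print(from_xml(xml_payload) == record)  # both formats yield the same structure
```

Once parsed, the receiving system works with the same dictionary regardless of which format the sending lab used.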

Data should be described in a way that is consistent and allows any user to know what the data is saying. A standard being implemented in clinical genetics for describing variants consistently is the HGVS nomenclature (den Dunnen, 2017), which provides a clear framework for describing variants. It allows data to be understood by anyone who has access to it, without ambiguity over whether a position is given as a gene coordinate or a genomic coordinate, or which transcript the variant is described against. Consistent formats like this mean that all scientists in the field see unambiguous descriptions of variants.
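To show how such a description removes ambiguity, the sketch below splits a simple substitution into its parts. The pattern is deliberately narrow: it is an assumption of this example that only single-base substitutions with a versioned reference sequence are handled, whereas the full nomenclature covers many more variant types. The variant string is a made-up placeholder.

```python
import re

# Matches only simple substitutions, e.g. <reference>:c.<position><ref>><alt>
SUBSTITUTION = re.compile(
    r"^(?P<reference>[A-Z]{2}_\d+\.\d+):"  # versioned reference sequence
    r"(?P<coordinate_system>[cg])\."       # c = coding DNA, g = genomic
    r"(?P<position>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_substitution(description):
    """Return the parts of a simple substitution, or None if it does not match."""
    match = SUBSTITUTION.match(description)
    if match is None:
        return None
    parts = match.groupdict()
    parts["position"] = int(parts["position"])
    return parts


print(parse_substitution("NM_000000.0:c.100A>G"))
print(parse_substitution("100A>G"))  # no reference or coordinate system given
```

A bare description such as `100A>G` is rejected because it states neither the reference sequence nor the coordinate system, which is exactly the ambiguity the nomenclature is intended to remove.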

References

Callahan, A., Anderson, K. D., Beattie, M. S., Bixby, J. L., Ferguson, A. R., Fouad, K., Jakeman, L. B., Nielson, J. L., Popovich, P. G., Schwab, J. M. and Lemmon, V. P.; FAIR Share Workshop Participants. (2017) Developing a data sharing community for spinal cord injury research. Exp Neurol. 295: 135-143.

Chassang, G. (2017) The impact of the EU general data protection regulation on scientific research. Ecancermedicalscience. 11: 709.

Corpas, M., Kovalevskaya, N. V., McMurray, A. and Nielsen, F. G. G. (2018) A FAIR guide for data providers to maximise sharing of human genomic data. PLoS Comput Biol. 14: e1005873.

den Dunnen, J. T. (2017) Describing Sequence Variants Using HGVS Nomenclature. Methods Mol Biol. 1492: 243-251.

Neame, R. L. (2014) Privacy protection for personal health information and shared care records. Inform Prim Care. 21: 84-91.

Phillips, M. (2018) International data-sharing norms: from the OECD to the General Data Protection Regulation (GDPR). Hum Genet. In Press.

Richards, S., Aziz, N., Bale, S., Bick, D., Das, S., Gastier-Foster, J., Grody, W. W., Hegde, M., Lyon, E., Spector, E., Voelkerding, K., Rehm, H. L. and ACMG Laboratory Quality Assurance Committee. (2015) Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 17: 405-424.

Rychlik, M., Zappa, G., Añorga, L., Belc, N., Castanheira, I., Donard, O. F. X., Kouřimská, L., Ogrinc, N., Ocké, M. C., Presser, K. and Zoani, C. (2018) Ensuring Food Integrity by Metrology and FAIR Data Principles. Front Chem. 6: 49.

Thorogood, A. (2018) Genomic data sharing in Canada: flying under the regulatory radar? Hum Genet. In Press.

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J. W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., Gonzalez-Beltran, A., Gray, A. J., Groth, P., Goble, C., Grethe, J. S., Heringa, J., 't Hoen, P. A., Hooft, R., Kuhn, T., Kok, R., Kok, J., Lusher, S. J., Martone, M. E., Mons, A., Packer, A. L., Persson, B., Rocca-Serra, P., Roos, M., van Schaik, R., Sansone, S. A., Schultes, E., Sengstag, T., Slater, T., Strawn, G., Swertz, M. A., Thompson, M., van der Lei, J., van Mulligen, E., Velterop, J., Waagmeester, A., Wittenburg, P., Wolstencroft, K., Zhao, J. and Mons, B. (2016) The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 3: 160018.