Noa Schork
Faculty Sponsor: Dr. D. Thurtle-Schmidt
ClinVar, a database hosted by NCBI, catalogs clinically reported human variants and their phenotypic effects. The mechanisms by which many of these variants affect proteins and disease phenotypes are not yet understood, which creates a need for studies in model organisms. NCBI provides tools to find variants of a gene and to compare genes across species, but there is not yet a tool that identifies conserved regions of a gene and then compares those regions to known clinical variants. We wrote a Python script that aligns human sequences with ortholog sequences to identify conserved regions that correspond to clinical variants annotated in ClinVar. Once the conserved regions have been identified, users can introduce similar mutations in the model organism and study their effects. The rise of genome editing techniques makes creating mutant model organisms accessible to most molecular biologists. For a given gene, the program successfully sorts clinical variants into those found in fully, strongly, weakly, and not conserved regions. It exports this information into two files: one contains the alignment, conservation symbols, and variant symbols, and the other is a tab-delimited file with the clinical variants organized by conservation. For the human gene NR5A1, there are 37 variants listed in the ClinVar file. Eleven of these variants occur in fully conserved regions, two in strongly conserved regions, five in weakly conserved regions, and nineteen in regions that are not conserved between humans, flies, frogs, chickens, and worms. As indicated in Figure 3, the majority of variants in fully conserved regions were pathogenic, whereas the variants in weakly conserved regions were spread evenly across pathogenic, benign, and unknown significance. Eventually the program will be hosted on a server so that anyone can use it, wherever they are, without downloading the necessary files.
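The classification step described above can be illustrated with a minimal sketch. This is not the authors' script. It assumes the alignment carries Clustal-style conservation symbols ("*" for fully conserved, ":" for strongly conserved, "." for weakly conserved, and a space for not conserved) and that each variant is given as a 1-based residue position in the ungapped human sequence. The function names and the toy sequence are hypothetical and chosen only for illustration.

```python
# Assumed Clustal-style conservation symbols; the real script may differ.
CONSERVATION_LABELS = {"*": "fully", ":": "strongly", ".": "weakly", " ": "not"}


def residue_to_column(aligned_human):
    """Map 1-based ungapped residue positions to 0-based alignment columns."""
    mapping = {}
    position = 0
    for column, char in enumerate(aligned_human):
        if char != "-":  # gaps do not consume a residue position
            position += 1
            mapping[position] = column
    return mapping


def classify_variants(aligned_human, conservation, variant_positions):
    """Group variant positions by the conservation symbol at their column."""
    # Trailing spaces are often stripped from symbol lines, so pad them back.
    conservation = conservation.ljust(len(aligned_human))
    columns = residue_to_column(aligned_human)
    groups = {label: [] for label in CONSERVATION_LABELS.values()}
    for position in variant_positions:
        symbol = conservation[columns[position]]
        # Any unrecognized symbol is treated as not conserved.
        groups[CONSERVATION_LABELS.get(symbol, "not")].append(position)
    return groups


# Toy data, not taken from NR5A1 or ClinVar.
aligned_human = "MK-TLQ"
conservation = "*: . "
groups = classify_variants(aligned_human, conservation, [1, 2, 3, 4, 5])
print(groups)
```

The gap-aware position map matters because ClinVar reports positions in the human sequence, whereas the conservation symbols are indexed by alignment column. Any gap inserted into the human row shifts every downstream column.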