Data Management and Quality Control at the Core
The very large volumes of data produced by current genetic techniques make data management and quality control critically important. Relevant issues include the collection, cleaning, management, archiving, integration, security, presentation, and dissemination of data. Scientists in the UCLA Department of Human Genetics have developed novel methods for data quality control, which are implemented in an integrated data system used to store and analyze all genotype data generated in the UCLA Genotyping Core.
Once the genotypes have been called, they are imported into a Microsoft SQL Server database for error checking and data cleaning. This database holds all the genotypes generated in the Core, as well as information on the individual projects, such as locus information, pedigree information, phenotypic data, tissue source, DNA concentration, sample location, and the instruments and technicians that generated the data for the project. The database also holds marker information such as size range, heterozygosity, allele frequency, and map order.
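The kind of relational layout described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module in place of Microsoft SQL Server, and the table names, column names, marker name `M1`, and sample values are all hypothetical rather than the Core's actual schema. It shows how storing the marker size range alongside the genotypes lets a single query flag alleles that fall outside the expected range.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE marker (
            marker_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL UNIQUE,
            min_size  INTEGER NOT NULL,  -- lower bound of expected allele size
            max_size  INTEGER NOT NULL   -- upper bound of expected allele size
        );
        CREATE TABLE sample (
            sample_id      INTEGER PRIMARY KEY,
            project        TEXT NOT NULL,
            dna_conc_ng_ul REAL
        );
        CREATE TABLE genotype (
            sample_id INTEGER NOT NULL REFERENCES sample(sample_id),
            marker_id INTEGER NOT NULL REFERENCES marker(marker_id),
            allele1   INTEGER NOT NULL,
            allele2   INTEGER NOT NULL,
            PRIMARY KEY (sample_id, marker_id)
        );
        """
    )
    # Made-up demonstration rows, not real data.
    conn.execute("INSERT INTO marker VALUES (1, 'M1', 100, 120)")
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?, ?)",
        [(1, "demo", 50.0), (2, "demo", 42.5)],
    )
    conn.executemany(
        "INSERT INTO genotype VALUES (?, ?, ?, ?)",
        [(1, 1, 104, 112), (2, 1, 104, 131)],
    )
    return conn


def out_of_range_genotypes(conn):
    """Return genotypes with at least one allele outside the marker size range."""
    return conn.execute(
        """
        SELECT g.sample_id, m.name, g.allele1, g.allele2
        FROM genotype AS g
        JOIN marker AS m ON m.marker_id = g.marker_id
        WHERE g.allele1 NOT BETWEEN m.min_size AND m.max_size
           OR g.allele2 NOT BETWEEN m.min_size AND m.max_size
        ORDER BY g.sample_id
        """
    ).fetchall()


if __name__ == "__main__":
    for row in out_of_range_genotypes(build_demo_db()):
        print(row)
```

In this sketch the second sample is flagged because one of its alleles (131) lies outside the 100 to 120 range recorded for the marker.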
Scientists in the UCLA Department of Human Genetics have developed statistical methods to trap errors and allow a more accurate data set to be passed on for further statistical analysis. The quality checks are both local and global: each genotype is first evaluated independently according to a number of quality parameters, and the overall data set is then judged by population-based statistical methods[1]. These methods apply to data sets both with and without family structure information. Finally, for pedigree-based data sets, a statistical analysis is performed that provides posterior mistyping probabilities at each genotype[2]. All results obtained during the quality control process are fed back into the relational database. The use of a relational database confers the additional advantages of improved integrity, management, manipulation, and presentation of the considerable amounts of data generated in large genome studies. Tests can be applied during the course of a study, as more data become available. The results are useful not only for correcting errors in the genotype data but also as feedback to improve and streamline genotype interpretation and experimental protocols at the Core. When the data have been thoroughly checked and validated, the results can be exported in a variety of formats for analysis by different statistical packages.
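The text does not say which population-based statistics the Core uses. As a stand-in, the sketch below shows one widely used global check, a chi-square test for departure from Hardy-Weinberg equilibrium at a biallelic marker. The function names and the choice of test are assumptions made for illustration, not the methods of [1] or [2]. The threshold 3.841 is the chi-square critical value for one degree of freedom at the 0.05 level.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed genotype counts at a
    biallelic marker with Hardy-Weinberg expectations."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype is required")
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p                      # estimated frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Classes with zero expected count (monomorphic marker) are skipped.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def deviates_from_hwe(n_aa, n_ab, n_bb, critical=3.841):
    """Flag a marker whose statistic exceeds the chosen critical value
    (default: 0.05 level, one degree of freedom)."""
    return hwe_chi_square(n_aa, n_ab, n_bb) > critical


if __name__ == "__main__":
    # Hypothetical counts: a marker with no heterozygotes at all.
    print(deviates_from_hwe(50, 0, 50))
```

A marker flagged this way is not necessarily mistyped, since population structure or selection can also cause deviation. In a workflow like the one described, such a flag would mainly prompt a closer look at the genotype calls for that marker.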
Special consideration is given to the issues of data security and patient confidentiality. To safeguard patient confidentiality to the highest degree, no information that could identify a patient is stored in the Genotyping Core databases connected to the network. Serial back-ups of the databases are stored at a remote site. Raw image data are maintained online for a period of several months, while the data are likely to require frequent referencing. After this period, the image files are archived permanently.