Data Standards and Integration for Biomedical Research

Ptolemy.V™ is our newly released data integration solution. It makes it easy for researchers to discover data of interest and recombine it into new integrated output data sets, while applying and leveraging data standards – even when data standards were not employed in the original source data.

To make data visible to Ptolemy.V™, it must be imported and registered. During registration, Ptolemy.V™ can scan the source data and automatically generate a data dictionary for your source data set. Ptolemy.V™ stores the data element names and data types (classification, date, number, text, etc.) in its metadata repository, along with any additional descriptive data that is provided (such as data descriptions, usage, time stamps, etc.), and then automatically searches for potentially related data elements in its data standards repository. (This repository can be preloaded with standard data elements such as those from the caDSR or from NINDS.) During registration, Ptolemy.V™ also searches previously registered source data sets for additional potentially related data elements. Users can then identify relationships of interest, which are recorded in a special database optimized for storing and searching such data relationships.
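To make the idea of an automatically generated data dictionary concrete, here is a minimal Python sketch of scanning a CSV source and guessing a coarse type for each data element. It is an illustration only, not Ptolemy.V™'s actual algorithm: the function names, the ISO date format, and the `max_categories` threshold are all assumptions made for this example.

```python
import csv
import io
from datetime import datetime


def _is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def _is_date(value, fmt="%Y-%m-%d"):
    # Assumption: dates are ISO formatted; real sources vary widely.
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_type(values, max_categories=10):
    """Guess one of: number, date, classification, text."""
    present = [v for v in values if v.strip()]
    if not present:
        return "text"
    if all(_is_number(v) for v in present):
        return "number"
    if all(_is_date(v) for v in present):
        return "date"
    distinct = set(present)
    # Hypothetical rule: few distinct values that repeat look like categories.
    if len(distinct) <= max_categories and len(distinct) < len(present):
        return "classification"
    return "text"


def build_data_dictionary(csv_text):
    """Map each column name in a CSV document to its inferred type."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {
        name: infer_type([row[name] or "" for row in rows])
        for name in (reader.fieldnames or [])
    }


SAMPLE = (
    "subject_id,visit_date,age,sex,notes\n"
    "S1,2020-01-05,34,F,headache reported\n"
    "S2,2020-02-11,41,M,\n"
    "S3,2020-03-02,29,F,no complaints\n"
)

print(build_data_dictionary(SAMPLE))
# {'subject_id': 'text', 'visit_date': 'date', 'age': 'number',
#  'sex': 'classification', 'notes': 'text'}
```

A production scanner would need to handle more date formats, numeric codes that are really categories, and much larger samples. The sketch only shows the shape of the output: a mapping from data element name to type.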

Next, Ptolemy.V™ imports and stores the source data associated with each data element in its repository. For uploading and importing, Ptolemy.V™ can access data in your local environment or connect to a variety of cloud storage sources. Ptolemy.V™ comes with a flexible and scalable raw data repository designed and optimized for columnar data, but it can easily use your Big Data infrastructure instead of, or in addition to, its built-in raw data repository.

Ptolemy.V™ provides a powerful full-text search facility that allows a researcher to find data elements based on their names, values, descriptive information, keywords, etc. From the search results, a researcher can select data elements of interest and review the source data sets from which a selected data element originated, along with the other data elements those data sets contain. Ptolemy.V™ also lets a researcher ‘browse’ the source data associated with a data element. It automatically generates key statistics, such as a list of the unique values that appear in the data, the total number of values, and the number of null values, as well as graphical data visualizations for each data element, all of which help inform the researcher about the data element and the actual data.
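The key statistics listed above are simple to compute for a single column of values. The following Python sketch shows one way to do it. The function name, the set of tokens treated as null, and the choice to count nulls in the total are assumptions for illustration, not a description of how Ptolemy.V™ computes them.

```python
from collections import Counter

# Assumption: these tokens mark a missing value in the source data.
NULL_TOKENS = frozenset({"", "NA", "N/A", "null"})


def summarize_element(values, null_tokens=NULL_TOKENS):
    """Return basic profile statistics for one data element's values."""
    cleaned = [v.strip() for v in values]
    present = [v for v in cleaned if v not in null_tokens]
    counts = Counter(present)
    return {
        "total_values": len(cleaned),  # includes null values
        "null_values": len(cleaned) - len(present),
        "unique_values": sorted(counts),
        "frequencies": dict(counts),  # could feed a bar chart
    }


summary = summarize_element(["F", "M", "", "F", "NA", " M "])
print(summary["total_values"], summary["null_values"], summary["unique_values"])
# 6 2 ['F', 'M']
```

The `frequencies` entry is the kind of data a bar chart or histogram of the data element could be drawn from.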

Ptolemy.V™ allows the researcher to select one or more data elements and incorporate them into an output data set. It also copies the data associated with the selected data elements into the output data set. In many cases, the data from each source will need to be converted into a consistent output format. Ptolemy.V™ allows the researcher to create and store a conversion routine that translates the imported source data into the desired output format, and it uses these conversions to generate the output data set on demand. Data can be converted from any registered data element into any related data element, including related standardized data elements, making it easier for researchers to take advantage of standards. Output data sets can be downloaded in a common format that can be imported into tools such as SAS, Excel, or R.
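As an illustration of what a stored conversion routine might look like, the Python sketch below registers one conversion per pair of source and target data elements and applies them to build a single standardized column from two differently coded sources. The element names (`site_a.gender`, `site_b.sex`, `std.sex`), the codes, and the registry functions are hypothetical and are not taken from Ptolemy.V™, the caDSR, or NINDS.

```python
# Registry of conversions keyed by (source element, target element).
CONVERSIONS = {}


def register_conversion(source_element, target_element, func):
    """Store a routine that converts one source value to the target format."""
    CONVERSIONS[(source_element, target_element)] = func


def convert(source_element, target_element, values):
    """Apply the stored routine to every value of a source data element."""
    func = CONVERSIONS[(source_element, target_element)]
    return [func(v) for v in values]


def build_output_column(target_element, sources):
    """Concatenate converted values from several (element, values) sources."""
    column = []
    for source_element, values in sources:
        column.extend(convert(source_element, target_element, values))
    return column


# Hypothetical source A codes sex as "1"/"2"; unknown codes become None.
register_conversion("site_a.gender", "std.sex",
                    lambda v: {"1": "M", "2": "F"}.get(v))
# Hypothetical source B spells the value out; keep the first letter.
register_conversion("site_b.sex", "std.sex",
                    lambda v: v[:1].upper() if v else None)

column = build_output_column("std.sex", [
    ("site_a.gender", ["1", "2", "9"]),
    ("site_b.sex", ["female", "male", ""]),
])
print(column)  # ['M', 'F', None, 'F', 'M', None]
```

Because the routines are stored rather than applied once and discarded, the same output column can be regenerated whenever it is requested.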

Ptolemy.V™ stores the selected data elements and conversion routines together in a form that can be easily edited and re-executed. This makes it much easier for a researcher to add data elements or to regenerate the integrated data when the source data is updated. Moreover, users can copy data element selections and conversions, either to generate their own copy of the output data set for download or to edit them into their own variations of the output. Herein lies the power and innovation provided by Ptolemy.V™: the ability to enable data reuse for a whole community of researchers, to easily accommodate changes and additions in the source data, and to easily regenerate integrated output data sets, all while maintaining conformance to data standards.
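One way to picture a stored, re-executable output definition is as a small declarative specification that names each selected data element and the conversion applied to it. The Python sketch below is a hypothetical illustration of that idea. The spec layout, the element names, and the `run_spec` function are assumptions, not Ptolemy.V™'s storage format.

```python
import copy
import json

# Hypothetical named conversions that a spec can refer to by key.
NAMED_CONVERSIONS = {
    "identity": lambda v: v,
    "lb_to_kg": lambda v: round(float(v) * 0.45359237, 1),
}

# The spec holds only names, so it can be saved as JSON, copied, and edited.
spec = {
    "name": "weights_v1",
    "columns": [
        {"source": "site_a.subject", "output": "subject_id",
         "conversion": "identity"},
        {"source": "site_a.weight_lb", "output": "weight_kg",
         "conversion": "lb_to_kg"},
    ],
}


def run_spec(spec, source_data):
    """Generate output rows from a spec and the current source data."""
    columns = {}
    for col in spec["columns"]:
        func = NAMED_CONVERSIONS[col["conversion"]]
        columns[col["output"]] = [func(v) for v in source_data[col["source"]]]
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


source_data = {
    "site_a.subject": ["S1", "S2"],
    "site_a.weight_lb": ["150", "200"],
}
print(run_spec(spec, source_data))
# [{'subject_id': 'S1', 'weight_kg': 68.0}, {'subject_id': 'S2', 'weight_kg': 90.7}]

# When the source is updated, re-running the same spec picks up the new row.
source_data["site_a.subject"].append("S3")
source_data["site_a.weight_lb"].append("180")
updated_rows = run_spec(spec, source_data)

# Another user copies the spec and edits it to create a variation.
variant = copy.deepcopy(spec)
variant["name"] = "weights_only"
variant["columns"] = variant["columns"][1:]
saved = json.dumps(variant)  # the spec round-trips through JSON
```

Keeping the definition separate from the generated output is what makes regeneration and per-user variations cheap in this sketch: nothing has to be redone by hand when the source data changes.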