Our large pharma client needed a platform offering generalized data pipelines for complex data types, across all clinical trials and indications, to enable data science analyses. The platform is designed to process clinical trial data from many types of wearables and biosensors, including actigraphy, ECG, and PPG. The client had existing preprocessing and feature extraction pipelines for specific modalities, such as ECG, that could be reused. However, these legacy pipelines were written in MATLAB and were bespoke to specific clinical trial designs. MATLAB required costly licensing and was poorly suited to deploying the pipeline at scale in the cloud. Further, the pipeline lacked input validation, handling of missing data and multiple leads, datetime tracking, and other functionality needed to make it fully generalizable to any new incoming clinical trial dataset.
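To make these gaps concrete, here is a minimal Python sketch of the checks a generalized ECG loader might run before any processing. It is an illustration only, not the client's pipeline: the function name `validate_ecg`, the input layout (a dict of lead name to samples), and the report fields are all assumptions made for this example.

```python
import math
from datetime import datetime, timedelta, timezone


def validate_ecg(start, fs_hz, leads):
    """Validate a multi-lead ECG recording and summarize missing data.

    Hypothetical input layout (an assumption for this sketch):
      start -- timezone-aware datetime of the first sample
      fs_hz -- sampling rate in Hz
      leads -- dict mapping lead name to a list of samples,
               where None or NaN marks a missing sample
    """
    # Datetime tracking: refuse naive timestamps so recordings can be aligned.
    if not isinstance(start, datetime) or start.tzinfo is None:
        raise ValueError("start must be a timezone-aware datetime")
    if fs_hz <= 0:
        raise ValueError("sampling rate must be positive")
    if not leads:
        raise ValueError("at least one lead is required")

    # Multiple leads: every lead must cover the same number of samples.
    lengths = {len(samples) for samples in leads.values()}
    if len(lengths) != 1:
        raise ValueError("all leads must have the same number of samples")
    n_samples = lengths.pop()
    if n_samples == 0:
        raise ValueError("recording is empty")

    def is_missing(x):
        return x is None or (isinstance(x, float) and math.isnan(x))

    # Missing data: report the fraction of missing samples per lead.
    missing_fraction = {
        name: sum(is_missing(x) for x in samples) / n_samples
        for name, samples in leads.items()
    }

    # Timestamp of the last sample, derived from start time and sampling rate.
    end = start + timedelta(seconds=(n_samples - 1) / fs_hz)
    return {"n_samples": n_samples, "end": end, "missing_fraction": missing_fraction}


# Example: a tiny two-lead recording sampled at 2 Hz with some missing samples.
example_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
report = validate_ecg(
    example_start,
    fs_hz=2,
    leads={"I": [0.1, None, 0.3, 0.2], "II": [0.0, 0.1, float("nan"), float("nan")]},
)
```

Returning a report rather than silently dropping bad samples lets downstream steps decide, per study, how much missing data is acceptable.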
We converted the MATLAB pipeline into Python, a free, cloud-friendly language. We used unit tests to confirm that the MATLAB and Python outputs matched exactly on example data. We added functionality so that the pipeline generalizes across different input structures and varying levels of input data quality. The Python version was deployed at scale in the cloud within the company’s strategic platform, which builds from a git repository, imports the module, and accepts inputs from S3. The resulting transformer was designed to be accessed and deployed through the client’s user interface. This generalized, GxP-compliant ECG pipeline can support digital biomarker discovery across many clinical trials and indications, with differently structured data and study designs.
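The output-matching tests mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual test suite: `rr_intervals_ms` is a stand-in feature, `assert_parity` is a hypothetical helper, and the JSON string stands in for reference values exported from the legacy pipeline.

```python
import json
import math


def rr_intervals_ms(r_peak_times_ms):
    """Stand-in feature for this sketch: intervals between successive R peaks."""
    return [b - a for a, b in zip(r_peak_times_ms, r_peak_times_ms[1:])]


def assert_parity(python_out, reference_out, abs_tol=0.0):
    """Raise AssertionError unless both outputs match element by element.

    abs_tol=0.0 demands exact equality; a small tolerance can be passed
    if floating-point differences between the two languages are expected.
    """
    if len(python_out) != len(reference_out):
        raise AssertionError(
            f"length mismatch: {len(python_out)} vs {len(reference_out)}"
        )
    for i, (got, want) in enumerate(zip(python_out, reference_out)):
        # Treat NaN in both outputs as a match (missing value in both).
        if math.isnan(got) and math.isnan(want):
            continue
        if not math.isclose(got, want, rel_tol=0.0, abs_tol=abs_tol):
            raise AssertionError(f"mismatch at index {i}: {got} vs {want}")


# Hypothetical reference values, as if exported to JSON from the legacy pipeline.
reference_json = "[800, 800, 900]"


def test_rr_intervals_match_reference():
    reference = json.loads(reference_json)
    result = rr_intervals_ms([0, 800, 1600, 2500])
    assert_parity(result, reference)
```

A test runner such as pytest would collect `test_rr_intervals_match_reference` automatically. Keeping the reference values as a fixed file means the Python tests can run without a MATLAB license.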