The Center for Data Innovation spoke with Girish Pancha, co-founder and CEO of StreamSets, a U.S.-based data operations platform. Pancha explained how building better data pipelines can reduce downtime caused by changing data structures.
This interview has been edited.
Morgan Stevens: What was your motivation for starting StreamSets?
Girish Pancha: Arvind Prabhakar and I co-founded StreamSets in 2014 with a vision that data should be the lifeblood of the enterprise. We strongly believed data could drive the next advances in digital transformation with operationalized data analytics rather than just sitting in warehouses and lakes.
A new reality hindered that vision—the constant and accelerating pace of change in data platforms, structures, and semantics (i.e., data drift). In a world of data drift, we firmly believed that DataOps principles had to be embraced from the ground up in how data flowed in the enterprise and across enterprises. Thus we built and launched StreamSets Data Collector, which evolved into the StreamSets DataOps Platform, an enterprise-proven data integration platform designed for the modern data stack, with the ability to tackle data drift embedded at its core.
Stevens: What is data drift, and how does it affect business operations?
Pancha: Data drift is the constant, unexpected change in data platforms, structures, and semantics. It wreaks havoc on downstream data analytics and business operations.
For example, consider what seems like a simple transition from 10-digit to 12-digit ID numbers in a core business application. It may affect thousands of applications and data consumers across an enterprise. If the change is known ahead of time, it leads to scheduled downtime for data pipelines. If it goes undetected, data pipelines break or cause data loss and corruption, data consumers act on false or missed insights, and trust in the data erodes. Data engineers and operators are constantly firefighting and performing endless janitorial data processing instead of focusing on new business imperatives.
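The ID-length scenario above can be sketched in a few lines. This is a hypothetical illustration, not StreamSets code: a drift-aware stage classifies each record instead of letting an unexpected format silently break the pipeline, routing old-format, drifted, and malformed records to separate lanes.

```python
import re

# Assumed record shape and field name ("customer_id") are illustrative only.
EXPECTED_ID = re.compile(r"^\d{10}$")   # the original 10-digit format
DRIFTED_ID = re.compile(r"^\d{12}$")    # the new 12-digit format

def classify(record: dict) -> str:
    """Tag each record rather than failing or corrupting downstream data."""
    customer_id = record.get("customer_id", "")
    if EXPECTED_ID.fullmatch(customer_id):
        return "ok"
    if DRIFTED_ID.fullmatch(customer_id):
        return "drifted"   # schema changed upstream: 12-digit IDs appeared
    return "error"         # malformed record; route to an error lane

records = [
    {"customer_id": "0123456789"},    # old 10-digit format
    {"customer_id": "012345678901"},  # new 12-digit format (drift)
    {"customer_id": "abc"},           # corrupt record
]

lanes = {"ok": [], "drifted": [], "error": []}
for rec in records:
    lanes[classify(rec)].append(rec)

print({name: len(items) for name, items in lanes.items()})
```

A surge in the "drifted" lane signals the format change to operators before it corrupts analytics, rather than after.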
Stevens: How does the StreamSets DataOps Platform optimize data management and integration?
Pancha: We built StreamSets' data pipeline capabilities on three fundamental principles of DataOps: continuous design, continuous operations, and continuous data observability. Architecting for change based on these principles means that pipelines are highly decoupled at design time and instrumented for run-time monitoring and manageability. These smart data pipelines make it easy to respond and adapt to new business conditions and innovations with speed and agility.
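The decoupling and instrumentation described above can be sketched minimally. This is an assumption-laden illustration of the general pattern, not the StreamSets platform: the transform is pluggable (decoupled design), and every run emits metrics that operators can observe (run-time observability).

```python
import time
from dataclasses import dataclass, field

@dataclass
class StageMetrics:
    """Run-time counters a monitoring system could scrape."""
    processed: int = 0
    errors: int = 0
    started: float = field(default_factory=time.monotonic)

    def throughput(self) -> float:
        elapsed = time.monotonic() - self.started
        return self.processed / elapsed if elapsed > 0 else 0.0

def run_stage(records, transform, metrics: StageMetrics):
    """Decoupled stage: the transform is pluggable, and every run is observable."""
    out = []
    for rec in records:
        try:
            out.append(transform(rec))
            metrics.processed += 1
        except Exception:
            metrics.errors += 1  # surfaced to operators rather than swallowed
    return out

metrics = StageMetrics()
result = run_stage([1, 2, "x"], lambda r: r + 1, metrics)
print(result, metrics.processed, metrics.errors)
```

Because the stage reports error counts instead of halting, a drifting or malformed input shows up in dashboards as a metric anomaly rather than a midnight pipeline outage.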
Stevens: What are the biggest challenges with building and governing smart data pipelines at scale?
Pancha: The challenges lie in the fact that the end-to-end data supply chain typically spans a variety of data processing patterns (ETL, ELT, and ML), modes (batch and streaming), and throughputs (monotonically increasing vs. bursting). Today’s tooling and skill landscape is increasingly fragmented, making data integrations bespoke and non-repeatable to build. This drains developer productivity and makes it impossible to deliver freshness, quality, and privacy service-level agreements (SLAs) for the end-to-end data supply chain. Too much time gets spent hunting down the root causes of SLA breaches, inevitably harming the business.
Stevens: How will data integration and management evolve in the coming years?
Pancha: Given the fragmented nature of data sources in the enterprise, data integration will remain a hybrid, multi-cloud technology for the foreseeable future. As DataOps practices cement across the modern data stack, with best practices and architecture templates such as data meshes, data pipelines will become increasingly invisible to the enterprise. Data integration will evolve to embrace next-generation data catalogs that are always up to date and can operationalize adjacent disciplines such as master data management (MDM) and data privacy. In the future, the business will focus not on the details of data integration but on its output as data products.