The Center for Data Innovation spoke to John Myers, chief technology officer of Gretel.ai, a San Diego-based startup that helps developers use data efficiently while implementing key privacy safeguards. Myers discussed some of the technology used in privacy engineering along with its benefits and drawbacks.
Gillian Diebold: What is the central problem that Gretel aims to address?
John Myers: Gretel wants to enable teams and organizations to better innovate with data. Today, there are several bottlenecks when it comes to sharing or accessing data. These bottlenecks include data silos with strict access controls and long approval processes for access to production data.
We believe that these bottlenecks can be removed if the right tools are available to create safe versions of data. If developers are able to create safe versions of data, then teams and organizations can more quickly share and access the data, which means faster innovation, product development, and general problem-solving.
Diebold: What is synthetic data, and what are its benefits and drawbacks?
Myers: Synthetic data is like a “clone” of data that looks and feels just like the real-world data that it is based on. Our machine learning capabilities will take some original data and learn how to create brand new records that are an effective representation of the original ones. This synthetic data will have all of the same statistical properties and valuable insights as the original data and the Gretel service specifically ensures privacy risks have been removed.
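To make "same statistical properties" concrete, one simple sanity check is to compare summary statistics of an original column against its synthetic counterpart. The sketch below uses only Python's standard library; both value lists are made-up illustrations, not output from any real generator.

```python
import statistics

def distribution_summary(values):
    """Summarize a numeric column so original and synthetic
    versions can be compared side by side."""
    return {"mean": statistics.mean(values),
            "stdev": statistics.pstdev(values)}

original = [23, 31, 45, 52, 38, 29, 41]   # e.g. ages in a real dataset
synthetic = [25, 30, 44, 50, 40, 28, 42]  # hypothetical synthetic clone

# A faithful synthetic dataset should produce summaries close to the
# original's, even though no individual record is copied.
print(distribution_summary(original))
print(distribution_summary(synthetic))
```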
The benefits of privacy-focused synthetic data are that it is safe to share thanks to the privacy protection measures in the Gretel service, it contains the same statistical properties and insights as the original data, and it can be used to solve problems you cannot necessarily solve with the original data. For example, with synthetic data, you can create new records to correct an imbalance or bias that exists in the original dataset. Most data is biased, so the analytic results derived from that data are usually biased as well. You can synthesize records containing attributes that may be underrepresented in the original data, such as generating more records for specific genders, ages, races, or ethnicities, so that your analysis of the data will be less biased.
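The rebalancing idea can be sketched in a few lines. The helper below is a hypothetical illustration (not part of any Gretel API): it only computes how many synthetic records each underrepresented group needs; the actual record generation would be done by a conditional generative model.

```python
from collections import Counter

def records_needed_for_balance(labels):
    """Given a list of group labels (e.g. gender or age bracket),
    return how many synthetic records to generate per group so that
    every group matches the size of the largest one."""
    counts = Counter(labels)
    target = max(counts.values())
    return {group: target - n for group, n in counts.items()}

# Hypothetical attribute column from an imbalanced dataset:
labels = ["A"] * 90 + ["B"] * 10
print(records_needed_for_balance(labels))  # {'A': 0, 'B': 80}
```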
On the other hand, the drawbacks of synthetic data are that it requires complex configuration of compute environments and potentially advanced machine learning knowledge, and it can be difficult to assess the privacy and efficacy of the resulting synthetic data. At Gretel, we recognize these drawbacks and have built solutions to them into our product for this reason, while keeping our core technology open source. Gretel can automatically manage systems for you in the cloud, removing the need to understand machine learning in depth, and every time you generate synthetic data, we provide a “report card” outlining the usability and privacy of that synthetic data.
Diebold: Gretel claims to be a “developer-first” company. Can you explain what that means?
Myers: Being a “developer-first” company means that our tools are designed with developers as one of our primary users. We orient around serving developers because they are on the “front lines” of solving an organization’s most challenging problems, which are almost always data-centric. This also means that developers are on the front lines of privacy. There are two core tenets when it comes to building a product for developers: accessibility and transparency.
First, we aim to take a variety of very complex privacy engineering tasks and make them easily accessible. Data labeling, transformations, and synthetic data generation are very complex and discrete tasks. In order to make these tools easier to use, we have built Gretel as a service, backed by APIs which streamline privacy engineering tasks in a unified way. Developers can interact with these APIs by using our cloud-native console, command line interface (CLI), or software development kit (SDK). It’s free to get started and Gretel will offer a “developer” plan that gives access to our full suite of privacy engineering tools.
Second, Gretel has an open-source core. We’ll always be transparent with the developer community and our core libraries, the exact ones we even use in our service, will be open source and free forever to the developer community. Privacy is a zero-sum game, and we’ve decided to build on top of an open-source core so anyone can peek under the hood to better understand the nuts and bolts of what we do and to get feedback from the broader developer community.
Diebold: Is synthetic data the future? What other tools do you think can solve the privacy dilemma?
Myers: Synthetic data, alongside privacy engineering as a critical step in the creation and consumption of data, is the future. Synthetic data is a very powerful and flexible tool that is part of a larger overall toolkit. We think that, due to its various benefits, synthetic data will eventually supersede the use of real data.
Other tools for privacy engineering include natural language processing, named entity recognition, and traditional data transformation techniques such as pseudo-anonymization, tokenization, and encryption, all of which should be considered depending on a team’s needs.
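One of the transform techniques mentioned, keyed tokenization (a form of pseudo-anonymization), can be sketched with Python's standard library. The key handling and helper name here are illustrative assumptions, not any particular product's API.

```python
import hashlib
import hmac

SECRET_KEY = b"example-key"  # assumption: in practice, load from a secrets manager

def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed, deterministic token (HMAC-SHA256).
    The same input always maps to the same token, so joins across tables
    still work, but the original value cannot be recovered without the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

row = {"email": "jane@example.com", "purchase_total": 42.50}
safe_row = {**row, "email": pseudonymize(row["email"])}
```

Because the tokenization is deterministic under a fixed key, two tables that share an email column can still be joined on the tokenized values.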
Gretel aims to combine these tools under a common set of APIs that are easily accessible and scalable. By combining all of these tools, organizations, teams, and developers can choose the right combination in order to create safe data for sharing and collaboration.
There are also a large variety of complementary tools that can be used to operationalize privacy engineering with Continuous Integration / Deployment (CI/CD) workflows and Extract, Transform, and Load (ETL) automation. Tools like GitHub, GitLab, Airflow, Prefect, Airbyte, and dbt all provide great injection points for privacy engineering tools in a broader data engineering pipeline.
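The injection-point idea can be illustrated without any particular orchestrator: in an ETL pipeline, a privacy step is just another transform between extract and load. This is a minimal, library-free sketch; the function names and redaction step are hypothetical.

```python
from typing import Callable, Iterable

def run_pipeline(extract, transforms: Iterable[Callable], load):
    """Minimal ETL skeleton: privacy steps are ordinary transforms,
    so they can be injected anywhere between extract and load."""
    data = extract()
    for step in transforms:
        data = step(data)
    return load(data)

# Hypothetical privacy step injected into the pipeline:
redact = lambda rows: [
    {k: ("<redacted>" if k == "ssn" else v) for k, v in r.items()} for r in rows
]

result = run_pipeline(
    extract=lambda: [{"ssn": "123-45-6789", "amount": 10}],
    transforms=[redact],
    load=lambda rows: rows,
)
# result == [{"ssn": "<redacted>", "amount": 10}]
```

In a real deployment the same pattern would sit inside an Airflow or Prefect task rather than a plain function call.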
Diebold: Much of your team consists of veterans and former federal employees. How do you think those experiences shaped Gretel’s mission?
Myers: My co-founders and I are veterans of government service and of the commercial industry, by way of previous startups and work at large publicly traded organizations. Our past experiences all shared challenges with privacy engineering, which motivated us to work together and launch Gretel just over two years ago. Those collective challenges led us to believe that privacy is a challenge rooted in developer workflows, and that belief shapes the ethos of the company.
Gretel’s mission, however, has largely been shaped by each and every person we’ve hired since launching the company. One truly unique thing about Gretel is that we have a highly diverse team, drawing from a variety of industry verticals, job disciplines, and personal experiences. The one common theme among everyone is that we’ve all experienced the friction of sharing or accessing data. Everyone is committed to building the product we all wish we had in our past roles, so now we’re on a mission to build a privacy engineering product for every developer out there.