The Center for Data Innovation spoke with Doug Cutting, Chief Architect of Palo Alto, California-based software company Cloudera and co-creator of the Hadoop data processing framework. Cutting discussed when businesses should consider an enterprise big data software solution and how he feels about Hadoop’s massive popularity.
This interview has been lightly edited.
Travis Korte: For those who may be unfamiliar, could you give a brief introduction to Cloudera, who uses it, and for what?
Doug Cutting: Cloudera provides software, support and services for enterprise data management. The Apache Hadoop open-source project heralded a new way of storing and processing data. It demonstrated that enterprises could manage their data with greater flexibility at an order of magnitude lower cost. The architecture Hadoop delivers is one where, instead of moving data to applications, applications can be brought to the data. Gone are silos, replaced with a general-purpose, centralized storage and compute resource that supports the full range of analysis and processing needs, including SQL, NoSQL, search, streaming, etc.
Cloudera fills the gap between the raw, open-source software and sophisticated enterprise needs. Enterprises need support: someone they can call when things don’t go as expected. Cloudera also provides operational tools to ease installation, configuration, and monitoring of the software. We provide advanced data management tools that secure systems, audit actions, and track data provenance.
We call this combination an enterprise data hub (EDH). It provides an organization with a single place they can confidently manage all of their data, implementing their extract-transform-load (ETL) engine, data warehouse, online archive, production search system, etc. With an EDH, folks can extract much more value from the data their business generates.
TK: What are some of the primary obstacles for organizations hoping to implement large-scale data processing initiatives?
DC: The EDH is a new platform. Lack of familiarity slows adoption. Cloudera offers training courses to help folks through this transition, but there’s still a learning curve. There are also still some missing features and rough edges here and there, as the EDH catches up with prior technologies whose implementations have had more time to mature. Lastly, there’s just inertia. Institutions are reasonably reluctant to fix what’s not broken. Folks should deploy an EDH as they need it, when existing solutions are hampering their business either through inflexibility or cost, not just because it’s the shiny new thing.
TK: Cloudera helps bring Hadoop to the enterprise. Have you worked with government agencies in the same way, or do you plan to?
DC: Cloudera has a substantial government business. We have customers in the defense and intelligence communities as well as in the civilian sector.
TK: Hadoop is now deployed in an impressively broad array of fields. Do you still get surprised by new use cases? Any favorites you’d care to share?
DC: I love finding out that products I use are powered by Hadoop. For example, I’m a customer of Netflix, Chevron, and Citibank, who all use Hadoop. Then there are companies that I would never have guessed would come to use Hadoop, like John Deere, Caterpillar, and BNSF. Lastly, there are those that are just awesome, like Skybox Imaging.
TK: Relatedly, you’ve watched as Hadoop has grown from a humble search utility into a global driver of data processing. Was there a point when you thought “this might be huge,” or did you have that sense all along?
DC: I am as surprised as anyone at how popular Hadoop has become. When Google published its papers, I realized that a general-purpose scalable computing platform like the one they described would be useful to lots of folks outside of Google and that an open-source implementation would be a great avenue to deliver this. But I was mostly thinking about those few folks building search engines and doing academic web research, not the Fortune 1000. I’d not yet realized the extent to which technology and the data it generates would permeate nearly every industry, that the web was just the vanguard and that the rest of the economy would soon follow. It’s been quite a ride so far!