Databricks Strengthens Governance and Secure Sharing in the Lakehouse

Data governance is one of the four pillars needed for the future of AI, along with backward-looking analytics, forward-looking AI, and real-time decision making. To that end, Databricks rolled out several new governance capabilities for its unified Lakehouse architecture at the Data+AI Summit, including the general availability (GA) of Unity Catalog and Delta Sharing and the unveiling of Databricks Marketplace and Cleanrooms.

Anyone who has had to manage big data knows that data governance for AI is very complex, says Matei Zaharia, a Databricks co-founder and its CTO. For starters, it’s difficult to control permissions on disparate data repositories. Some repositories support setting granular row-level and column-level restrictions, while others, such as Amazon S3, do not.

“And it’s also very difficult to change your data organization models,” Zaharia said during his keynote at the Data+AI Summit on Tuesday in San Francisco. “You have to move all the files if you want to change your directory structure. So that’s already a bit of a pain.

“On top of that, you probably want to think of your data as tables and views,” the creator of Spark continued. “So you could have something like the Hive metastore, where you set permissions on tables and views. That’s another source of confusion.”

Managing data permissions and access control in a busy data lake can be a big challenge (lucadp/Shutterstock)

Data warehouses will generally support a richer approach based on SQL statements and GRANTs, he said. “And then you have many other systems, like your machine learning platform, your dashboards, etc., and they all have their own way of doing permissions, and you have to make sure your policies are consistent across all of these areas.”

The company addresses this hodgepodge of data governance approaches with Unity Catalog. Databricks first unveiled Unity Catalog a year ago at the Data+AI Summit and announced yesterday that it will become generally available on AWS and Microsoft Azure in the coming weeks.

Unity Catalog provides a centralized governance solution that brings features such as integrated search and discovery and automated lineage to all data workloads. The product applies permissions to tables using ANSI SQL GRANTs, Zaharia said, and it can also control access to other data assets, such as files stored in an object store, via REST.
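To make that concrete, here is a minimal sketch of what ANSI SQL GRANTs look like when issued from a Databricks notebook against a Unity Catalog table. The catalog, schema, table, and group names below are hypothetical placeholders, not anything announced at the summit.

```python
# Minimal sketch (hypothetical names): managing table permissions with
# ANSI SQL GRANTs from a Databricks notebook, where `spark` is the
# SparkSession that Databricks notebooks provide automatically.

# Give a group read access to one table.
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# Inspect who currently holds which privileges on the table.
spark.sql("SHOW GRANTS ON TABLE main.sales.orders").show()

# Take the privilege back if policies change.
spark.sql("REVOKE SELECT ON TABLE main.sales.orders FROM `analysts`")
```

Because the statements are standard ANSI SQL, the same policy can be expressed once in Unity Catalog rather than re-implemented in each downstream system.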

Databricks recently added support for lineage tracking, which Zaharia says will be very useful across a range of data assets. “It allows you to set up and track lineage on tables, columns, dashboards, notebooks, jobs – basically anything you can run on the Databricks platform – and see what data they use and who consumes it downstream,” he said.
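Lineage can also be pulled programmatically. The sketch below queries table-level lineage over REST; the endpoint path, workspace URL, token, and table name are assumptions for illustration and should be checked against the current Databricks API documentation.

```python
# Sketch (assumed endpoint, placeholder credentials): fetching
# upstream/downstream lineage for a Unity Catalog table over REST.
import requests

WORKSPACE = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

resp = requests.get(
    f"{WORKSPACE}/api/2.0/lineage-tracking/table-lineage",  # assumed path
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"table_name": "main.sales.orders", "include_entity_lineage": "true"},
)
resp.raise_for_status()

# Each downstream entry names an asset (table, notebook, dashboard)
# that consumes data from this table.
for edge in resp.json().get("downstreams", []):
    print(edge)
```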

Delta Sharing

Companies are starting to step up their data sharing with partners and others. The reason, of course, is the potential to develop better insights and train more powerful AI by augmenting their own data with data from organizations in the same industry. According to Gartner, organizations that participate in a data sharing ecosystem can expect three times the measurable economic benefit of peers who do not share their data.

Databricks Delta Sharing is now GA

The challenge then becomes how to enable data sharing while maintaining some semblance of control over the data and minimizing the need for extensive manual data processing. One mechanism Databricks has created is Delta Sharing, which is another previously announced feature of its Lakehouse that will become GA in the coming weeks.

Delta Sharing allows customers to share data across multiple platforms through a REST API. “Basically, any system that can process Parquet can read data through Delta Sharing,” Zaharia said.

Any customer with a Delta table can share their data, even with recipients on different clouds. All recipients need is a client with a Delta Sharing connector, such as a Spark shell, pandas, or even Power BI, he said. The transfers happen quickly and efficiently, Zaharia said, because they use “a cloud object store feature that lets you give someone temporary access to read a single file.”
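As a rough illustration of the consumer side, the sketch below uses the open source delta-sharing Python connector to list and load a shared table; the profile file and the share, schema, and table names are placeholders that a data provider would supply.

```python
# Sketch (placeholder names): reading a shared Delta table with the
# open source `delta-sharing` connector (pip install delta-sharing).
import delta_sharing

# Credentials file ("profile") issued by the data provider.
profile = "config.share"

# List every table the provider has shared with this recipient.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table straight into a pandas DataFrame; under the
# hood the connector reads Parquet files via the Delta Sharing REST API.
df = delta_sharing.load_as_pandas(f"{profile}#my_share.my_schema.my_table")
print(df.head())
```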

Since the unveiling of Delta Sharing a year ago, usage has started to take off. According to Zaharia, more than 1PB of data is shared every day using Delta Sharing on the Databricks platform.

Marketplace and Cleanrooms

The maturation of Delta Sharing has led to two additional new products: Databricks Marketplace and Cleanrooms.

The new Databricks Marketplace is built on Delta Sharing and will allow anyone with a Delta Sharing-enabled client to buy, sell, and share data and data solutions. The offering will fill gaps left by existing data marketplaces that do not meet the needs of data providers, Zaharia said.

Data clean rooms are emerging as a way to securely share data with other organizations (hvostik/Shutterstock)

“A limitation is that every marketplace is closed,” he said. “It’s for a specific cloud or a specific data warehouse or software platform, because the goal of these vendors is to get more people to use their platform and pay them. That’s fine for those vendors. But if you are a data provider and you have worked hard to create a data set, it is really annoying that you have to publish to up to 10 different platforms just to reach all the users who want to use your data set.”

Databricks Marketplace is also not just about trading data, but also code, such as notebooks, machine learning models, and dashboards, Zaharia said. “We’ve…set it up so that pretty much anything you can build on the Databricks platform, you can publish to the Databricks Marketplace to give someone a complete app,” he said.

Databricks Marketplace will be available in the coming months. The company does not plan to charge any fees at this stage.

Finally, Databricks is launching a new Cleanrooms service, which will also be available in the coming months. According to Databricks, the service will provide a way to share and join data between organizations in a secure, hosted environment.

A key aspect of Cleanrooms, which is also built on Delta Sharing, is that it eliminates the need to manually replicate data. It will allow users to collaborate with their customers and partners on any cloud, with the flexibility not only to share data but also to run computations and workloads using SQL as well as data science tools in Python, R, and Scala.

Related articles:

It’s not ‘Mobile Spark’, but it’s close

Why the Open Sourcing of the Databricks Delta Lake Table Format is a Big Deal

Databricks Unveils Data Sharing, ETL and Governance Solutions

Helen D. Jessen