AWS Lake Formation is a service that, according to Roy Hasson, Principal Product Manager at AWS, is trying to solve three main problems: helping customers build and manage their data lakes quickly, simplifying the management of fine-grained access security on top of the data in a data lake, and sharing the data inside a data lake between AWS accounts.
“Lake Formation,” believes Dipti Borkar, Co-Founder and Chief Product Officer of Ahana, “is going to be the foundational service on top of S3, which will be used by pretty much every mid-market enterprise customer to build out their data lakes and do more with it and get insights from it.”
On top of that, there’s Ahana Cloud, which is a managed service for Presto. Presto is a data lake engine that has become the de facto standard for interactive queries on data lakes. Borkar’s vision was to simplify SQL on S3 data lakes. Presto runs on top of data lakes and, as Borkar puts it, “makes bringing insights out of S3 data lakes very, very easy.”
The announcement of Ahana’s integration with AWS Lake Formation should excite developers who work with data lakes on AWS because it will allow fine-grained access control. Hasson mentions that customers have been asking them for a single place to manage all data permissions within a data lake and Lake Formation brings that to fruition.
“When Lake Formation was launched to provide ability to define fine-grained access controls and data in S3, we started by supporting AWS native services like Amazon Athena, Amazon Redshift, Amazon EMR, but our customers have always told us that they want to be able to use their choice of tool, Ahana and other third-party systems, to be able to access the data in the same way, but in a secure way,” says Hasson.
When asked about data lake trends, Borkar said that data warehouses and data lakes would continue to co-exist for a long period of time. Hasson believes that data lakes will become core to every business.
The summary of the show is written by Jack Wallen.
Swapnil Bhartiya: Hi, this is your host Swapnil Bhartiya, and welcome to TFiR Let’s Talk. Today we have two guests: Dipti Borkar, co-founder and chief product officer at Ahana, and Roy Hasson, principal product manager at AWS. Roy, Dipti, it’s good to have you both on the show.
Roy Hasson: Thanks for having us.
Dipti Borkar: Always a pleasure as well.
Roy Hasson: Yeah, thank you.
Swapnil Bhartiya: Yeah, today we’re going to talk about, of course, Lake Formation and Ahana Cloud for Presto, but before we go there, Roy or Dipti, can you just quickly explain to our viewers what AWS Lake Formation is all about? How does it help users?
Roy Hasson: So AWS Lake Formation is a service that is trying to solve three kind of main problems. The first one is helping customers build and manage their data lakes quickly. So instead of spending months and months and months building and managing data lakes, we want to be able to make it faster and easier for our customers.
The second thing that Lake Formation attempts to address is the ability to simplify managing fine-grained access security on top of the data in your data lake. As data becomes more available to more users, they want to be able to access it in different ways. So how do you secure it using fine-grained controls: column-level, row-level, cell-level security?
And then the third component of Lake Formation is the ability to actually take that data inside of your data lake and share it between AWS accounts. So say you build a data lake inside of one line of business, and you want to share these data sets with other lines of business. Lake Formation makes it really easy for you to share that data in a secure way.
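The column-, row-, and cell-level controls Roy describes can be pictured with a small Python sketch. This is not the Lake Formation API; `DataFilter`, `apply_filter`, and the sample `customers` table are hypothetical names made up for illustration, and the only assumption is that a cell-level rule behaves like a row filter combined with a column list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Row = Dict[str, object]


@dataclass(frozen=True)
class DataFilter:
    """Hypothetical stand-in for a fine-grained permission on one table."""
    columns: Set[str]                     # column-level: which columns are visible
    row_predicate: Callable[[Row], bool]  # row-level: which rows are visible


def apply_filter(rows: List[Row], f: DataFilter) -> List[Row]:
    # Cell-level security is the intersection of the two rules:
    # only the permitted columns of the permitted rows are returned.
    return [
        {col: value for col, value in row.items() if col in f.columns}
        for row in rows
        if f.row_predicate(row)
    ]


# Made-up sample data for illustration only.
customers = [
    {"id": 1, "name": "Ana", "region": "EU", "card_number": "1111"},
    {"id": 2, "name": "Bo", "region": "US", "card_number": "2222"},
    {"id": 3, "name": "Cy", "region": "EU", "card_number": "3333"},
]

# An analyst who may see EU rows only, and never the card_number column.
eu_analyst = DataFilter(
    columns={"id", "name", "region"},
    row_predicate=lambda row: row["region"] == "EU",
)

visible = apply_filter(customers, eu_analyst)
```

In this toy model the analyst gets back the two EU rows without the `card_number` column, which is roughly what "cell-level" means in the conversation: a restriction on rows and columns at the same time.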
Swapnil Bhartiya: Excellent. Thanks for explaining that. Dipti, now let’s really talk about Ahana Cloud for Presto. Of course, we have talked about it so much here at TFiR, but it’s always good to have a refresher.
Dipti Borkar: Absolutely. Ahana is a managed service for Presto. Presto was created at Facebook, now Meta. It is a data lake engine, and it is the de facto engine for interactive queries on data lakes. And that makes this integration of Ahana and Lake Formation really a very good fit. The vision that I had for Ahana was to simplify SQL on S3 data lakes, as well as other clouds. But S3 and AWS is where Ahana runs, that’s what it’s built for, and we’ll talk more about the integration itself.
Swapnil Bhartiya: And yeah, let’s jump into the integration. Roy was talking about looking at and solving three problems. Where does Ahana Cloud for Presto fit into the picture and, through this integration, what part of the problem are you trying to solve for users?
Dipti Borkar: Ahana runs on top. So Ahana Presto runs on top of data lakes. It makes bringing insights out of S3 data lakes very, very easy. However, users still first need to build their data lakes, and that’s where Lake Formation comes in. As more and more data lakes get built out, the governance and the security of these open data lakes, in particular with open formats and open source, becomes really very important.
And Lake Formation, we believe, is going to be the foundational service on top of S3, which will be used by pretty much every mid-market enterprise customer to build out their data lakes and do more with it and get insights from it. On top of Lake Formation, you then have the SQL engine that runs, which is Ahana Presto.
Roy Hasson: Yeah, and I think that’s right on. And just to kind of add to that, when Lake Formation launched and provided the ability to define fine-grained access controls on data in S3, we started by supporting AWS native services like Amazon Athena, Amazon Redshift, Amazon EMR, but our customers have always told us that they want to be able to use their choice of tool, Ahana and other third-party systems, to be able to access the data in the same way, but in a secure way.
And that’s really what this is about: the ability for Lake Formation to expose a new set of APIs that allow engines like Ahana to take advantage of Lake Formation security and enable their customers to still query the data in S3 through those engines, using fine-grained access controls managed in Lake Formation.
Swapnil Bhartiya: Dipti, of course, a lot of Ahana Presto users were already leveraging AWS. How does this announcement make their life easier and better? Talk about the improvements that are there.
Dipti Borkar: Absolutely. Data lakes will be the de facto standard for analytics over the next 10 years. I mean, we’ve all been talking about this. However, when it comes to customers, they want simplicity for their access control, right? Their fine-grained access control.
What does this mean? Does Roy have access to the customer table? Does Dipti have access to the users table? This is defined in Lake Formation, and Ahana Presto now exposes this very, very easily and allows customers to add this level of security at the cell level and at the row level, which is a really deep integration into this level of security. This was not possible before.
Customers like Metropolis and others are now starting to use Lake Formation. They now have the ability to really go very fine-grained on their security and their permissions and allow only the users that should truly have access to the data to have access to it.
Roy Hasson: And just to add to that, right, from a simplicity perspective, what customers have been asking us for is a single place to manage all your permissions for data in your data lake in S3. So using Lake Formation, you have that single place that you can go to, and you define your permissions in that single place.
And then you are able to access the data through Ahana, through Amazon Athena, EMR, et cetera, without really having to worry about: does my user have the right permission to the data? Do I need to go to this other system and map or configure new permissions? I don’t have to really worry about it. I have a single place where I consistently manage those permissions, and those permissions apply regardless of the tool that my users want to use to access that data.
So that’s a really powerful sort of enablement for our customers to be able to actually take their data lakes and actually start scaling it and to get more use of it. Where in the past, when we talked to customers, they said, hey, we built data lakes, but nobody’s really using them, right? They’re all still using their choice of tool. They’re copying data, it’s a real problem.
We want access to more tools, and having security inside of Lake Formation in a single place really enables these customers to start doing more with their data.
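The "single place" idea Roy describes, where permissions are defined once and apply regardless of the query tool, can be sketched as follows. This is a toy model and not how Lake Formation or Ahana are implemented; `PermissionCatalog`, `QueryEngine`, and the engine, user, and table names are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


class PermissionCatalog:
    """Hypothetical central store: (user, table) -> columns the user may read."""

    def __init__(self) -> None:
        self._grants: Dict[Tuple[str, str], Set[str]] = {}

    def grant(self, user: str, table: str, columns: Set[str]) -> None:
        self._grants.setdefault((user, table), set()).update(columns)

    def allowed_columns(self, user: str, table: str) -> Set[str]:
        # Return a copy so callers cannot modify the stored grants.
        return set(self._grants.get((user, table), set()))


class QueryEngine:
    """Hypothetical engine that defers every access decision to the catalog."""

    def __init__(self, name: str, catalog: PermissionCatalog) -> None:
        self.name = name
        self.catalog = catalog

    def select(self, user: str, table: str, columns: List[str]) -> str:
        allowed = self.catalog.allowed_columns(user, table)
        denied = [c for c in columns if c not in allowed]
        if denied:
            raise PermissionError(f"{user} may not read {denied} from {table}")
        return f"{self.name}: SELECT {', '.join(columns)} FROM {table}"


# Permissions are defined once, in one place.
catalog = PermissionCatalog()
catalog.grant("roy", "customer", {"id", "name"})

# Two different engines share the same catalog, so the same rule applies to both.
engine_a = QueryEngine("engine_a", catalog)
engine_b = QueryEngine("engine_b", catalog)

plan_a = engine_a.select("roy", "customer", ["id", "name"])
plan_b = engine_b.select("roy", "customer", ["id", "name"])
```

Because neither engine keeps its own copy of the rules, a request for a column that was never granted fails the same way in both, and a new grant takes effect everywhere at once.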
Swapnil Bhartiya: Dipti, you mentioned that in the future more and more things will be built on top of data lakes. We talk about data warehouses and we talk about data lakes, and both have advantages, but moving and messing with data, as Roy mentioned, can be a big challenge.
So what kind of trend are you seeing? Where are people moving? Because data is the new oil, if I’m not wrong. And data itself has no value; the value is the analytics and AI that you pull from it, and that is where you have to do all the work. And so if you keep moving it back and forth, that is not going to work. So please tell me what trends you see towards data lakes.
Dipti Borkar: Absolutely, happy to go first. I see data warehouses and data lakes coexisting for a long period of time. Over time, there will be new stacks that emerge. The data lake is a new, modern stack where you can really do more with your data over time, with the data in a single location, which is S3. Our recording right here is going into S3. There’s metadata that’s collected on top of it that can be analyzed.
And so there is a lot of information that is already landing in S3. Now that it’s there, we can have many different types of processing on top of it. One kind of processing is SQL, and that’s where Ahana fits in. There is machine learning and AI with TensorFlow, PyTorch, and many, many other systems, Spark for transformation, and many other approaches that all work on a single storage system.
And that is the power of data lakes. If they’re open, these open formats of Parquet, ORC, and others can be used across many different types of processing. And at the end of the day, the customer, the enterprise that’s data driven, gets the most value out of this information over time by having many different kinds of processing on top of it.
Now, that said, there is an element of ease of use and speed, because we’ve had big data 101, if you will, or 1.0, with Hadoop, where there was a lot of tech, but it took a long time to see the value of it.
With this next generation, Ahana and managed services make it easy. The operational overhead is offloaded to the managed service, and SQL engines like Ahana then bring that value out of these data lakes for different workloads. It might be SQL, it might be machine learning and others. And I see this as the next 10 years of data: it will coexist with data warehouses, and over time more and more analytics will be on open data lakes.
Roy Hasson: Yeah, I totally agree. I mean, I think you hit all the points. I think ultimately it comes down to breaking down sort of the data silos and bringing the data into a place that is highly, highly scalable, durable, available, and is well integrated with pretty much everything.
If you talk about an object store like Amazon S3, there are very few tools or services out there that don’t already know how to talk to S3. And then whether you use Ahana on top of that or you use a data warehouse like Redshift on top of that data, it’s really your choice. It’s up to your use case and what you want to do.
Locking that data into one particular data store because it’s a little bit easier, or because it may be a little cheaper, makes sense in some scenarios. But as you start scaling, your users come back and say, I want to do more with my data.
You know, I don’t want to wait for this tool to give me the next feature that I need, and it’s going to come in a year. I want to go pick up this open source thing over there and I want to go play with it. Data lakes give you that flexibility and that agility. When you put the data into a single system, it may look like that system is flexible and accessible, and that may be okay, but I think it also limits your ability to move quickly and do more with the data.
So I really see data lakes, over the next several years, just becoming more and more core to every business, as we build some of these new features on top of the data lake, like ACID transactions and updates and deletes and time travel. These are things we’re all used to, and we love in databases, but now we’re starting to build those things on top of a data lake in a much more scalable and distributed manner.
So I think over time, you’re going to see more and more customers and systems just moving to using a data lake as the de facto storage and data management system.
Swapnil Bhartiya: Since you’re talking about trends going into the future, there is also a lot of talk about crypto and UDB kind of databases, where you do want to ensure that all the transactions are being tracked. We used to say, hey, encryption is only for business, but now everything is end-to-end encrypted, even messaging. So what kind of trend do you see in the database space when it comes to, hey, you know what, it’s tamper-proof? Do you see anything there, Dipti, Roy, or is that out of context right now?
Dipti Borkar: Yeah, I think that the OLTP and the analytical OLAP systems are very different. And that’s where you have polyglot persistence, or you have the right database for the right purpose, the right job. OLTP systems, operational systems, whether they need transactions or not, high-throughput, massively high-throughput, microsecond-latency systems will always exist.
But then you have to move that data into one place and look at the analytics on it and analyze how to leverage that information and do something with it. The product manager can create new products. The marketer can create new campaigns. The salesperson can understand what territories and customers to attract and that’s what analytics is about.
It is about the business and making the most of creating new businesses or expanding existing businesses. And that is where data lakes fit in. I think the innovation on the OLTP operational side will continue to grow. That’s where the crypto, Bitcoin, all of these things come in, and there will always be databases at the microsecond latency level.
But that’s a different game from the five-way, ten-way joins across many different tables, star schema, snowflake schema, which is what analytics is about: getting deep information, deep insights out of data on a historical basis or a correlated basis.
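To make the "multi-way joins across a star schema" idea concrete, here is a minimal sketch using Python’s built-in `sqlite3` module as a stand-in for a SQL engine such as Presto. The table names, columns, and values are all invented for illustration; a real analytical query would typically join more tables and far more rows.

```python
import sqlite3

# In-memory database standing in for a data lake plus SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (
    customer_id INTEGER, product_id INTEGER, date_id INTEGER, amount REAL
);
INSERT INTO dim_customer VALUES (1, 'EU'), (2, 'US');
INSERT INTO dim_product  VALUES (10, 'books'), (11, 'games');
INSERT INTO dim_date     VALUES (100, 2020), (101, 2021);
INSERT INTO fact_sales VALUES
    (1, 10, 100, 20.0),
    (1, 11, 101, 50.0),
    (2, 10, 101, 30.0),
    (2, 11, 101, 40.0);
""")

# Star schema: one central fact table joined to each dimension table,
# then aggregated to answer a historical question.
rows = conn.execute("""
SELECT c.region, p.category, d.year, SUM(f.amount) AS total
FROM fact_sales f
JOIN dim_customer c ON c.customer_id = f.customer_id
JOIN dim_product  p ON p.product_id  = f.product_id
JOIN dim_date     d ON d.date_id     = f.date_id
GROUP BY c.region, p.category, d.year
ORDER BY c.region, p.category, d.year
""").fetchall()

for region, category, year, total in rows:
    print(region, category, year, total)
```

The contrast with an OLTP workload is that nothing here looks up or updates a single record quickly; the query scans and correlates several tables to produce a summary.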
Roy Hasson: So just to add to that, and I’ll take it in a slightly different direction. Everything Dipti said is right on, right, with the database and OLTP and OLAP. A pattern I’m kind of starting to see more and more is that traditionally analytics was sort of done, I would say, out of band, out of band of the transactional systems. And when we look at the data, maybe it’s real-time data coming in or batch data coming in, it doesn’t really matter.
I’m kind of looking at it and making decisions after the fact, and what we’re starting to see now is that transitioning: analytics transitioning towards being in line with the business. So in a transactional system, I go click a button and I purchase something. A transactional system handles that, stores that data, and moves that forward. Analytics gets slotted into that path as well.
So I purchase that product, and the next step is the data gets stored and then it triggers some ML algorithm or some other system that goes and figures out the best way to get that package to me. Should I be shipping it from this warehouse over here or this warehouse over there?
So both analytics and machine learning are starting to become less about reporting and analysis and more about quickly analyzing the situation and providing the right answer, the right direction to take the next step in that pipeline.
So I’m definitely seeing more and more of that. That’s why we’re starting to see databases like Apache Pinot that’s taking that data and really giving you that fast response, because you want your business systems to be driving decisions based on that data very quickly.
Dipti Borkar: Absolutely, and Presto allows for that federation across Pinot and data lakes and gives you that insight across many different systems.
Swapnil Bhartiya: Perfect. Thanks for taking that question and addressing it. Now, let us go back to our initial discussion, which was about Ahana Cloud for Presto. Can you share some of the exciting use cases that you have seen, which further validate this integration?
Dipti Borkar: Absolutely. So our Ahana customer Metropolis is a joint customer of AWS as well as Ahana. They are an early user of Lake Formation, and they’re in the mobility space. They enable parking and many different mobility services across many cities and metropolises, just like the name says.
And with this integration of Ahana and Lake Formation, now they can enable fine-grained access control, as we’ve talked about, at a much more granular level. That gives the data platform team the peace of mind that only the right people, the right agencies, the right companies have access to the data that they should have access to. And as they grow across many, many more cities, they will have the ability to have this fine-grained access control and peace of mind.
Swapnil Bhartiya: Excellent. Now, of course, you folks cannot reveal a lot of stuff. You cannot talk about a lot of it while it’s still in the pipeline, but if I may ask, what are the things that you folks are working on? What should we expect next?
Roy Hasson: So at re:Invent, actually, a couple of weeks ago almost, we announced the release of row- and cell-level security, which is further advancing the security of Lake Formation. The other thing that we announced is something that we call governed tables. So again, this is back to one of the core problems that Lake Formation tries to solve, making data easier to manage in S3. These governed tables give you the option to ingest data into the data lake using ACID transactions.
We also have automatic small file compaction that manages the data behind the scenes and optimizes it for better performance. So these are some of the big things that we announced at re:Invent. And of course, we’re going to keep working throughout 2022 to enhance those capabilities and add more features for our customers, to make data ingestion and data management easier, but also to enhance the security posture of your data lake and make defining, auditing, and managing your data security easier for our customers.
Dipti Borkar: Absolutely, and we’ve integrated with these APIs for cell-level and row-level integration, which gives that fine-grained access control to our users. As AWS builds more, we will go deeper and deeper and build out those deeper integrations. At PrestoCon last week we announced this integration, and now it’s available for our customers to come in and try it out.
And so my call to action is: give it a try. If you are in the process of building a lakehouse, build it right, build it with governance and security built in. Now this is available with Lake Formation, with SQL analytics on top that’s easy to use with Ahana.
It really is very, very simple. It can be integrated in 15 minutes. This would take months and months of effort to get set up in the past, as we know with databases. Now, with this ease of use, as a data platform team you can be a two-, three-, or four-person team and still get the best value and the key features like security built in.
Swapnil Bhartiya: Roy, Dipti, thank you so much for taking time out today to talk about, of course, Lake Formation, Ahana, and all the trends that you’re focusing on. I would love to have you back on the show. Thank you.
Dipti Borkar: Great. Thanks so much, Swapnil. Thanks, Roy.
Roy Hasson: Thank you. Thank you, Swapnil.
Dipti Borkar: Thanks again.