Security Panel – The Cybersecurity Show – S1Ep10
ANDREW MCLEAN [00:00:29] And welcome to this episode of Security Panel, the greatest cybersecurity show in the universe, supported by Celerity. Today, we're gonna be talking about open source data security. And it is my absolute pleasure to be joined by technology geek and superhero Moya Brannan from the Egeria Project. Welcome. Well, like all superheroes, could you tell us your origin story?
MOYA BRANNAN [00:01:01] Oh, my gosh. Okay, so geek, that's an interesting one. So I'm currently working as an open source contributor on the Egeria Project, which is part of the Linux Foundation under the Open Data Platform Initiative. We'll talk about it in a little bit, but it's all about metadata. I've got to say, it's one of the most exciting parts of my career. It's a completely different world and very interesting. I'm working on that project, and there's a whole host of companies contributing. I'm actually funded through IBM; I am an IBM employee. I've been at IBM for quite a while. I came through, I think, the acquisition of Informix. So you can see there's quite a background there of databases and analytics and such. Prior to that, again, more data, more analytics. And for my sins, I studied machine learning over 25 years ago, which at the time was exceptionally unpopular, but of course is all the rage now. I've also worked as an architect at IBM, but never massively in security. It's actually something that's quite interesting, and it's something I'm working on currently. And although Egeria is very much focussed on metadata, security is something that plays into that on a frequent basis, because data is an exceptionally important asset to any company. Many companies use it to monetise. I think most companies want to use it to enable them to become data driven. It holds so much value. And the last thing any company wants is to have that data abused. They want to ensure that staff aren't using it incorrectly, because there are many regulations that surround data. But the problem is there's a huge amount of data. We're always being told how much data there is in the world. And organisations will often find themselves in the situation where they don't understand all their data assets.
And when I talk about data assets, I'm not just talking about data that we use for analytics and business information, but data throughout the entire organisation, because every application, everything that you do within an organisation, even security monitoring, generates data. And that data has to live or reside somewhere. And there's always someone wanting to analyse and utilise various parts of that data. Over the past decades, we've had an evolution in the ways that we handle, look after and make use of that data. And so a lot of organisations have got exceptionally complex data landscapes. And it's not just through the evolution of the business, but also through acquisitions and company mergers. You'll also find a lot of departments may go off and buy side IT and bring in more capabilities that aren't necessarily completely governed. With just the sprawl and the growth, many companies think it's too much of a headache, as they say, to actually manage all the data assets they've got. So Egeria, the project I'm working on, seeks to resolve those issues.
ANDREW MCLEAN [00:04:34] We'll come on to Egeria in a second. Whilst we're still on the topic of open source, do you think companies approach open source with different considerations regarding their cybersecurity than they would maybe a closed system?
MOYA BRANNAN [00:04:50] So open source, it's been very interesting, actually, going from a commercial company to working on an open source project. Over the past 10 to 15 years I've been working as an architect directly with customers of IBM. You know, 10 or 15 years ago, open source was very much frowned on, as everybody knows. And then, of course, companies have become happy to adopt it, provided they can get support in place for it and there are services available to deploy it. So I know with Egeria, companies are very excited to talk to us about bringing open source in. The caveat that they always have is: who will support this? Can you support this? And, you know, will this be around? Being a Linux Foundation project is a fantastic thing for us. And of course, it is feasible to get support contracts for that. There are some interesting things about open source, because the way it's driven is very different from how commercial products are put together. A commercial product will have a very stringent roadmap, will have very definite goals, and will be driven by certain crucial capabilities that need to appear in it, whereas open source, of course, is very much developed by the people who contribute to it. The contributors could be organisations. For Egeria it's ING Bank, SAS Institute, Cloudera and IBM, and the people contributing will drive the direction and the capabilities that appear according to what's of importance to them. One of the things that's very important for Egeria is lineage, so we've seen that develop quite significantly. So in certain areas, it's very innovative. But again, it's open source. So one of the things that's really important for us with Egeria is that anyone can take this code.
And so as we're developing it, we're wanting organisations to take the Egeria code and embed it in their applications. And we've seen this start to happen. Partly, organisations want to use it to unify their landscape across all their data assets, but other organisations want their individual application to use it as a catalogue capability or a metadata hub to hold their application information, such that they can then share it with other applications and tools. So there's more acceptance, more readiness. And I suppose the issue is, if you're going for something that's open source, unless you're going to start contributing as a company, which a lot of organisations will, you're possibly not directing the open source, because it's a community-led thing. Whereas if you're working with a commercial vendor, you've got the ability to demand: I must have feature X and Y, and I must have this. And you've got a very definite roadmap. You know exactly where you're going with that capability. So two very different styles, and both work. And I think there's a huge complement that you can have between the two.
ANDREW MCLEAN [00:08:15] So you've mentioned the Egeria project a few times. Well, let's just start at the beginning. What is the Egeria project?
MOYA BRANNAN [00:08:23] So Egeria is a way to look at resolving all those data issues. I mentioned a lot of data issues before, and what happens within an organisation. So you have all these lovely applications, all these data stores, and they're all effectively like silos. And sharing information between them is not the simplest thing to do, because there's no open standard. And the way that data has moved with analytics is, there was a huge movement to go from data warehouses, operational data stores, all sorts of things, to looking at data lakes, which would house and govern all the data assets. However, I think there are a lot of very interesting patterns and a lot of interesting ways that these have been implemented. The cornerstone of a data lake really should be the catalogue that identifies where all the assets exist within the data lake, but it's not always the case that it's there. And often we find that organisations will have multiple data lakes. When this happens, the data lakes don't necessarily communicate with each other and don't have a link between them. But an organisation at some point will have to report across all these various separated items. So Egeria seeks to bring an open standard to enable metadata exchange between all these silos, repositories, data lakes and applications, so not just within the BI and analytics area, but across the entire organisation. And we do this through a distributed network. We have an event bus in the middle. It allows us to bidirectionally share metadata. So if something changes over here, we can be notified of it and make changes to our application here in the metadata layer to reflect that. It allows us to do ultimate lineage analysis, from the inception of a piece of data through all the various points it will touch in an organisation, and trace it to all its various resting points, because it might fan out to multiple locations, and we can track it as it goes through processes.
So if you've got transformations in there, we can track that. We've been talking to a blockchain team so we can look at creating metadata for blockchain as well. And we also then have information about the provenance of the data. So it's a bit like, if you've seen the programme Fake or Fortune, the provenance of a piece of artwork is very important. In just the same way, the provenance of your data is very important. So if you're a data scientist, first of all, you can use a capability like Egeria to identify where the data is, because you have a data catalogue. And then once you've found the data, you can understand its provenance. How good is that data? Is this relevant data to me? And if you're willing to look at where it's travelled to get to that point in time, you can ask: has it been transformed? Is this the raw data? Is this going to be appropriate for my analysis and research?
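The lineage tracing described in this answer can be pictured as a walk over a graph of assets and processes. The following is only a minimal sketch of that idea, not Egeria's API: the asset names, the process names and the functions `downstream` and `resting_points` are all hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: (source asset, process, target asset).
LINEAGE = [
    ("crm.customers", "nightly_extract", "staging.customers"),
    ("staging.customers", "mask_pii", "lake.customers_masked"),
    ("staging.customers", "aggregate", "warehouse.customer_counts"),
]


def downstream(edges, start):
    """Map every asset reachable from `start` to the chain of processes that led to it."""
    children = {}
    for source, process, target in edges:
        children.setdefault(source, []).append((process, target))

    paths = {}
    queue = deque([(start, [])])
    while queue:
        asset, path = queue.popleft()
        for process, target in children.get(asset, []):
            if target not in paths:  # keep the first (shortest) path found
                paths[target] = path + [process]
                queue.append((target, paths[target]))
    return paths


def resting_points(edges, start):
    """Return the reachable assets that feed nothing further: the data's final locations."""
    sources = {source for source, _, _ in edges}
    return sorted(asset for asset in downstream(edges, start) if asset not in sources)
```

Calling `resting_points(LINEAGE, "crm.customers")` lists the two assets where the customer data finally lands, and `downstream` shows which transformations it went through on the way, which is the kind of question a data scientist would ask about provenance.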
ANDREW MCLEAN [00:11:24] I mean, if you're an organisation and you've got a new CIO or a new CTO and they look at the data, do some people just look at all the data they've got and just go, "Oh"?
MOYA BRANNAN [00:11:35] Yes, of course. And actually, it's quite interesting. So far, I've talked about the technical end, where we map out all the data assets. So once we have all the data assets, we've got them all identified and everything. In parallel to that, we can create a glossary on top of this, like a business catalogue, which makes sense of that data. Once you've got a business glossary on top of it, you can then use terms that you understand. So, for example, at IBM we've got industry models. If you look at Salesforce, they've got a CIM model. These are sort of predefined glossaries that companies can use. The Salesforce one is available, I think it's open source, and you can take that with its thousand items and map that in through Egeria, such that you then have business definitions for all your data assets. And through that, you can trace them through the enterprise. But what's really interesting is, I mean, Egeria is not a security tool. We'll come on to talk about how we've done some work with some open source security that enables access policies to be created around the data. But if we look at using Egeria, once you have a glossary and you have all your assets and you've mapped out the entire landscape, then, as we've got auditing capabilities and various users that we can identify, it gives you the ability to see who can access what data. And not just who can access what data: you can see what data they did access, when they accessed it, and what other data assets they accessed. So, you know, data stores will each hold certain amounts of information. And if you combine them together, the information they hold could be very sensitive.
So, for example, maybe you've got some customer information over here, maybe some product information, maybe some customer banking information, and this data store has got some of its information redacted. But if you've got someone who is doing some research and they start adding all these together, the picture that they can build from that could be quite interesting. So having the capability to stitch together all these items and then explore through them gives you the capability to investigate. For example, if you had fraud: who was looking at this? What else did they look at with this? When did they do this? What picture might they have built through their access to all these data assets? And if you understand all the data assets, you've got a very clear, quick way to explore and investigate that. It's quite interesting. On the Egeria site, we've got a whole set of personas and use cases, and it talks through how fraud can be identified and how you can use the Egeria tooling to explore your metadata. It's a bit like an interactive map of every data asset that you have within the organisation.
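The investigation questions above (who looked at this, and what else did they look at with it) amount to queries over an access audit log. The following is a minimal sketch under assumptions: the log layout, the user and asset names, and the functions `accessed_with` and `touched_all` are hypothetical and are not Egeria's audit interface.

```python
from datetime import datetime

# Hypothetical audit records: who touched which data asset, and when.
ACCESS_LOG = [
    {"user": "alice", "asset": "customer_profile", "at": datetime(2020, 3, 1, 9, 0)},
    {"user": "alice", "asset": "product_catalogue", "at": datetime(2020, 3, 1, 9, 30)},
    {"user": "alice", "asset": "customer_banking", "at": datetime(2020, 3, 2, 14, 0)},
    {"user": "bob", "asset": "product_catalogue", "at": datetime(2020, 3, 1, 10, 0)},
    {"user": "carol", "asset": "customer_banking", "at": datetime(2020, 3, 3, 11, 0)},
]


def accessed_with(log, asset):
    """For each user who touched `asset`, list the other assets that user also touched."""
    users = {entry["user"] for entry in log if entry["asset"] == asset}
    return {
        user: sorted(
            {e["asset"] for e in log if e["user"] == user and e["asset"] != asset}
        )
        for user in sorted(users)
    }


def touched_all(log, combination):
    """Return users who accessed every asset in a combination judged sensitive together."""
    seen = {}
    for entry in log:
        seen.setdefault(entry["user"], set()).add(entry["asset"])
    return sorted(user for user, assets in seen.items() if set(combination) <= assets)
```

Here `touched_all(ACCESS_LOG, {"customer_profile", "customer_banking"})` picks out the one user who could have stitched the two stores together, even though each store on its own may look harmless.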
ANDREW MCLEAN [00:14:41] I mean, if an organisation just has a lot of unstructured data, where would they even start? How would you even start to catalogue or streamline this?
MOYA BRANNAN [00:14:54] So yeah, you've got to start somewhere. We've got a lot of capabilities for doing metadata discovery and capabilities to build this out. But yes, it is a process of pulling this in. I mean, every application that you have has metadata in it. If you look at a database, it has a schema, and that's metadata about the data, which you can then go and enrich. If you look at a transformation job, you've got metadata about that. If you look at something like Business Objects, you've got metadata underneath that, called the universe, which is a catalogue. So every application out there does have a metadata repository, and that's how you can access it. But if we just take it back to security, one of the things that we were very keen to look at, and to use metadata to drive, is the actual access to data. So when you've got this data landscape, you need to be able to put good governance in place. And one of the things that's quite interesting is when you have data and you tag that data, you might say: well, this is sensitive. This is salary information; we're not going to let anyone ever see that salary information. And this is the person's date of birth; we're never going to let anyone see that. So that data might get marked as redacted. And maybe there's only a very select few people who have access to it. So you will create roles for access, and these things are very much black and white. You can see this data and this data, but not this data and this data. But the reality is, when people are actually doing research and analysis, there may be occasions where they need to see combinations of data that you've not given them access to. So we were looking at some other open source tools. We looked at one called Palisade, which allows us to do context-based security access.
So when I say context-based, it's looking at a combination of things to determine if that access should be granted or not. So it's the combination of the data items together, and possibly the location, because if the location's a foreign country then, you know, we're not going to allow the people in the French team to look at the Italian data and vice versa. But if they're in the same country, then they can look at their own country's data. So we look at the combination of data items, we look at the location, and you can add in other things. You could add in, well, of course, the identity of the user. You can add in time as well, the time of year or whatever. So when we look at the context, we take into consideration all these factors to determine if it's appropriate for that person to have access to the data. I'll give you a scenario. Imagine a data scientist is doing two sets of analysis. This data scientist has got a set level of what they're allowed to see. They're not allowed to see salary. They're not allowed to see date of birth, and they're perhaps not allowed to see the person's joining date, so when they came to the company. The first analysis that they're doing is on salary bias. So, what is the salary bias for the employees at this company? And to do that, the first thing they need to know is the employee's salary. Because if they haven't got access to the salary, they can't work out: is he paid more than this lady over here, or is this lady paid more than him? And is this group of people here more likely to have a higher salary than this group over here? So it's key information that they'll need. But they don't need the individuals' identities. So they wouldn't need to know, you know, their name or ID as an employee.
They wouldn't need to know their home address, because that would identify them. So all the features that would link the data to the individual are things that they don't need, like the mobile number, whereas it would be useful for them to know their gender, perhaps their qualification level, their length of time serving at the company, and the type of job they're working in, so they can actually perform the analysis. So if they'd gone and requested that, they would perhaps only have got half the data back, because the system would have said: no, you can't have this data. And the way to get around that would be to then put in a special request to get that data. Now imagine the next query that person wants to do is a bit of research to identify a list of people who are eligible for a free medical health check-up. These people have to have been working at the company for over five years, and then they can have a free check-up. To do this, they'll need to look at the person's name, their email address so they can contact them, and their start date. And what you'll find is that some of the data they're now looking at will overlap with the salary query, and some of it will overlap with what they're actually authorised to access. But each query on its own, independently, is a set of data that's not going to tell them anything that they shouldn't know about the individual. So what happens with context is that it takes into consideration, when they request their data, what the request is for. They'll submit a context with it, such that they can then be told: right, you can have this specific set of data, because it's what satisfies your context. Where we use Egeria for this is for the security officer, when he's setting up all the various rights and configuring the context.
Because within Egeria we'll have already gone through and identified all the data assets, we have the glossary of where all the data assets exist, and we also have asset owner classifications. So the data, as I mentioned before, we can actually classify with labels or tags to indicate the sensitive nature of that data, and an item can have multiple tags to indicate in what way it is sensitive and whether it adheres to any regulations. So the security officer's job is much easier: to identify that data and say what is good and what is bad. But also, he, or she, sorry, can then come up with a selection of purposes as to why access should be given or not given, which makes for a very flexible way to access your data. And also, with it being driven by metadata, as and when metadata changes in the data systems, you will then have that information updated. So the classifications and the tags will be updated, so that the security team can stay in line with what's going on in the data landscape. Because how often do we talk about silos within organisations, with different groups of people? Here we've got a group of security officers who can see all the information, or the metadata, about the data assets, and understand where it lives, where it resides, and what data stores it exists in. It has a glossary through which they can understand it from a business point of view as well. And they can dynamically see how that landscape is changing on a minute-by-minute basis.
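The two data-scientist queries above can be turned into a small worked example of a context-based decision: the fields returned depend on the stated purpose and on location, not on a fixed role. The following is only an illustrative sketch and is not Palisade's API; the purpose names, the field names, the `Request` class and the `decide` function are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical purposes a security officer might configure, each mapped to the
# fields that purpose justifies. Field names follow the two scenarios described.
PURPOSES = {
    "salary_bias_study": {
        "salary", "gender", "qualification_level", "years_of_service", "job_type",
    },
    "health_check_eligibility": {"name", "email", "start_date"},
}


@dataclass(frozen=True)
class Request:
    user_country: str   # where the requester is located
    data_country: str   # which country's data is being requested
    purpose: str        # the context submitted with the query
    fields: frozenset   # the fields the query asks for


def decide(request):
    """Return the subset of requested fields that this context permits."""
    # Location rule from the discussion: teams only see their own country's data.
    if request.user_country != request.data_country:
        return set()
    allowed = PURPOSES.get(request.purpose, set())  # unknown purpose grants nothing
    return set(request.fields) & allowed
```

For the salary bias study, a request for `salary`, `gender` and `name` would come back with `salary` and `gender` only, because the name is an identifying field that purpose does not need. The same person asking under `health_check_eligibility` would get `name`, `email` and `start_date` but no salary.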
ANDREW MCLEAN [00:22:08] Well, I mean, a lot of this is healthy day-to-day access and how people access data day to day. What about things a little bit more pernicious? What about the insider threat, or fraud, or people within an organisation trying to steal data to sell to a competitor, or something like that? I mean, you've got security levels, but with the actual data itself, how do you deal with it?
MOYA BRANNAN [00:22:37] How do you deal with a security breach? Yeah. So, I mean, we touched on it a little bit before, talking about being able to see who's done what. So, of course, there's this complete logging of activities within the data landscape. And one of the things that was actually very important for ING when they were contributing is to be able to understand the lineage, and not just the lineage as it is today, but to be able to go to a point in time and do a historical search of the lineage of the data. So to understand where the data's gone, where it's come from, and whether it's sensitive. But then also to be able to go in and look at who was the person that accessed that data, when they accessed it, and what other data they looked at with it. So it gives them visibility of who has access. All these people have access. So of all those people who have had access, who has touched this data? Who has looked at this data? Who has manipulated it or, you know, done some interrogation with it? And when did that happen? So it's not just about who has access, but who has actually done something, and then going on to investigate. What else have they been doing? What else have they looked at? So it's almost like being able to travel through time to a point and then investigate from that point, which is one of the capabilities we're working on and that we're bringing out in Egeria. And it's one that ING was very keen on having there, because security is a very key thing and you need to understand who's working with your data. And, you know, have they abused the trust that you put in them or not? Because data holds so much information that we need to know. And at some point you need that flexibility to be able to get to the data, to be able to utilise it for creating reports on the organisation and for doing research.
And, you know, this is where we'll have security levels which are possibly elevated. If we can reduce that down by context, as I was mentioning, which we looked at with Palisade, that is one thing to do. But being able to have complete auditability and traceability throughout your data, and an understanding of it, gives you another level of security. It might not prevent people from actually doing that, but you can then trace back and work out what's going on and what's happening. I mean, it would be feasible, I suppose, to set up alerts whenever there is access. But that's for another time.
ANDREW MCLEAN [00:25:15] You've got a lot of data to sift through. But speaking of data, should we mention the elephant in the room, the four-letter acronym that brings fear into the hearts of every CIO out there: GDPR.
MOYA BRANNAN [00:25:28] Yes, I think you should, it's a good one, because we talked about the glossary, which sits over the top of Egeria and gives us the business definitions of everything in the enterprise. And having a glossary is very powerful, because if you define something such as customer, or a type of customer, gold star customer, silver customer, bronze customer, then the glossary is mapped to the technical metadata underneath, which will identify the locations where that data actually exists. So through the metadata, you can identify all the locations where an item of data exists in your organisation. So, you know, with GDPR, it's all about people, and someone comes along and says: remove me from all your systems under the GDPR regulations. Then you've actually got a way of seeing: yes, this is exactly where this person exists throughout my organisation. It's not going to remove them for you, but it will give you a view of every location, so you can then start tackling that, because you understand where your data resides. It might also be a case that you wanted to audit something or you wanted to search for information. And actually, one thing that's really interesting: a lot of organisations have systems out there that have been there for years. They just run in the background. They sit in the corner of the data centre. And not everyone's absolutely certain of what they do, what's going in and what's coming out. This is a really good way to deal with that. If you've got that data mapped through Egeria, you can begin to understand those systems, understand what data is going through them, and determine: do we need that system anymore? And it has been used a couple of times to work out what systems can be decommissioned from that data landscape.
And again, if you're integrating between multiple companies, having that view across the combined landscape is something that I think a lot of organisations just don't have at the moment. So it's all about visibility. Once you can see where all your data assets are and understand what the relationships are between them, and you're doing this through the metadata layer or the glossary, it gives you a huge amount of ability to execute: for your analysts to do their research, for your people carrying out machine learning on your data to locate data quickly, easily validate it and utilise it, and hopefully to do that in a safe way, because you're auditing them. And if you want to put context-based security over the top of it, you're making sure that they've got a flexible access route to that data.
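The GDPR point above rests on one lookup: from a business glossary term down to every technical location mapped to it. The following is a minimal sketch of that lookup, assuming a simple dictionary layout; the terms, the store names and the `locations_for` function are hypothetical and are not Egeria's glossary model.

```python
# Hypothetical glossary: business term -> technical locations mapped to it.
GLOSSARY = {
    "Customer": ["crm.customers", "billing.accounts"],
    "Gold Star Customer": ["crm.loyalty_tiers"],
    "Product": ["catalogue.products"],
}

# Hypothetical term hierarchy: broader term -> narrower terms.
NARROWER = {"Customer": ["Gold Star Customer"]}


def locations_for(term):
    """Return every location mapped to `term` or to any of its narrower terms."""
    seen, found, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:  # guard against cycles in the hierarchy
            continue
        seen.add(current)
        found.update(GLOSSARY.get(current, []))
        stack.extend(NARROWER.get(current, []))
    return sorted(found)
```

As said in the interview, a lookup like this does not remove anyone from any system. It only gives the list of places to go and check when an erasure request arrives.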
ANDREW MCLEAN [00:28:10] Well, I can certainly see why I called you a data and security geek. Moya Brannan, thank you very much.
MOYA BRANNAN [00:28:17] Thank you.
ANDREW MCLEAN [00:28:17] That was Moya Brannan from the Egeria project. You have been watching Security Panel, the cybersecurity show, brought to you by Celerity. Thanks for joining us. See you next time.