Kurt Cagle is one of the advisors behind the SEMANTICS 2020 Conference. He is a writer, data scientist and futurist focused on the intersection of computer technologies and society. He is also the founder of Semantical, LLC, a smart data company. In this interview, Kurt shares with us some of his expertise and his vision for the new paths being opened by Semantic Technologies.
The SEMANTiCS Conference is now in its 16th year. We're looking back on a lot of expectations, developments, and hype—how are Semantic Technologies different today compared to ten years ago?
Kurt: I went to my first Semantics conference in 2006, the IWC conference in Japan that the W3C used to host in different countries. I spent the afternoon of my first day there in the hotel atrium, talking to an earnest young developer who was helping to shepherd SPARQL to recommendation status that year. Tim Berners-Lee gave the keynote on the need to move to a purely declarative web, something which never really did come to pass because everyone was still enamored with Javascript, and web developers (and more to the point, web vendors) were doing their best to make the world "all Javascript, all the time". There was a lot of academic interest in Semantics, but very, very little of it was focused on business cases.
My next conference, one of Tony Shaw's events before he founded Dataversity, was in 2009. The conference was fairly active - Apple announced this new concept called Siri, due to debut later that year: an early software agent system that, according to the speakers, would revolutionize the way we interacted with the online world. The papers were still primarily academic, but a few companies, such as the BBC, were beginning to use semantics to help drive and manage their sports reporting, OWL was hot, and SPARQL was beginning to reshape the way people interacted with graph data.
My third Semantics conference, in 2011, was another of Tony's events, and it was disturbingly quiet. The attendees only slightly outnumbered the people doing booth duty in the exposition hall, which was barely a third full. No one was interested. The big news stories were around Hadoop and Big Data, the future was data lakes, and Java programmers, who had gone through a fairly long stretch where not much was happening in their field, were suddenly commanding the big salaries all over again. Semantics was considered a failed tech, and most of the ambitious young companies that had started out in the field were now gone, their developers having long since fled to more lucrative pastures.
You can point to Gartner's hype cycle curve for what happened between 2009 and 2011. Semantics was too far ahead of the curve. It was too complex, too academic, too slow, and too outré to be of interest to the business community. It was too declarative at a time when most programmers were focused on imperative programming. It used big Greek words like ontology and semantics. There were no real killer apps. The hype built and then it crashed, maybe not as badly as the AI winter of the 1970s, but badly enough.
I pin 2013 as the year of the second coming of Semantics. While the industry was still reeling, 2013-14 was when everything changed. The W3C released SPARQL 1.1 and SPARQL 1.1 Update, the first giving a serious boost to the query language while the second made it possible to modify SPARQL databases dynamically in a standard way. I liken this to the SQL world, where SQL 87 codified the standard for relational database queries but it took SQL 89 and the adoption of a DDL for SQL to take off. Hadoop was running out of steam - commodity map/reduce grid processing was (and still is) very cool, but Hadoop got ahead of itself, and its backers tried to make it into a database platform, despite it being inferior to almost anything on the market, because frankly there's very little money to be made in grid processing compared to selling database solutions - and the new corps of data scientists were discovering that the quality of data really did matter more than the quantity.
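To give a sense of what that standardization meant in practice, here is a minimal sketch of a SPARQL 1.1 Update request (the ex: vocabulary and the data are made up for illustration): two operations, submitted as a single request, that add and remove triples from a store with no vendor-specific API involved.

    PREFIX ex: <http://example.org/ns#>

    # Add a triple describing a (made-up) resource...
    INSERT DATA { ex:book1 ex:title "A Sample Title" } ;
    # ...and remove everything the store currently records about another.
    DELETE WHERE { ex:book2 ?p ?o }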
A consequence of Hadoop's failure was that managers discovered that simply aggregating data was not enough. You still needed to organize it, to provide metadata, governance, and trust. What's more, Hadoop did open the gates to the idea that data could be a service, and over that time a whole generation of alternative databases that had been largely stuck in the shadows came to light. Meanwhile, we've had roughly twenty years of performance improvements, which means that in practice commodity computers are now about a thousand times faster than they were when the Semantic Web was first proposed, and are far more interconnected. This means that the multi-layered indexed joins which are a staple of queries on graph databases are now roughly on par with what was state of the art in relational database design in 2013. For smaller data sets, relational will usually be faster, but for the first time, in queries involving billions of triples, graph databases are becoming more efficient than their relational counterparts, and the ability of SPARQL to retrieve structured documents (via CONSTRUCT statements) as well as tables is beginning to take some of the wind out of the document market as well.
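To make the CONSTRUCT point concrete, here is a minimal sketch (the ex: vocabulary is made up): instead of a table of bindings, the query hands back a graph, which the endpoint can then serialize as Turtle, JSON-LD or RDF/XML - in other words, a structured document.

    PREFIX ex:   <http://example.org/vocab#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    # Build a small graph of people and the names of their employers,
    # rather than returning rows.
    CONSTRUCT {
      ?person foaf:name   ?name ;
              ex:worksFor ?orgName .
    }
    WHERE {
      ?person a foaf:Person ;
              foaf:name   ?name ;
              ex:employer ?org .
      ?org    ex:orgName  ?orgName .
    }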
Finally, organizations are at last beginning to understand the value of their metadata. I don't know if I coined the term Metadata Management, but I do know I was one of the first people to use it, in a presentation I gave at one of the Henry Stewart Digital Asset Management conferences in 2014. We tend to think of enterprise data models as something going back decades, but in point of fact the idea that an organization should move towards a consistent language to describe its assets and properties is fairly recent, and that is mostly due to the use of XML in standards such as NIEM from around 2005. Enterprise data is not application data; it is often, in fact, the metadata that applications use. As more and more enterprises find themselves swimming in data, they're looking at where the success stories in managing it are, and increasingly, those are in the realm of graph.
The SEMANTiCS Conference has become the primary gathering for the industry to discuss Semantic Technologies. How do you see the uptake of Semantic Technologies in the industry today?
Kurt: My opinion, which is admittedly biased, is that the double twenties will be the decade of the graph. Part of this - a big part of it, for that matter - is that intelligence, in the sense of computer intelligence rather than AI in the most common parlance, is becoming both distributed and mobile. That's a graph of intelligent agents and edge computing, constantly moving around and changing linkages. These dynamic, real-time networks will be passing around globally unique identifiers, because they exist in a global context rather than as records in a closed database, and they will need to provide discovery in a consistent manner.
Distributed ledgers are also an example of distributed graphs, even though when blockchain was first conceived, it wasn't built around graph tech. It just turns out that once you have unique, verifiable identifiers (which is what blockchain provides), you can both distribute the relevant metadata about a resource outside of the blockchain and discover additional information about it. For instance, consider a car or truck. That vehicle has a globally unique identifier, the vehicle identification number, or VIN. Your car knows its own VIN. The service station that fixes your car can query the car for its diagnostics data, tied to its VIN, and it retains its own records tied to that VIN. The insurance company has your VIN, and it has its records associated with that VIN. The police have your VIN if you've ever been pulled over. In other words, there's a graph that connects your vehicle to all of these other entities.
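As a rough sketch of what that graph makes possible (the graph names, properties and the urn:vin: identifier scheme here are all made up for illustration), a single VIN-based identifier becomes the join point across datasets held by completely different parties:

    PREFIX ex: <http://example.org/vehicle#>

    # One vehicle identifier, records from two different organizations.
    SELECT ?serviceDate ?diagnosticCode ?policyNumber
    WHERE {
      BIND(<urn:vin:1HGCM82633A004352> AS ?vehicle)
      GRAPH ex:serviceStationRecords {
        ?vehicle ex:servicedOn ?serviceDate ;
                 ex:diagnostic ?diagnosticCode .
      }
      GRAPH ex:insurerRecords {
        ?vehicle ex:policyNumber ?policyNumber .
      }
    }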
Cars are unusual today in that they have a surprisingly comprehensive virtual record, but that comes at the cost of immense duplication of information, much of it manually re-entered over and over again. Extend this to drones, to autonomous vehicles, to resources in supply chains, to art, to food shipments and so forth. Anything in the IoT space will have an address, will sit on the edge of the computing mesh, and may very well be mobile. All of these trace a graph in space and time, and it's increasingly obvious that relational systems are not up to the task because they are constrained by referential integrity.
It won't happen overnight, though it may happen faster than people in the industry expect. Part of the reason is that it has been twenty years since the last major global revision of data systems, which occurred because of the Y2K crisis. Many of those systems are now twenty years old, running on failing equipment, and increasingly tied into applications that will fail one after the other as their back-end data systems collapse. Look at the track records of companies now facing digital transformation: many are still attempting to move towards an enterprise warehouse approach just to preserve the data from existing systems, but this is not moving them any closer to the grandiose mission statements that get promulgated in press releases.
What I'm seeing in industry right now is a leading wave of companies, mostly from the Fortune 500, that started knowledge graph pilot projects in the last three years and are now successfully moving beyond those pilots into full implementations. The results of these second-stage implementations are only beginning to go public, but from anecdotal conversations, companies are finding significant value in knowledge graphs in particular, with data hubs and data catalogs rounding out the top three use cases. In this second wave, one thing that is beginning to emerge is a best-practices case for employing machine learning as part of the classification process - not so much to establish the models themselves, but rather to determine, once the initial models are made, how instance data should be categorized into them. Amazon and United Health Group are two companies doing some very interesting things in this space right now.
Something else that is happening - and Neo4J has been leading the charge here - is the treatment of knowledge graph patterns as patterns in their own right: optimizing around the recognition that graph topologies tend to be largely independent of topic. Put another way, you don't necessarily need to understand the problem domain in order to take advantage of recurring patterns. This isn't that surprising - XSLT worked on much the same principle, and SHACL is a tacit admission that context-free patterns may very well be the way forward for a number of problems.
Speaking of SHACL, I'm going to go out on a limb (though I've written about this before) and say that I believe SHACL may very well end up replacing OWL in all but high-inference-requirement situations. OWL is a powerful language, but jumping to OWL so early meant that deep inferential systems dominated the projects coming out of academia, while the value of graph technology as, well, a database tended to languish. In many respects, I think the increased interest from the corporate sector - especially in areas such as fintech, supply-chain management, digital asset management and so forth - is forcing the industry to step back from inferencing, or at least push it closer to being a leaf operation, and SHACL is part of that move. It is not as flexible as OWL in some respects, but SHACL emerged post-SPARQL and addresses issues that are more familiar to information architects and CTOs versed in OMG- or XML-style schema languages than to ontologists. The fact that SHACL can do a surprisingly good job of managing both reporting and user interface generation (with SPARQL providing the oomph on the back end) is another real plus, and I suspect this will have a huge impact in areas such as compliance management and testing. OWL won't go away, but I think its influence is waning.
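For a flavor of why SHACL feels familiar to that audience, here is a minimal shape (the ex: vocabulary is made up): it reads much more like a schema contract than an ontology, and the same declaration can drive validation reports or form generation.

    @prefix sh:  <http://www.w3.org/ns/shacl#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:  <http://example.org/ns#> .

    # Every ex:Customer needs exactly one name and at least one plausible email.
    ex:CustomerShape
        a sh:NodeShape ;
        sh:targetClass ex:Customer ;
        sh:property [
            sh:path     ex:name ;
            sh:datatype xsd:string ;
            sh:minCount 1 ;
            sh:maxCount 1
        ] ;
        sh:property [
            sh:path     ex:email ;
            sh:pattern  "^[^@]+@[^@]+$" ;
            sh:minCount 1
        ] .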
The other major thing I see happening by 2023 is SPARQL 2. I know there's not necessarily a lot of appetite for it in the W3C, but there are a number of problem areas that SPARQL needs to address. Juan Sequeda, the chief data scientist at data.world, and I gave a talk at the Graph conference in Chicago this last October, focusing at least in part on RDF*, which among other things formally recognizes reified statements as first-class structures within the core RDF model, with a consequent updating of SPARQL to reflect structures such as {?s ?p {?ss ?pp ?oo}}. One huge benefit of this is that it brings the semantic and property graph models closer together, as property-graph edge attributes are, if you look at them right, actually reification attributes. Another area that needs to be addressed is the expansion of property paths to allow variable components in SPARQL statements, such as {?a ?b+ ?c} or even {?a ?b{1,5} ?c}, which would let you generate graph-traversal paths between two non-contiguous nodes. Stardog's proposal for SPARQL stored procedures would be a welcome addition as well, especially if it could be tied into an implementation model where users could extend the language via Javascript, Java, Python or even (cough) Scheme. These things were discussed during the build-out of SPARQL 1.1 in 2011 to 2013, but the technology was not yet robust enough to implement that functionality. I don't believe that's a limitation any more.
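None of this is standardized yet, but as a sketch of the flavor (the ex: terms are made up), some stores already implement the RDF*/SPARQL* proposal, which uses << >> to let a triple itself be the subject of other statements - standardized reification, in effect - while variable-length paths over unknown predicates remain off-limits in SPARQL 1.1.

    PREFIX ex: <http://example.org/ns#>

    # RDF*/SPARQL* style, per the proposal: annotate the statement
    # "alice works for acme" with a confidence value.
    SELECT ?confidence
    WHERE {
      << ex:alice ex:worksFor ex:acme >> ex:confidence ?confidence .
    }

    # By contrast, a variable-length path over an unknown predicate, e.g.
    #   { ?a ?b{1,5} ?c }
    # is not legal SPARQL 1.1 (property paths require fixed IRIs), which is
    # exactly the sort of gap a SPARQL 2 would need to close.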
One final note - I'm paying very close attention to what's happening with JSON in this space. JSON-LD has brought semantics to the Javascript community, though with the exception of the context object I worry that the four profiles of JSON-LD are more complicated than they're worth; the JSON-LD community would do well to standardize on the compact profile and deprecate the other forms until a clear use case for them manifests. I've also put GraphQL on my watchlist - it's helping to finally consolidate a query language in the JSON space and, purely coincidentally, happens to work very well with graph databases. I've been spending a huge amount of time in Javascript land over the last year working on Gracie, and GraphQL was right at the top of several of our customers' wish lists. I am increasingly of the belief that GraphQL will actually be the tool that brings the Semantic Web to the Web. (Personal plug here: Gracie is the Graph Curation & Information Exchange, a way of putting knowledge base tools into the hands of non-ontologists for the purpose of building out information exchanges, and I recently co-founded a company, Semantic Data Group, to take it to market.)
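To make the earlier point about the compact form concrete, here is roughly what a compacted JSON-LD document looks like (the data is made up; the terms map onto schema.org): the @context is the only semantic machinery a working Javascript developer has to touch, and the rest is just the JSON they would have written anyway.

    {
      "@context": {
        "name": "http://schema.org/name",
        "homepage": { "@id": "http://schema.org/url", "@type": "@id" }
      },
      "@id": "http://example.org/people/alice",
      "name": "Alice",
      "homepage": "http://example.org/alice"
    }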
So, yeah, the next several years should be exciting in this space.
Kurt Cagle is the CTO of Semantic Data Group, LLC, and writes on LinkedIn and elsewhere under the #theCagleReport hashtag. When not writing about semantics, data modeling or the state of the industry, he writes science fiction novels, does 3D graphics and feeds the cat. Kurt lives in Issaquah, WA, and can be reached at kurt.cagle@gmail.com or on LinkedIn at https://linkedin.com/in/