Data library to data lake – how to harness the power of your data
Nearly every enterprise has heard the clarion call of big data these days. But the data discovery that can lead to better business decisions is easier said than done. It isn’t enough just to put a data warehouse or data lake in place; that only solves the storage problem. The true test of a data-driven enterprise is how you find that data, ensure it is both accurate and compliant, and make it easy for everyone in your organization to access and use.
Storage is a commodity. Trust is much harder to achieve. And insights are impossible without a foundation of stewardship. No matter what technology is employed for your data lake itself, the potential strategic value can fall by the wayside without tooling that is specifically designed for data discovery and getting insights into the hands of your users when they need it most.
Business intelligence (BI) systems provide dashboards that offer a real-time view of what’s happening in your company. But unless BI systems can monitor the complete data pipeline and surface relevant results from the many available (and ever-changing) data sets, they will never provide a complete picture. Indeed, most BI tools cover only a small subset of an organization’s data and analytical needs.
The latest advances in data management software address these issues with a new generation of applications that take the management of data to the next level – beyond mere ‘management’, in fact.
The key is being able to quickly profile and discover data across the enterprise, whether it sits in an on-premises data lake, in the cloud, or arrives from real-time sensors.
The latest software supports a wide range of data sources – JSON, SQL, Parquet, Avro, raw, and compressed. Also supported are multiple, granular access controls that make the data lake easier to govern and manage.
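To make the multi-format point concrete, here is a minimal, illustrative sketch of format-agnostic profiling. It uses only Python’s standard library, so it handles newline-delimited JSON in both raw and gzip-compressed form (Parquet and Avro would need third-party readers); the function names and sample records are hypothetical, not taken from any vendor’s product.

```python
import gzip
import io
import json

def read_json_lines(stream):
    """Parse newline-delimited JSON records from a text stream."""
    return [json.loads(line) for line in stream if line.strip()]

def profile_records(records):
    """Build a simple per-field profile: occurrence count and observed types."""
    profile = {}
    for rec in records:
        for key, value in rec.items():
            stats = profile.setdefault(key, {"count": 0, "types": set()})
            stats["count"] += 1
            stats["types"].add(type(value).__name__)
    return profile

# The same sample data arrives both raw and compressed; one profiler serves both.
raw = b'{"id": 1, "region": "EU"}\n{"id": 2, "region": "US"}\n'
compressed = gzip.compress(raw)

records = read_json_lines(io.TextIOWrapper(io.BytesIO(raw)))
records += read_json_lines(io.TextIOWrapper(gzip.GzipFile(fileobj=io.BytesIO(compressed))))

print(profile_records(records))
```

The point of the sketch is the separation of concerns: once decompression and parsing are abstracted away, the same profiling logic runs unchanged over every source format.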
Getting access to data, in its various forms and sources, is but one aspect of the process; being able to parse the data into something meaningful to the enterprise is quite a different, more complex challenge. Your measure of success here will provide the return on investment (ROI) in the data lake.
As all areas of an enterprise contribute to data lakes, it seems both logical and strategically advantageous if all areas, likewise, have the ability to get usable results from their own data, as well as data that’s pertinent and useful from other parts of the lake.
Everyone who works with data (not just data analysts or IT professionals) needs to search, query and collaborate. To achieve this, they need data catalog solutions that can capture data from wherever it is stored and create the relationships between data sets that ultimately lead to meaning.
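The catalog idea described above can be sketched in a few lines. This is a toy model, not any vendor’s design: the class and method names are invented for illustration, showing only the two essentials – registering assets with searchable metadata, and recording relationships between data sets.

```python
from collections import defaultdict

class DataCatalog:
    """Toy catalog: register data sets, link related ones, search by tag."""

    def __init__(self):
        self.datasets = {}             # name -> metadata
        self.links = defaultdict(set)  # name -> names of related data sets

    def register(self, name, source, tags):
        self.datasets[name] = {"source": source, "tags": set(tags)}

    def relate(self, a, b):
        # Relationships are bidirectional: joins, lineage, shared keys, etc.
        self.links[a].add(b)
        self.links[b].add(a)

    def search(self, tag):
        return sorted(n for n, m in self.datasets.items() if tag in m["tags"])

    def related(self, name):
        return sorted(self.links[name])

catalog = DataCatalog()
catalog.register("orders", "warehouse", ["sales", "revenue"])
catalog.register("customers", "crm", ["sales"])
catalog.relate("orders", "customers")

print(catalog.search("sales"))    # both data sets carry the 'sales' tag
print(catalog.related("orders"))
```

Even this tiny structure shows why relationships matter: a search for ‘sales’ surfaces both assets, and following the link from one immediately reveals the other.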
Achieving a single point of reference
Data in the modern enterprise is rarely unified, with no single point of reference discernible. All data therefore needs initially to be treated as being of equal value, before its business ‘weight’, or potential for strategic use, can be assessed.
Data cleansing as a process is as old as the first database. Some data lake management products offer it as part of the package; others interface with third-party solutions that specialize in this activity.
Across multiple or larger lakes, business-value assessment of data becomes wildly complex and time-consuming. Some data lake management solution architects are employing machine learning (ML) to find robust ways to calculate the potential economic benefit of every information asset.
Usage metrics are a useful way to evaluate the impact of data assets. Put simply, if datasets are used often over time, they must have value – if not, they wouldn’t be referred to again and again. It can also be useful to see who in an organization is using the data. For example, if your Chief Data Officer (CDO) regularly references a specific data set, you can safely assume that it has value. What that value is will, of course, depend on your specific application.
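A usage-based valuation like the one described can be reduced to a weighted count. The sketch below is purely illustrative: the role weights are hypothetical (giving a CDO’s repeated use more signal than a one-off query, per the example above), not a published scoring formula.

```python
from collections import Counter

# Hypothetical role weights: repeated use by a CDO signals more value
# than an occasional engineering query.
ROLE_WEIGHTS = {"cdo": 5.0, "analyst": 2.0, "engineer": 1.0}

def usage_score(access_log):
    """Score each data set by access count, weighted by the accessor's role."""
    scores = Counter()
    for dataset, role in access_log:
        scores[dataset] += ROLE_WEIGHTS.get(role, 1.0)
    return scores

log = [("revenue_q2", "cdo"), ("revenue_q2", "analyst"),
       ("clickstream", "engineer"), ("revenue_q2", "cdo")]

ranked = usage_score(log).most_common()
print(ranked)  # 'revenue_q2' outranks 'clickstream'
```

Tracked over time, even a crude score like this separates the assets an organization actually depends on from the ones merely taking up space in the lake.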
We can see, in conclusion, that assessing, preparing and cataloging data is only a starting point in the broader field of strategic data use for business improvement. Software needs to be able to work consistently across all kinds of information and be able to track that information’s impact and return over time. If data cannot be used effectively to derive valuable insights, why bear the cost of data at all?
Here are three software providers who can help enterprises of all types navigate and benefit from today’s most precious business commodity: data.
Alation
By connecting to all of your data sources and BI systems, the Alation Data Catalog provides a single source of reference for all of the data assets in an enterprise.
Alation makes it easy to find the data you need and get the answers you trust. A searchable catalog of assets (tables, schemas, queries) is created automatically in real time. A smart query tool makes proactive recommendations as you write your queries and can be used by business users who aren’t proficient with SQL.
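One way such query recommendations can work – sketched here as a generic illustration, not Alation’s actual algorithm – is to match a typed fragment against cataloged column names and rank the matches by how often each column is used. The column names and usage counts below are invented for the example.

```python
# Hypothetical usage counts a catalog might track for each column.
COLUMN_USAGE = {"revenue": 120, "region": 85, "rep_id": 10, "returns": 40}

def suggest(fragment, limit=3):
    """Suggest catalog columns matching a typed fragment, most-used first."""
    matches = [c for c in COLUMN_USAGE if c.startswith(fragment.lower())]
    return sorted(matches, key=COLUMN_USAGE.get, reverse=True)[:limit]

print(suggest("re"))  # → ['revenue', 'region', 'returns']
```

Ranking by popularity is what lets a business user who doesn’t know the schema still land on the column everyone else is already querying.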
In addition, a combination of machine learning and human collaboration makes it possible to track and monitor how data is being used, providing insights into the relative value of that data.
One unique feature is the ability for any user, regardless of technical experience, to rank data, providing a level of grassroots governance that Alation calls “Governance for Insight”. The ability to include business rules in a catalog means that everyone in your organization who touches data is, quite literally, on the same page. For example, your sales and marketing teams can share the same definition of ‘revenue’.
Alation provides a solid foundation for self-service analytics, business intelligence (BI) and visualization that immediately surfaces any problems that may be occurring in your data pipeline, both within and across various applications.
The Alation Data Catalog can be used either on-premises or in the cloud, and it works with data lake providers such as Teradata, Kylo and Hortonworks.
Zoomdata
Zoomdata is an Apache Hadoop data visualization tool that connects directly to SQL-on-Hadoop technologies such as Impala, Hive, Spark SQL and Presto, among others.
It can also connect to and analyze streaming and search data (via WebSockets) and, through a range of APIs, forms a truly interconnected suite of software.
With Zoomdata Fusion, data is combined from multiple sources, making it appear as a single repository. The latest sources of so-called ‘big data’ can be combined and amalgamated with a variety of relational databases’ assets and even flat files.
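The core idea of such fusion – presenting heterogeneous sources as one virtual table – can be illustrated with a toy join. This is a generic sketch under invented data, not Zoomdata Fusion’s implementation: one source stands in for relational query results, the other for a flat CSV file, merged on a shared key.

```python
import csv
import io

# Source 1: rows as they might come back from a relational query.
db_rows = [{"sku": "A1", "units": 30}, {"sku": "B2", "units": 12}]

# Source 2: the same entities described in a flat CSV file.
csv_text = "sku,price\nA1,9.99\nB2,4.50\n"
file_rows = list(csv.DictReader(io.StringIO(csv_text)))

def fuse(left, right, key):
    """Present two sources as one virtual table by joining on a shared key."""
    index = {row[key]: row for row in right}
    return [{**l, **index.get(l[key], {})} for l in left]

combined = fuse(db_rows, file_rows, "sku")
print(combined)
```

The consumer of `combined` never needs to know that `units` came from a database and `price` from a file – which is precisely the single-repository illusion described above.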
Cloud sources that Zoomdata is happy to interact with include Salesforce, Google Analytics, Marketo, Zendesk, and SendGrid.
The real-time nature of streamed data is reflected in Zoomdata’s GUI through what the company calls ‘data sharpening’. When a user creates a visualization, it ‘sharpens’ with incremental updates as the rest of the query completes and more data becomes available.
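The idea behind progressive refinement of this kind can be shown with a running aggregate: each partial result immediately updates the estimate instead of waiting for the full query. This is a conceptual sketch with made-up numbers, not Zoomdata’s sharpening algorithm.

```python
def sharpen(chunks):
    """Yield a running estimate of the mean as each partial result arrives."""
    total, count = 0.0, 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
        yield total / count  # the estimate 'sharpens' toward the final answer

# Partial results arriving from a long-running query, in three batches.
partial_results = [[10, 20], [5], [15, 10]]
estimates = list(sharpen(partial_results))
print(estimates)  # early estimates refine as more data lands
```

The user sees a chart after the first chunk rather than after the last, and each subsequent batch nudges the picture toward the exact answer.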
The suite makes full use of Apache Spark, Impala and Kudu, technologies which, like Zoomdata, were designed to scale easily.
Zoomdata’s capabilities allow the customization of data presentation and visualization. Apps can be constructed which feature interactive visual analytics to support a specific business process or act as a portal that delivers data and analytics to partners or suppliers.
The company provides an API for creating and managing users, groups, connections, and data sources.
Hortonworks
Hortonworks’ platform is designed for enterprise data center deployment, where it aims to analyze both data at rest and incoming data.
The offering is based on Apache Hadoop technologies, and the company, based in California, employs many contributors to the open-source Hadoop ecosystem.
Under the hood, the software includes Hadoop technologies such as the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, and HBase.
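Of these, MapReduce is the processing model the rest builds on, and its map–shuffle–reduce cycle is easy to sketch. The single-process word count below is a conceptual illustration of the model only; a real Hadoop job distributes each phase across the cluster.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in a document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts emitted for each word.
    return {key: sum(values) for key, values in grouped.items()}

docs = ["big data in the lake", "data in the data lake"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
print(counts["data"])  # 3
```

Because the map and reduce steps are independent per key, the framework can run them on thousands of machines – which is what lets HDP process data at lake scale.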
The Hortonworks Data Platform (HDP) product is used for storing, processing, and analyzing large volumes of data. The platform is designed to deal with data from different sources and formats. HDP creates an enterprise-level data ‘lake’ and from this, is able to uncover business insights in real time.
Hortonworks data center solutions combine the Hortonworks Data Platform (HDP) and Hortonworks DataFlow (HDF), and promise integration with existing legacy systems.
Hortonworks DataFlow (HDF) collects, curates, analyzes and delivers real-time data from the Internet of Anything (IoAT) – devices, sensors, clickstreams, log files and so forth.
Although the company’s Q2 2017 results did not show a particularly healthy balance sheet, they were an improvement on the previous year’s figures and triggered a significant rise in the company’s stock price.