
Three ways to turn old files into Hadoop data sets in a data lake

By David Loshin

As organizations reconsider their data architectures to enable new analytics applications, we're seeing a transition in which traditional data warehouses are being augmented by big data environments. And while Hadoop data lakes typically hold new types of data, they can also be a suitable repository for older information that has analytical value waiting to be tapped.

One of the reasons why Hadoop systems are being integrated with data warehouses is to move cold data that isn't accessed frequently from a warehouse database to Hive tables running on top of the Hadoop Distributed File System (HDFS). This mingling of conventional databases with Hadoop is often a first step in the data modernization process, and it opens up a range of new options for creating useful Hadoop data sets.
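
As a rough illustration of that offloading step, here is a minimal PySpark sketch that pulls a cold slice of a warehouse table over JDBC and registers it as a partitioned Hive table backed by HDFS. The connection URL, credentials, table names and cutoff year are placeholders invented for the example, and the details will vary with your warehouse and Hadoop distribution.

    # Hypothetical sketch: offload cold warehouse rows into a Hive table on HDFS.
    # Assumes a Spark build with Hive support, a reachable JDBC source and an
    # existing 'archive' database in the Hive metastore; names are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("warehouse-cold-data-offload")
             .enableHiveSupport()
             .getOrCreate())

    # Read the rarely accessed historical rows from the warehouse.
    cold_orders = (spark.read.format("jdbc")
                   .option("url", "jdbc:postgresql://warehouse-host:5432/edw")
                   .option("dbtable", "(SELECT * FROM orders WHERE order_year < 2015) AS cold")
                   .option("user", "etl_user")
                   .option("password", "etl_password")
                   .load())

    # Write the data as a partitioned Hive table whose files live on HDFS.
    (cold_orders.write
     .mode("overwrite")
     .partitionBy("order_year")
     .saveAsTable("archive.orders_cold"))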

A particularly promising aspect involves migrating the massive volumes of historical data hidden away in many data warehouses to big data environments to make the information more accessible for analysis. In a lot of cases, that data is stored in mainframe formats, such as VSAM data sets, IMS databases and flat files laid out by COBOL copybooks. When planning a legacy data migration to a data lake, you have to consider the different alternatives for the target format based on the anticipated use cases for the data.

For example, if the goal is purely to move data off of an aging platform and onto a more modern one for ongoing storage, the sensible approach might be to simply copy the mainframe files to HDFS and take advantage of the redundancy and fault tolerance that Hadoop clusters provide. Once all the files are moved, you can move on to preparing the old system for retirement.
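
For that straight-copy scenario, the sketch below shows one way to script the transfer. It assumes the extracted legacy files already sit on an edge node with the Hadoop client tools installed; the local and HDFS paths are placeholders.

    # Hypothetical sketch: bulk-copy extracted legacy files into HDFS for safekeeping.
    # Assumes the files are already on an edge node where the 'hdfs' CLI is available;
    # source and target paths are placeholders.
    import subprocess
    from pathlib import Path

    LOCAL_DIR = Path("/data/mainframe_extracts")
    HDFS_DIR = "/archive/mainframe_extracts"

    # Create the target directory in HDFS (no-op if it already exists).
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR], check=True)

    # Copy each extracted file; HDFS replication supplies the redundancy.
    for local_file in LOCAL_DIR.glob("*.dat"):
        subprocess.run(
            ["hdfs", "dfs", "-put", "-f", str(local_file), HDFS_DIR],
            check=True,
        )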

More than a data salvage job

However, there's growing interest in not just salvaging legacy data, but putting it to productive use. Many older data sets include years of transaction data or operational logs that can be subjected to various forms of advanced analytics, such as time series analysis and machine learning algorithms, to look for patterns in the data that can help predict future trends and business opportunities.

If that's your goal, copying the existing files as is won't be sufficient, unless your analytics applications are engineered to read them in the original mainframe source format. The question then becomes this: How do you transform the legacy files into Hadoop data sets that are suited to modern-day analytics?
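
To make that concrete before weighing the options: purpose-built connectors such as the open source Cobrix library can read EBCDIC data and COBOL copybook layouts directly in Spark, but as a simpler illustration, the sketch below assumes the records have already been converted to ASCII fixed-width text. It carves each record into typed columns and writes an analytics-friendly Parquet data set; the record layout and field names are invented for the example.

    # Hypothetical sketch: turn an ASCII fixed-width extract of a legacy file
    # into a typed, columnar data set that SQL and machine learning tools can query.
    # The record layout, field positions and names are invented for illustration.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, substring, to_date, trim

    spark = SparkSession.builder.appName("legacy-extract-conversion").getOrCreate()

    # Each line of the extract is one fixed-width record.
    raw = spark.read.text("/archive/mainframe_extracts/transactions.dat")

    transactions = raw.select(
        trim(substring(col("value"), 1, 10)).alias("account_id"),
        to_date(substring(col("value"), 11, 8), "yyyyMMdd").alias("txn_date"),
        # Assumes an implied two-decimal amount field stored as digits.
        (substring(col("value"), 19, 11).cast("long") / 100).alias("amount"),
        trim(substring(col("value"), 30, 4)).alias("txn_type"),
    )

    # Parquet (or a Hive table) makes the history queryable by downstream tools.
    transactions.write.mode("overwrite").parquet("/analytics/transactions_history")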

Let's look at three alternatives for converting mainframe files to formats that can support broader analysis of the data.

So, which is the best way to go? That depends largely on the kinds of analytics uses an organization has in mind. But it's important to recognize that data modernization usually doesn't happen overnight. In fact, over time, as the community of data users grows and their analytics needs expand, there may be opportunities to take advantage of each of these approaches to maximize the value of the resulting Hadoop data sets.

01 Feb 2018
