
Three ways to turn old files into Hadoop data sets in a data lake

By David Loshin

As organizations reconsider their data architectures to enable new analytics applications, we're seeing a transition in which traditional data warehouses are being augmented by big data environments. And while Hadoop data lakes typically hold new types of data, they can also be a suitable repository for older information that has analytical value waiting to be tapped.

One of the reasons why Hadoop systems are being integrated with data warehouses is to move cold data that isn't accessed frequently from a warehouse database to Hive tables running on top of the Hadoop Distributed File System (HDFS). This mingling of conventional databases with Hadoop is often a first step in the data modernization process, and it opens up a range of new options for creating useful Hadoop data sets.
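
As a rough illustration of that offloading step, here is a minimal PySpark sketch that pulls a cold slice of a warehouse table over JDBC and registers it as a partitioned Hive table backed by HDFS. The connection URL, credentials, table names and cutoff year are placeholders invented for the example, and the details will vary with your warehouse and Hadoop distribution.

    # Hypothetical sketch: offload cold warehouse rows into a Hive table on HDFS.
    # Assumes a Spark build with Hive support, a reachable JDBC source and an
    # existing 'archive' database in the Hive metastore; names are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("warehouse-cold-data-offload")
             .enableHiveSupport()
             .getOrCreate())

    # Read the rarely accessed historical rows from the warehouse.
    cold_orders = (spark.read.format("jdbc")
                   .option("url", "jdbc:postgresql://warehouse-host:5432/edw")
                   .option("dbtable", "(SELECT * FROM orders WHERE order_year < 2015) AS cold")
                   .option("user", "etl_user")
                   .option("password", "etl_password")
                   .load())

    # Write the data as a partitioned Hive table whose files live on HDFS.
    (cold_orders.write
     .mode("overwrite")
     .partitionBy("order_year")
     .saveAsTable("archive.orders_cold"))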

A particularly promising aspect involves migrating the massive volumes of historical data hidden away in many data warehouses to big data environments to make the information more accessible for analysis. In a lot of cases, that data is stored in mainframe formats, such as VSAM data sets, IMS databases and flat files laid out by COBOL copybooks. When planning a legacy data migration to a data lake, you have to consider the different alternatives for the target format based on the anticipated use cases for the data.

For example, if the goal is purely to move data off of an aging platform and onto a more modern one for ongoing storage, the sensible approach might be to simply copy the mainframe files to HDFS and take advantage of the redundancy and fault tolerance that Hadoop clusters provide. Once all the files are moved, you can move on to preparing the old system for retirement.
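
For that straight-copy scenario, the sketch below shows one way to script the transfer. It assumes the extracted legacy files already sit on an edge node with the Hadoop client tools installed; the local and HDFS paths are placeholders.

    # Hypothetical sketch: bulk-copy extracted legacy files into HDFS for safekeeping.
    # Assumes the files are already on an edge node where the 'hdfs' CLI is available;
    # source and target paths are placeholders.
    import subprocess
    from pathlib import Path

    LOCAL_DIR = Path("/data/mainframe_extracts")
    HDFS_DIR = "/archive/mainframe_extracts"

    # Create the target directory in HDFS (no-op if it already exists).
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR], check=True)

    # Copy each extracted file; HDFS replication supplies the redundancy.
    for local_file in LOCAL_DIR.glob("*.dat"):
        subprocess.run(
            ["hdfs", "dfs", "-put", "-f", str(local_file), HDFS_DIR],
            check=True,
        )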

More than a data salvage job

However, there's growing interest in not just salvaging legacy data, but putting it to productive use. Many older data sets include years of transaction data or operational logs that can be subjected to various forms of advanced analytics, such as time series analysis and machine learning algorithms, to look for patterns in the data that can help predict future trends and business opportunities.

If that's your goal, copying the existing files as is won't be sufficient, unless your analytics applications are engineered to read them in the original mainframe source format. The question then becomes this: How do you transform the legacy files into Hadoop data sets that are suited to modern-day analytics?
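
To make that concrete before weighing the options: purpose-built connectors such as the open source Cobrix library can read EBCDIC data and COBOL copybook layouts directly in Spark, but as a simpler illustration, the sketch below assumes the records have already been converted to ASCII fixed-width text. It carves each record into typed columns and writes an analytics-friendly Parquet data set; the record layout and field names are invented for the example.

    # Hypothetical sketch: turn an ASCII fixed-width extract of a legacy file
    # into a typed, columnar data set that SQL and machine learning tools can query.
    # The record layout, field positions and names are invented for illustration.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, substring, to_date, trim

    spark = SparkSession.builder.appName("legacy-extract-conversion").getOrCreate()

    # Each line of the extract is one fixed-width record.
    raw = spark.read.text("/archive/mainframe_extracts/transactions.dat")

    transactions = raw.select(
        trim(substring(col("value"), 1, 10)).alias("account_id"),
        to_date(substring(col("value"), 11, 8), "yyyyMMdd").alias("txn_date"),
        # Assumes an implied two-decimal amount field stored as digits.
        (substring(col("value"), 19, 11).cast("long") / 100).alias("amount"),
        trim(substring(col("value"), 30, 4)).alias("txn_type"),
    )

    # Parquet (or a Hive table) makes the history queryable by downstream tools.
    transactions.write.mode("overwrite").parquet("/analytics/transactions_history")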

Let's look at three alternatives for converting mainframe files to formats that can support broader analysis of the data.

So, which is the best way to go? That depends largely on the kinds of analytics uses an organization has in mind. But it's important to recognize that data modernization usually doesn't happen overnight. In fact, over time, as the community of data users grows and their analytics needs expand, there may be opportunities to take advantage of each of these approaches to maximize the value of the resulting Hadoop data sets.

01 Feb 2018
