Big data analytics projects are on many organizations' "to-do" lists, which isn't surprising given the many benefits a successful implementation can deliver. As with any new technology, it would be unwise to introduce a big data solution into your technology ecosystem haphazardly, without properly analyzing and preparing for this kind of project.
Big data can't prove its business value if it remains in a perpetual proof-of-concept phase, without the availability, security, backup and recovery, and other reliability assurances we take for granted on our traditional transactional systems. How can you prepare your big-data deployment for delivery into a production IT environment? And what exactly does it mean to say that big data, or any IT initiative, is truly ready for production?
Production readiness means that your big-data investment is prepared to realize its full operational potential. If you think "productionizing" can be done in one step, such as introducing HDFS NameNode redundancy, then you may want to take a step back and re-evaluate the situation. Productionizing demands a lifecycle focus that encompasses all of your big-data platforms, not just a single one (e.g., Hadoop/HDFS), and addresses more than a single requirement (e.g., ensuring a highly available distributed file system).
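To see why NameNode redundancy is only one configuration exercise among many, consider what that single step itself involves. A minimal sketch of HDFS NameNode high availability in hdfs-site.xml (the nameservice ID and hostnames below are placeholders, and ZooKeeper-based automatic failover still has to be provisioned separately):

```xml
<configuration>
  <!-- Logical name for the HA pair of NameNodes -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>namenode2.example.com:8020</value>
  </property>
  <!-- Clients discover the active NameNode through this proxy provider -->
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <!-- Automatic failover requires a running ZooKeeper quorum -->
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>
```

Even this one "step" pulls in failover testing, ZooKeeper operations, and client reconfiguration, which is exactly why productionizing is a lifecycle effort rather than a checkbox.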
There are many proper (and improper) ways to implement a big data solution in your environment. What's required is a list of steps that big data analytics project managers can take to set their programs down the right path, one that leads to the expected business value and a strong return on investment.
Productionizing involves jumping through a series of procedural hoops to ensure that your big-data investment can function as a reliable business asset. Here are several high-level considerations to keep in mind as you prepare your big-data initiative for prime-time deployment:
1. Choose your big data platform.
There's a large selection of big data platforms available on the market these days, and the list keeps growing. Choosing the right one for your organization will be challenging, but the choice ultimately comes down to your requirements. Two of the most popular options are Hadoop and MongoDB (you can read our comparison of the two platforms here). Hadoop tends to perform better with long-running ETL jobs, while MongoDB tends to handle ad hoc data better. Regardless of which platform you choose, big data solutions are typically ecosystems of software packages whose modular components can be combined for their intended purposes.
2. Outsource or insource?
Let's face it: finding and retaining talent specializing in big data technologies is an immense challenge. Adding to this burden is the amount of time it takes to procure those resources. There are, of course, pros and cons to both strategies, but if your organization isn't opposed to using a third party that specializes in big data consulting, it may be a valuable option to consider. Another option is a hybrid approach, in which you engage consultants for the architecture and for training in-house staff, while retaining consultants for critical roles that may not be suited to existing resources.
3. Set realistic expectations.
In organizations that are new to big data projects, high expectations can be set by technology vendors that claim big data tools are easy to use and point to other enterprises that have gained significant business value from them. But every organization has its own intricacies, so there is no template for success when it comes to big data implementations. Big data solutions can handle many use cases; just make sure that you pick the right ones for your company, and that you understand the limitations of big data.
4. Harden your big-data technology stack.
Plan hardening for your databases, middleware, applications, and tools to address the full scope of SLAs associated with your main use cases. If the big data platform you choose does not fulfill the availability, security, and other core requirements expected of most enterprise infrastructure, it's not a viable solution. Preferably, your big-data platform should benefit from a common set of enterprise management tools. Key guidelines here are:
- Leverage your big-data solution provider’s high availability, security, resource utilization, mixed-workload management, performance boost, health monitoring, policy management, job planning and other cluster management features;
- Have a roadmap for high availability on your big-data clusters by implementing redundancy on all nodes, with load balancing, resynchronization and hot standbys;
- Prepare thorough regression testing plans for every layer in your target big-data deployment prior to going live, and make sure your data, jobs and applications won't crash or encounter bottlenecks in daily operations; and
- Avoid migrating big-data analytics jobs to your clusters until you've hardened them for 24×7 availability and ease of configuration and administration.
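The health-monitoring guideline above can be made concrete with a small script. This is a hedged sketch rather than any vendor's tooling: it reads DataNode liveness counts from the NameNode's built-in JMX endpoint (exposed on the NameNode web UI port; the URL below is a placeholder), which is one simple signal a cluster health check can alert on.

```python
import json
import urllib.request

def datanode_health(jmx_payload: str) -> dict:
    """Extract live/dead DataNode counts from a NameNode /jmx response."""
    beans = json.loads(jmx_payload)["beans"]
    state = next(b for b in beans
                 if b.get("name") == "Hadoop:service=NameNode,name=FSNamesystemState")
    return {"live": state["NumLiveDataNodes"], "dead": state["NumDeadDataNodes"]}

def fetch_health(namenode_url: str) -> dict:
    """Query the NameNode's JMX servlet, e.g. http://namenode1.example.com:9870"""
    qry = "/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"
    with urllib.request.urlopen(namenode_url + qry) as resp:
        return datanode_health(resp.read().decode())
```

In practice a check like this would feed an alerting system; a rising dead-node count is an early warning well before jobs start failing.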
5. Architect your environment for scalability to keep pace with your organization’s data growth.
If you can't provision, add, or reallocate storage, compute, and network capacity on the big-data platform in a quick, cost-effective, modular way to meet new requirements, the platform is not ready for production. Key guidelines in this respect are:
- Scale your big data using scale-in, scale-up and scale-out techniques.
- Speed up your big data with workload-optimized integrated systems fit for cloud deployment.
- Optimize your big data's distributed storage layer.
- Retune and rebalance your big data workloads continuously.
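It helps to put rough numbers behind the scale-out guideline above. A back-of-the-envelope sketch (the growth rate, node size, and replication factor are illustrative assumptions, not recommendations):

```python
import math

def months_until_full(used_tb: float, capacity_tb: float,
                      monthly_growth_rate: float) -> int:
    """Months until storage fills, assuming compound monthly growth.
    Capped at 10 years so a zero growth rate can't loop forever."""
    months = 0
    while used_tb < capacity_tb and months < 120:
        used_tb *= 1 + monthly_growth_rate
        months += 1
    return months

def nodes_to_add(projected_tb: float, raw_capacity_tb: float,
                 tb_per_node: float, replication: int = 3) -> int:
    """Extra nodes needed to hold projected logical data after replication."""
    shortfall = projected_tb * replication - raw_capacity_tb
    return 0 if shortfall <= 0 else math.ceil(shortfall / tb_per_node)
```

For example, a cluster at 50 TB of a 100 TB capacity growing 5% per month fills in about 15 months; a simple projection like this tells you how far in advance hardware procurement needs to start.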
6. Create a strategy for data availability and analysis.
Big data platforms need to integrate with your environment not only for ingestion, but also for business intelligence. Whether for ad hoc queries or long-running analytics and ETL jobs, have a plan for which technologies will handle the data analysis and how your data scientists will utilize the data. This important step will help you realize immense cost and efficiency benefits.
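One lightweight way to make that plan concrete is an explicit workload-to-engine routing table, so everyone knows which technology serves which class of job. A sketch (the engine names echo the platforms discussed earlier and are illustrative only):

```python
# Illustrative mapping of workload classes to the engines planned to serve them.
ROUTING = {
    "ad_hoc_query": "MongoDB",           # interactive, low-latency access
    "batch_etl":    "Hadoop MapReduce",  # long-running, throughput-oriented
    "bi_reporting": "EDW",               # curated, aggregated views for BI tools
}

def route(workload_type: str) -> str:
    """Return the engine planned for a workload class, or fail loudly."""
    try:
        return ROUTING[workload_type]
    except KeyError:
        raise ValueError(f"No engine planned for workload: {workload_type}")
```

The value isn't the code itself but the discipline: writing the mapping down forces the availability-and-analysis strategy to be explicit before workloads arrive.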
To the extent that your enterprise already has a mature enterprise data warehousing (EDW) program in production, you can use that as the template for your big-data platform. There is no need to redefine “productionizing” for big data’s sake.
Clearly, there are both big risks and big rewards in undertaking a big data analytics project. But with genuine attention to sound project management practices, project managers and their teams can reduce the downsides and make deployments a big business opportunity for their organizations.