All the “C’s” in your office (CMO, CFO, CIO) have been asking you about your Big Data initiative or strategy, and you realize that regardless of the use case and the ROI, you’re going to need talent to execute the vision. The talent pool in Information Technology was strained prior to the Big Data movement, and our demand for savvy, technically sophisticated professionals in this new domain has exacerbated the shortage. If the Big Data resource types that you may be considering include Hadoop-related Developers and/or Architects, Hadoop Administrators and, potentially, Data Scientists, you may be asking:
- How do you make sure that the talent you bring in from outside sources has the appropriate background to be productive?
- Is there an in-house skill set that would migrate easily to the Hadoop ecosystem?
Let’s explore the answers to these questions.
First, let’s get a high-level understanding of Hadoop.
Hadoop is not a language. It is an ecosystem of multiple components that allow you to create a scalable, distributed system framework that is fault-tolerant and relatively cost-efficient. Some of the components are HDFS, Hive, Pig, MapReduce, Spark, YARN, Sqoop, Flume, Oozie, and Mahout. (For more detail, a good starting point for reference is our introduction to Big Data.) The three major distributions of Hadoop come from Cloudera, MapR, and Hortonworks.
What is the ideal technical background of a Big Data developer?
Of course, the right answer is, “it depends on the role or task(s) within the project”. We typically consider three types of resources: Architects/Developers, Administrators, and Data Scientists. Let’s take a closer look at each:
- First, Hadoop Architects / Developers. At Aptude, we believe a good place to start is Java. Many of the components of Hadoop are Java-based, and a solid understanding of the language will let your developers extend many of those components. Again, it depends on the organization: if your team’s skill sets are Java-rich, you will move smoothly into Hadoop development.
If Java isn’t at the core of your team’s ability, you can also consider languages such as Python, Ruby, Perl, or C#. Additionally, the Hadoop ecosystem includes Pig, whose scripting language (Pig Latin) allows developers to create MapReduce jobs without the complexity of Java. As an aside, a developer I met at Hadoop World last October commented that when it comes to Java vs. Pig, his recommendation is “Use Pig as much as possible and when you can’t use Pig, try Pig!” Two other commonly used abstraction layers for manipulating data within HDFS are Hive and Impala. Both are queried with languages very similar to SQL, so one of your database SQL developers might move well into this domain.
- What type of background helps transition folks to Hadoop Administrators? As the name implies, someone with a Systems/Network Administration background would be suitable. Specific experience with Linux will serve well since this is Hadoop’s operating system of choice, although it is worth noting that Hadoop can run in a Windows environment. Furthermore, experience in Cluster, Job, Storage and Performance Management will benefit your soon-to-be “Hadoop Administrator”. A familiarity with Java is always a plus for Hadoop Administrators, as is any experience with open-source configuration tools.
- Making heads or tails of the data…or your Data Scientist. Depending on the use case, the services of a Data Scientist may be in order. What is a Data Scientist? Well, as mentioned above, “it depends”. I’m not going to try to define the role so much as discuss some of its attributes. A “Data Scientist”, to some organizations, may be a Data Analyst. You have Data Analysts already, don’t you? What about SQL Developers, Report-Writers, Excel Pivot Table specialists, or visualization tool specialists (Tableau/QlikView)? These folks may transition well into the Data Scientist role. Maybe Data Analyst = Data Scientist?
We also have more of the statistician-type Data Scientists. Predictive analytics, machine learning, artificial intelligence, and actuarial science are all examples of use-case domains where a statistician-type Data Scientist will be necessary. These Data Scientists rely on specific code-based skills, including R, Python, and SAS. On top of it all, a Data Scientist should also have some business vision, the ability to see the “big picture” rather than being just a “doer”. Can you find all of these attributes in one person? You might, but it may be a difficult proposition. A Statistician/R Programmer may be one type of Scientist and a SQL Programmer/Visual Analytics specialist may be another.
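For readers who have not seen the map-reduce pattern that the developer discussion above keeps referring to, here is a minimal sketch of a word count in plain Python. It is only an in-memory illustration of the idea (map each record to key/value pairs, group by key, reduce each group); it is not Hadoop code, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical names chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

On a real cluster the framework distributes these steps across many machines, and layers such as Pig or Hive let a developer express the same grouping and counting in a few lines of script or SQL-like query, which is why scripting and SQL backgrounds can carry over.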
How to Ramp Up on the Technology and Acquire the Necessary Skills
My first suggestion is to start “small” on Big Data. Although Hadoop may be a revolutionary technology and the Hadoop Cluster may someday be your Data-Lake and basis for all future transactions, we should view it at the moment as a “Plus-One” technology. Let’s find some low-hanging fruit to justify bringing it into the organization and start with a small cluster and a measurable objective (you can also read more at our Big Data Use Cases blog post).
My next suggestion is to bring in some form of professional outside help in the areas you will be focusing on. If you have decided on a Hadoop distribution platform (e.g. Cloudera’s CDH, Hortonworks, or MapR), then bringing in Hadoop resources that have experience with that implementation will be of value. Furthermore, it may be helpful to bring in Architect-type resources from one of the leading distributors to make sure your environment is configured properly and you get your POC off to a good start. However, this isn’t to say that all of your resources should come from the Hadoop distributors. I say this because the cost may be prohibitive and they may or may not have an interest in sharing knowledge with your team. If you find yourself a strong Hadoop Systems Integrator or Big Data Staffing Specialist who has experienced Big Data resources, then you can save the organization money while still having a very successful initiative. Depending on your organization’s in-house resource capability, it may be necessary to bring in outside Hadoop Administrators or Data Scientists as well. If this is a Proof of Concept and not necessarily part of your core long-term objectives, then engaging a Big Data Staffing Specialist firm on a contract basis will be your best route to mitigate the HR hiring/firing risk while ensuring you’re working with a firm that actually understands Big Data and the important attributes of a Big Data consultant.
Yet another approach is to augment your team’s Big Data skill set with professional training. There are many training options available, and some of the Hadoop distributors have started offering free training as an incentive to use their platform. Training is a critical component of a successful long-term Big Data initiative because some level of Big Data knowledge needs to be in-house. In addition, training in this newer technology can be part of a job-retention strategy to ensure quality IT (and related) resources stick around. An approach we see with some of our clients is to offer training to some of the technology leads and have them cross-train their colleagues informally.
Achieving the Right Mindset for Big Data
Big Data, for many organizations, is still in its pioneering phase, where an “outside of the box” thinker in all three roles (Developer, Admin and Scientist) is important. An inquisitive personality or one who likes to “tinker” is ideal: we’re forging a path for the company into a new technology domain with a unique value proposition. We need resources that are willing to dabble, try/fail, and try again! Who within your organization created their own mini cluster at home and downloaded the Apache Hadoop framework to learn? Who within the organization questions the information and data at every turn and isn’t satisfied with the data and analysis traditional IT provides? These mindsets will help ensure your initial Big Data project or POC is valuable, challenging, and rewarding.