Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.
Hadoop operates on data sets orders of magnitude larger than many existing systems work with: hundreds of gigabytes constitute the low end of Hadoop scale, and it is built to process “web-scale” data running into terabytes or even petabytes. At this scale, the input data set likely will not fit on a single computer’s hard drive, much less in memory. Hadoop therefore includes a distributed file system that breaks up the input data and distributes fractions of it to several machines in the cluster. The data can then be processed in parallel using all of the machines in the cluster, and output results computed as efficiently as possible. Many big players in the IT industry are using Hadoop to solve their problems. Let’s take a look at some of the Hadoop success stories and how they were implemented.
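The splitting performed by the distributed file system can be illustrated with a toy simulation. The block size, node names, and round-robin placement below are simplifying assumptions; real HDFS also replicates each block (three copies by default) and places replicas rack-aware:

```python
# Toy sketch of how a distributed file system splits a large input file
# into fixed-size blocks and spreads them across cluster machines.
# 128 MB is a common HDFS block size; the placement policy here is a
# deliberately naive round-robin stand-in for HDFS's real scheduler.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def split_into_blocks(file_size, nodes):
    """Assign each block of a file to a cluster node, round-robin."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    return {i: nodes[i % len(nodes)] for i in range(num_blocks)}

# A 1 TB input file spread over a three-node cluster:
placement = split_into_blocks(file_size=1024**4,
                              nodes=["node-a", "node-b", "node-c"])
print(len(placement))  # 8192 blocks of 128 MB each
```

Because each node holds only a fraction of the blocks, every machine can work on its local slice of the data simultaneously.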
With tens of millions of users and more than a billion page views every day, Facebook accumulates massive amounts of data – far more than a typical organization, especially given the volume of media its users upload and consume. One of the challenges Facebook has faced since its early days is developing a scalable way of storing and processing all these bytes, since this historical data is a very big part of how the user experience on Facebook can be improved.
Years ago, Facebook began playing around with the idea of implementing Hadoop to handle their massive data consumption and aggregation. Their hesitant first steps of importing some interesting data sets into a relatively small Hadoop cluster were quickly rewarded as developers latched on to the map-reduce programming model and started processing data sets that were previously impossible due to their massive computational requirements. Some of these early projects have matured into publicly released features like the Facebook Lexicon, or are being used in the background to improve user experience on Facebook.
Facebook now has multiple Hadoop clusters deployed – the biggest having about 2,500 CPU cores and 1 petabyte of disk space. Facebook loads over 250 gigabytes of compressed data (over 2 terabytes uncompressed) into the Hadoop file system every day and has hundreds of jobs running daily against these data sets. The list of projects using this infrastructure has proliferated – from those generating mundane statistics about site usage, to others used to fight spam and determine application quality. A remarkably large portion of Facebook’s engineers have run Hadoop jobs at some point.
The rapid adoption of Hadoop at Facebook has been aided by a couple of key decisions. First, developers are free to write map-reduce programs in the language of their choice. Second, Facebook has embraced SQL as a familiar paradigm for addressing and operating on large data sets. Most data stored in Hadoop’s file system is published as tables, and developers can explore their schemas and data much as they would with a traditional database. When they want to operate on these data sets, they can use a small subset of SQL to specify the required data. Operations on data sets can be written as map and reduce scripts, using standard query operators like joins and group-bys, or as a mix of the two. Over time, Facebook has added classic data warehouse features like partitioning, sampling and indexing to this environment. This in-house data warehousing layer over Hadoop is called Hive, and they are looking forward to releasing an open source version of the project in the near future.
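The decomposition of a SQL group-by into map and reduce phases described above can be sketched in plain Python. This is a toy in-memory simulation of the idea, not Hive's actual execution engine, and the sample rows are invented:

```python
from collections import defaultdict

# Toy simulation of how "SELECT country, COUNT(*) FROM users GROUP BY country"
# decomposes into a map phase and a reduce phase.
rows = [{"user": "a", "country": "US"},
        {"user": "b", "country": "IN"},
        {"user": "c", "country": "US"}]

def map_phase(rows):
    # Emit a (key, 1) pair per row; the key is the group-by column.
    for row in rows:
        yield (row["country"], 1)

def reduce_phase(pairs):
    # Sum the values for each key, as a reducer does for its partition.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

print(reduce_phase(map_phase(rows)))  # {'US': 2, 'IN': 1}
```

In a real cluster the map outputs are shuffled so that all pairs sharing a key reach the same reducer, which is what lets the aggregation run in parallel across machines.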
At Facebook, it is incredibly important that they use the information generated by and from their users to make decisions about improvements to the product. Hadoop has enabled them to make better use of the data at their disposal.
Like their arch-rival Amazon.com, the soon-to-split eBay corporation is something of an oddity in that it hasn’t historically been a big contributor to the open-source community. But the e-commerce pioneer hopes to change that with the release of the source-code for a homegrown online analytics processing (OLAP) engine that promises to speed up Hadoop while also making it more accessible to everyday enterprise users.
Dubbed Kylin, the platform was developed after eBay failed to find a solution to help it effectively address the rapid growth in the volume and diversity of data generated by its customers, a story that is familiar to other contributors to the Hadoop community. Kylin optimizes the storage of information by leveraging existing technologies whenever possible from the upstream component ecosystem.
By default, data is stored in Apache Hive, which layers a familiar SQL interface on top of Hadoop and allows business workers to harness the system’s distributed analytics capabilities without having to learn the nuances of the native MapReduce execution paradigm. When Kylin comes across certain repetitions in the rows and columns of a data set – such as a particular product appearing multiple times with different prices – it maps that data into key-value pairs, which are then whisked off to Apache HBase, another component designed with that specific type of workload in mind.
Specifically, HBase provides random access to information, which Kylin exploits to avoid sequentially scanning tens or hundreds of billions of rows in Hive whenever an eBay employee looks up a certain business detail. That has significantly improved response times at the company, with eBay claiming that the technology handles certain queries in less than a second, allowing truly interactive analytics.
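The pre-aggregation idea behind those sub-second lookups can be illustrated with a toy sketch. The dimensions, measures, and sample data here are hypothetical; real Kylin pre-computes full OLAP cuboids across many dimension combinations and persists them in the key-value store:

```python
from collections import defaultdict

# Toy sketch of OLAP pre-aggregation: fact rows are rolled up once into
# key-value pairs, so later lookups are single key fetches instead of a
# scan over billions of rows. Data is illustrative.
sales = [("widget", "2014-10", 9.99),
         ("widget", "2014-10", 8.49),
         ("gadget", "2014-11", 19.00)]

cube = defaultdict(lambda: {"count": 0, "revenue": 0.0})
for product, month, price in sales:
    cell = cube[(product, month)]   # key = tuple of dimension values
    cell["count"] += 1              # pre-computed measures
    cell["revenue"] += price

# "Total widget revenue in 2014-10" is now a single key lookup:
print(round(cube[("widget", "2014-10")]["revenue"], 2))  # 18.48
```

The cost of scanning the raw rows is paid once at cube-build time; every subsequent query against those dimensions is a constant-time fetch.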
Topping off that performance advantage are a number of complementary features, such as integration with popular business intelligence tools like Tableau Inc.’s wildly popular data visualization platform, storage compression and monitoring. Future versions of Kylin will also add support for more processing paradigms, eBay promises, including multidimensional and hybrid OLAP.
Oracle customers are facing a big data problem, and Hadoop has become the answer – reluctant as Oracle is to admit it. Speaking at the Oracle product and strategy update in London, Oracle president Mark Hurd said that the company’s customers are growing their data up to 40% a year, putting tremendous pressure on IT budgets.
“Growth of 40% in data, with customers who spend $10,000 a terabyte to house the data; most of our customers spend 10% of their IT budgets on storage. And if you take those three numbers and put them together, you’re going to grow your IT budgets 3-5% just housing the data,” said Hurd.
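Hurd's three numbers combine as a simple back-of-the-envelope calculation (this reading assumes the figures quoted are 40% annual data growth and a 10% storage share of the IT budget):

```python
# Back-of-the-envelope version of the quoted numbers: if storage takes
# 10% of the IT budget and the data it houses grows 40% a year, the
# overall budget must grow roughly 10% x 40% = 4% a year just to keep
# storing the data - inside the 3-5% range Hurd cites.
storage_share = 0.10   # fraction of the IT budget spent on storage
data_growth = 0.40     # annual growth in data volume

budget_growth = storage_share * data_growth
print(f"{budget_growth:.0%}")  # 4%
```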
Oracle offers a range of products to help customers shrink their data, such as Oracle Exadata Database Machine, Oracle Exalogic Elastic Cloud and Oracle Exalytics, said Hurd.
“If you’ve got a terabyte of data, we in many cases can shrink it 10x – turn a terabyte into 100GB – and work on ways they can consolidate storage, servers, databases, shrink their footprints, get their sustainability levels up, while at the same time innovating.”
However, when it comes to dealing with real Big Data problems, Oracle still relies on Hadoop, the open-source software framework licensed under the Apache v2 license. Back in January 2012, Oracle announced a joint agreement with Cloudera – a distributor of Hadoop-based software and services for the enterprise – to provide an Apache Hadoop distribution and tools for Oracle Big Data Appliance.
Oracle Big Data Appliance is an engineered system of hardware and software that incorporates Cloudera’s Distribution Including Apache Hadoop (CDH) and Cloudera Manager, as well as an open source distribution of R, Oracle NoSQL Database Community Edition, Oracle HotSpot Java Virtual Machine and Oracle Linux running on Oracle’s Sun servers. Together with Oracle’s other database offerings, the company claims that Oracle Big Data Appliance offers everything customers need to acquire, organise and analyse Big Data within the context of all their enterprise data.
Speaking to Techworld at IP Expo earlier in the day, Doug Cutting, creator of Hadoop and chief architect at Cloudera, said that the company works with Oracle in many markets, as well as with other big technology companies like IBM and Microsoft.
Salesforce.com is the premier cloud computing service provider for the enterprise. It provides several popular services such as Sales, Service, Marketing, Force.com, Chatter, Desk, and Work to over 130,000 customers and millions of users. These services generate over a billion transactions per day, accessed through multiple channels – API, web and mobile. Gathering events (or clickstream, or interactions) in a central location is one of the key advantages of being a cloud provider. This event data is extremely useful for internal and product use cases, which in turn enables users to have a better overall experience with Salesforce.com services.
Events are gathered through application instrumentation and logging. For each logged event, Salesforce collects interesting information about the interaction – organizationId, userId, API/URI/mobile details, IP address, response time, and other details. From an internal perspective, this provides the ability to troubleshoot performance problems, detect application errors, and measure usage and adoption of the Salesforce code base, as well as of custom applications built on its platform. Many internal users at Salesforce use logs on a daily basis; key user groups are the R&D, Technical Operations, Product Support, Security, and Data Science teams. Salesforce uses a combination of log-mining tools: Splunk for operational use cases, and Hadoop for analytic use cases. Salesforce has used Hadoop successfully for internal as well as product use cases. Internal examples include product metrics and capacity planning; product examples include Chatter file, user recommendations, and search relevancy.
Product metrics are important for product managers to understand usage and adoption of their features. They also give the executive leadership team the ability to make decisions based on trends. Not surprisingly, it is often as important to kill unpopular features as it is to invest in popular ones. With product metrics, the goal was to define features and their log instrumentation, establish a standard set of metrics, and then measure and visualize them. Salesforce used Custom Objects on the Force.com platform to record feature definitions and log instrumentation. While they use Pig extensively to mine through the logs on an ad hoc basis, they didn’t think it worthwhile or productive for every product manager to write their own Pig scripts, so Salesforce wrote a custom Java program to auto-generate Pig scripts from the pre-defined feature instrumentation. These scripts ran on the Hadoop cluster, aggregating data and storing the daily summaries in another Custom Object. Finally, Salesforce used Salesforce.com Reports and Dashboards to visualize this data in a self-service way for PMs and executives.
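The auto-generation step can be sketched as a simple template fill. Salesforce's actual generator was written in Java and its log schema is not public, so the field names, paths, and Pig snippet below are purely illustrative:

```python
# Toy sketch of generating a daily-rollup Pig script from a feature
# definition, in the spirit of Salesforce's auto-generated scripts.
# The log schema and paths here are hypothetical.
PIG_TEMPLATE = """\
logs = LOAD '{log_path}' USING PigStorage('\\t')
       AS (day:chararray, feature:chararray, user_id:chararray);
{feature} = FILTER logs BY feature == '{feature}';
by_day = GROUP {feature} BY day;
daily = FOREACH by_day GENERATE group AS day, COUNT({feature}) AS events;
STORE daily INTO '{out_path}';
"""

def generate_pig_script(feature, log_path, out_path):
    """Fill the rollup template for one feature definition."""
    return PIG_TEMPLATE.format(feature=feature,
                               log_path=log_path, out_path=out_path)

script = generate_pig_script("chatter_file", "/logs/app",
                             "/metrics/chatter_file")
print(script)
```

With a generator like this, each product manager only supplies a feature definition; the scheduling, aggregation, and storage logic stays in one maintained template.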
Hadoop and other associated big data technologies are important to their success. Salesforce.com is active in the open source community with many contributions to Pig and HBase.
Hadoop is a top-level Apache project whose development has been driven in large part by Yahoo!. It relies on an active community of contributors from all over the world for its success. With a significant technology investment by Yahoo!, Apache Hadoop has become an enterprise-ready cloud computing technology, and it is becoming the de facto industry framework for big data processing.
The Hadoop project is an integral part of the Yahoo! cloud infrastructure – and is at the heart of many of Yahoo!’s important business processes. Yahoo! runs the world’s largest Hadoop clusters and works with academic institutions and other large corporations on advanced cloud computing research, and its engineers are leading participants in the Hadoop community.
Hadoop with security is a significant update to Apache Hadoop. This update integrates Hadoop with Kerberos, a mature open source authentication standard.
Hadoop with security:
- Prevents unauthorized access to data on Hadoop clusters
- Authenticates users sharing business sensitive data
- Reduces operational costs by consolidating Hadoop clusters
- Collocates data for new classes of applications
Oozie, Yahoo!’s workflow engine for Hadoop, is an open-source workflow solution for managing and coordinating jobs running on Hadoop, including HDFS, Pig and MapReduce jobs.
Oozie was designed for Yahoo!’s complex workflows and data pipelines at global scale. It is integrated with the Yahoo! Distribution of Hadoop with security and is a primary mechanism to manage complex data analysis workloads across Yahoo!.
These success stories have been prime examples of why Hadoop has been gaining popularity as a big data solution. Although these companies are obviously big data champions, there are plenty of other use cases in which any enterprise can take advantage of big data platforms.
If you would like to discuss your Big Data needs with Aptude, please do not hesitate to use the contact us link below.