The amount of data produced across the world is increasing exponentially and is currently doubling in size roughly every two years. It is estimated that by the year 2020, the data available will reach 44 zettabytes (44 trillion gigabytes). The practice of processing volumes of data too large for traditional methods has become known as Big Data, and although the term only gained popularity in recent years, the concept has been around for over a decade.
To address this explosion of data growth, various Big Data platforms have been conceived to help manage and structure it. There are currently around 150 different NoSQL solutions: non-relational, database-driven platforms that are often associated with Big Data, although not all of them qualify as Big Data solutions. While this may seem like a considerable number of choices, many of these technologies are used in conjunction with others, are tailored for niche markets, or are in their infancy with low adoption rates.
Of these many platforms, two in particular have become increasingly popular choices: Hadoop and MongoDB. While the two solutions share many traits (open source, schema-less design, MapReduce support, NoSQL), their approaches to processing and storing data are quite different.
The CAP Theorem (also known as Brewer's Theorem), conceived in 1999 by Eric Brewer, states that a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition tolerance. The theorem is a useful lens for Big Data platforms, as it highlights the trade-offs any solution must make: only two of the three properties can be fully attained by one system. This does not mean the third property is entirely absent, but rather that it will not be as prevalent in the platform. So when the CAP Theorem's "pick two" rule is invoked, the choice is really about which two properties the platform will be most capable of handling.
Traditional RDBMS solutions provide consistency and availability, but fall short on partition tolerance. Big Data solutions typically lean towards supporting partition tolerance and consistency, or availability and partition tolerance.
The two Big Data solutions being compared in this article, Hadoop and MongoDB, both excel at providing partition tolerance and consistency, but do not perform as well as an RDBMS when it comes to data availability.
MongoDB was originally developed by the company 10gen in 2007 as part of a cloud-based app engine intended to run assorted software and services. 10gen developed two main components: Babble (the app engine) and MongoDB (the database). The idea didn't take off, leading 10gen to scrap the application and release MongoDB as an open-source project. As open-source software, MongoDB flourished, garnering support from a growing community and various enhancements that helped improve and integrate the platform. While MongoDB can certainly be considered a Big Data solution, it's worth noting that it's really a general-purpose platform, designed to replace or enhance existing RDBMS systems, which gives it a healthy variety of use cases.
In contrast, Hadoop was an open-source project from the start. Created by Doug Cutting (known for his work on Apache Lucene, a popular search indexing platform), Hadoop stemmed from Nutch, an open-source web crawler created in 2002. Over the next few years, Nutch followed closely at the heels of various Google projects: in 2003, when Google described its Google File System (GFS), Nutch implemented its own version, called NDFS. In 2004, Google introduced the concept of MapReduce, and Nutch announced adoption of the MapReduce architecture shortly after, in 2005. It wasn't until 2007 that Hadoop was officially released. Carrying over concepts from Nutch, Hadoop became a platform for processing massive amounts of data in parallel across clusters of commodity hardware. Hadoop has a specific purpose: it is not meant as a replacement for transactional RDBMS systems, but rather as a supplement to them, as a replacement for archiving systems, or for a handful of other use cases.
How They Work
Traditional Relational Database Management Systems (RDBMS) are modeled around schemas and tables, organizing and structuring data in columns and rows. Most current systems are RDBMS, and that is likely to remain the case for the foreseeable future. For many companies, RDBMS solutions are suitable, but they are not appropriate for every use case: these systems often run into scalability and data-replication bottlenecks when handling large data sets.
As a document-oriented database management system, MongoDB stores data in collections, where related data fields can be retrieved in a single query, versus the multiple queries required by an RDBMS that allocates data across multiple tables in columns and rows. Data is stored as Binary JSON (BSON) and is readily available for ad-hoc queries, indexing, replication, and MapReduce aggregation. Database sharding can also be applied to distribute data across multiple systems for horizontal scalability as needed. MongoDB is written in C++ and can be deployed on Windows or Linux machines, but when considering MongoDB for real-time, low-latency projects, Linux is the ideal choice for the sake of efficiency. A primary difference between MongoDB and Hadoop is that MongoDB is an actual database, while Hadoop is a collection of software components that together form a data-processing framework.
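To illustrate the document model, here is a minimal, hypothetical sketch using plain Python dicts (not MongoDB's actual query API): an order and its line items live together in one document, so a single lookup returns data that an RDBMS would assemble from several joined tables. The collection and field names are invented for the example.

```python
# Illustrative only: a BSON-like document nests related data together,
# so one query retrieves what an RDBMS would assemble via joins.

# Hypothetical "orders" collection: each document nests its line items.
orders = [
    {"_id": 1, "customer": "acme", "items": [
        {"sku": "A100", "qty": 2},
        {"sku": "B200", "qty": 1},
    ]},
    {"_id": 2, "customer": "globex", "items": [
        {"sku": "A100", "qty": 5},
    ]},
]

def find_one(collection, **criteria):
    """Return the first document matching all field criteria (toy query)."""
    for doc in collection:
        if all(doc.get(k) == v for k, v in criteria.items()):
            return doc
    return None

order = find_one(orders, customer="acme")
# One lookup returns the order and its nested line items together.
print(order["_id"], [item["sku"] for item in order["items"]])
```

The design point is that the document's shape matches how the application reads the data, which is what makes ad-hoc, single-query retrieval natural in a document store.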
Hadoop, as previously mentioned, is a framework composed of a software ecosystem. Its primary components are the Hadoop Distributed File System (HDFS) and MapReduce, both written in Java. Secondary components are a collection of other Apache products, including Hive (for querying data), Pig (for analyzing large data sets), HBase (a column-oriented database), Oozie (for scheduling Hadoop jobs), Sqoop (for interfacing with other systems such as BI, analytics, or RDBMS platforms), and Flume (for aggregating and preprocessing data). Like MongoDB, Hadoop's HBase database achieves horizontal scalability through database sharding. Hadoop is designed to run on clusters of commodity hardware, with the ability to consume data in any format, including aggregated data from multiple sources. Data storage is distributed by HDFS, with an optional data structure implemented via HBase, which allocates data into columns (versus the two-dimensional allocation of an RDBMS in columns and rows). Data can then be indexed (using software such as Solr), queried with Hive, or processed with various analytics or batch jobs using tools from the Hadoop ecosystem or your choice of business intelligence platform.
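The MapReduce model at the heart of Hadoop can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce phases of a word-count job, not Hadoop's actual Java API; in a real cluster the framework runs the map and reduce functions in parallel across nodes and handles the shuffle between them.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit one (key, value) pair per word.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: combine all values for a key into a single result.
    return key, sum(values)

lines = ["big data big clusters", "data across clusters"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'clusters': 2, 'across': 1}
```

Because each map call depends only on its own input line, and each reduce call only on one key's values, both phases parallelize naturally across a cluster, which is what makes the model a fit for commodity-hardware scale-out.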
Horizontal & Vertical Scalability
Vertical scaling (enhancing server hardware, such as adding CPU or RAM or switching to solid-state drives) is a common solution for RDBMS bottlenecks. It has obvious drawbacks, including cost, potential downtime, and the resources required to achieve the desired result. Scalable cloud-based services help offset some of these concerns, but often at a price. Horizontal scaling is the addition of more system nodes (such as adding servers to a cluster), which is very difficult to achieve with an RDBMS but ideal for NoSQL platforms such as MongoDB or Hadoop. Not only do these platforms achieve horizontal scalability from a hardware perspective, they also perform database sharding, horizontally partitioning data into separate instances to help distribute the load.
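The partitioning idea behind sharding can be shown with a toy hash-based router in Python. This is a simplified sketch, not how MongoDB or HBase actually place data (both use chunk/region splitting with balancers); the shard names are hypothetical.

```python
import hashlib

# Illustrative hash-based sharding: route each record to a shard by its key.
SHARDS = ["shard-0", "shard-1", "shard-2"]  # hypothetical node names

def shard_for(key, shards=SHARDS):
    """Pick a shard deterministically from a record's shard key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

# The same key always lands on the same shard, so reads can be routed too.
placement = {user: shard_for(user) for user in ["alice", "bob", "carol"]}
print(placement)
```

Note that this naive modulo scheme would remap most keys whenever a shard is added, which is why production systems instead split and migrate ranges or chunks of the key space incrementally.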
Many large, well-known organizations use both MongoDB and Hadoop. Comparing the two in terms of popularity as a Big Data solution is difficult, because MongoDB is the more general-purpose platform and does not cater specifically to the Big Data crowd.
Currently, MongoDB is the most popular NoSQL platform and seems to be on track to overtake PostgreSQL as the fourth most popular database. Keep in mind that such popularity rankings are compiled from online trends and internet "buzz"; they do not necessarily indicate higher adoption rates. Looking through the various lists of MongoDB-powered applications and websites, it's apparent that most of them use it as a readily available, real-time data solution: real-time analytics, data availability for users on production websites, and serving up content and web pages. What seems to be lacking in MongoDB use cases are examples of long-running ETL jobs or batch jobs on complex sets of aggregated data.
Hadoop seems to have lower adoption rates than MongoDB overall as a platform; however, a 2012 report indicates that it was the most popular choice for Big Data solutions the previous year. Since Hadoop is meant to operate in an ecosystem of various connectors, databases, and platforms (it is rather limited on its own), its standing in the same popularity rankings may be more accurately gauged by combining associated technologies such as HBase, Hive, and Impala. Doing so would place it in the top ten, although winning a popularity contest doesn't necessarily make any product the best choice. While MongoDB is more popular for real-time data needs, Hadoop is the more common choice for batch jobs and long-running operations and analytics, likely due to its strengths in handling high-volume, complex aggregated data feeds. It's worth noting that real-time data analysis and ad-hoc queries are available on Hadoop through other systems such as Apache Storm or Apache Hive.
Platform Strengths for Big Data Use Cases
Both platforms share some of the same strengths versus a traditional RDBMS: scalability, parallel processing, a MapReduce architecture, and the ability to handle large amounts of aggregated data. As open-source software, both are considerably more economical than platforms that require licenses, and since both are architected to process data across clusters of commodity hardware, there are considerable hardware savings as well.
When compared to Hadoop, MongoDB's greatest strength is that it is a more general-purpose solution, offering far more flexibility, including the potential replacement of an existing RDBMS. MongoDB is also inherently better at handling real-time data analytics. Its readily available data makes it capable of client-side data delivery (in a typical client-server, web-based scenario), which is not as common with Hadoop configurations. Another strength of MongoDB is its geospatial indexing abilities, making it an ideal choice for real-time geospatial analysis.
Hadoop, on the other hand, excels at batch processing and long-running ETL jobs and analysis. The biggest strength of Hadoop as a Big Data solution is that it was built for Big Data, whereas MongoDB became an option over time. An excellent use case for Hadoop is processing log files, which are typically very large and accumulate quickly. While Hadoop may not handle real-time data as well as MongoDB, ad-hoc, SQL-like queries can be run with Hive, which is touted as more effective as a query language than JSON/BSON. Hadoop's MapReduce implementation is also much more efficient than MongoDB's, making it an ideal choice for analyzing massive amounts of data. Finally, Hadoop accepts data in just about any format, which eliminates much of the data transformation otherwise involved in data processing.
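As a sketch of the log-processing use case, the batch job below (plain Python standing in for a Hadoop job) aggregates HTTP status codes from web-server log lines. The field position of the status code is an assumption based on the common Apache log layout, and the sample lines are invented.

```python
from collections import Counter

# Toy batch job: count HTTP status codes in web-server log lines.
# Assumes the status code is the second-to-last whitespace-separated
# field, as in the common Apache access-log format.
log_lines = [
    '10.0.0.1 - - [01/Jan/2014] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2014] "GET /missing HTTP/1.1" 404 128',
    '10.0.0.1 - - [01/Jan/2014] "POST /api HTTP/1.1" 200 64',
]

def status_counts(lines):
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[-2]] += 1
    return counts

print(status_counts(log_lines))
```

At Hadoop scale the same counting logic would run as the map and reduce steps of a job over log files stored in HDFS, with each node processing its local slice of the data.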
Platform Weaknesses for Big Data Use Cases
Similarities also arise when comparing the weaknesses of these two platforms, including potential fault-tolerance issues, security concerns, and data-quality concerns. Some, if not all, of these issues depend on various factors, such as the environment being integrated with or the technical approach and implementation taken. One primary weakness shared by Hadoop and MongoDB versus a traditional RDBMS is their lack of ACID compliance, the guarantee that database transactions are processed reliably. Both Hadoop and MongoDB have taken heavy criticism and scrutiny over the years, in part due to their relative immaturity and their different approaches to handling data processing.
MongoDB is the subject of the most criticism because it tries to be so many different things, although it seems to attract just as much approval. A major complaint about MongoDB concerns fault-tolerance issues, which can cause data loss. A MongoDB employee was quick to refute these claims, but many instances of such issues can be found online. Additional complaints against MongoDB include write-lock constraints, data-aggregation issues, poor integration with RDBMS platforms, and more. MongoDB can also only consume data in CSV or JSON formats, which may require additional data transformation.
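That transformation step is usually straightforward. As a minimal sketch (field names and the pipe delimiter are invented for the example), a feed in some other delimited layout can be converted into JSON documents suitable for a JSON-based import:

```python
import csv
import io
import json

# Toy transformation: convert pipe-delimited records into JSON documents.
# The layout and field names here are purely illustrative.
raw = "id|name|score\n1|alice|9.5\n2|bob|7.0\n"

def to_json_docs(text, delimiter="|"):
    """Parse delimited text and emit one JSON document per record."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [json.dumps(row) for row in reader]

docs = to_json_docs(raw)
print(docs[0])  # {"id": "1", "name": "alice", "score": "9.5"}
```

Note that all values arrive as strings; a real pipeline would also cast numeric fields before import, which is exactly the kind of extra transformation work the paragraph above refers to.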
Hadoop’s primary issue used to be the NameNode, a single point of failure for HDFS clusters: if the NameNode fails, so does the rest of the system. This issue has been addressed with the release of HDFS High Availability (HA), which allows two redundant NameNodes to be configured so that the system fails over to the duplicate should a problem arise. Other concerns with Hadoop are the unpredictable amount of time it takes to complete data-processing jobs and its inability to manage job priorities.
As Hadoop and MongoDB mature, many of these issues will be addressed over time through updates or new software introduced into their ecosystems. RDBMS solutions are also advancing, as are other NoSQL platforms. As strides are made in hardware capabilities as well, it will be interesting to see how all these platforms evolve and adjust to meet growing technology and data demands.
The various topics discussed in this comparison of Hadoop and MongoDB as Big Data solutions make it apparent that a great deal of research and consideration needs to take place before deciding which is the best option for your organization. If you have requirements for processing low-latency, real-time data, or are looking for a more encompassing solution (such as replacing your RDBMS or starting an entirely new transactional system), MongoDB may be a good choice. If you are looking for a solution for batch, long-running analytics while still being able to query data as needed, then Hadoop is the better option. Depending on the volume and velocity of your data, note that Hadoop is known to handle larger-scale solutions, which should be taken into account for scalability and expandability. Either way, both can be excellent options for a scalable solution that processes large amounts of complex data more efficiently than a traditional RDBMS. In fact, it is also possible to use the two systems in tandem, an approach even recommended by the people at MongoDB: delegating real-time data tasks to MongoDB and batch data processing to Hadoop lets each platform offset the other's weaknesses.
The case for both of these platforms as a Big Data solution has been presented, but another important factor to consider is how these solutions are implemented and integrated into your environment. Improper configuration can be disastrous to the reliability and sustainability of either platform and the data it processes, so choose carefully when deciding whom to partner with for your next Big Data project.