When considering a data science project, the names which get thrown around can feel a bit like word salad. Kafka? Spark? Keras? What are all these words, and why are they so important to data science?
In this blog, we’ll present the why of programming frameworks and then introduce you to 31 programming frameworks and interfaces which are often used in data projects. Finally, we’ll show you how you can get help from us if you’re interested in learning more about how to explore data science at your organization.
What is a Programming Framework?
But first, it’s important to understand what a programming framework is. Most people probably have a basic understanding that programming involves writing lines of code. However, writing lines of code from scratch for each and every project is tedious. Frameworks and libraries shorten the creative process and allow programmers to take advantage of tried-and-true programmatic solutions to common problems.
This is especially true for data science, where the problems to be solved are too impactful to leave to chance… and too complex to tackle from scratch every time.
Frameworks and libraries are essentially starting blocks for creating code; these code blocks have been built, tested, and optimized by a community.
Three Benefits of Frameworks for Data Scientists
Frameworks offer many benefits to data scientists and the technology teams they work in.
- Frameworks Create Better Code. Frameworks help coders create better design patterns and avoid duplicate or insecure code. The resulting code is easier to write, easier to test, and easier to debug.
- Frameworks Are Pre-Tested and Pre-Optimized. Data science teams can save themselves time by using pre-tested and pre-optimized code rather than starting from scratch.
- Faster Implementation. The implementation runway is shorter when teams use code that’s been heavily documented, tested, and optimized. Teams can spend less time designing and testing and more time analyzing and optimizing the models.
Next, we’ll introduce you to 31 common data science frameworks (and interfaces) that you’ll hear about in the data science world.
31 Data Science Frameworks and Interfaces
1. Apache Kafka
Apache Kafka is an open-source, scalable messaging platform written in Java and Scala and originally created at LinkedIn. As a streaming platform (an “ingestion backbone”), it’s capable of handling trillions of events a day in real time. Kafka is used as a data science framework in projects that require accessing and handling very large amounts of real-time data.
Learn more about Apache Kafka at https://kafka.apache.org/.
2. AWS Deep Learning AMI
While not a framework per se, AWS Deep Learning AMI is a tool that allows data scientists to work faster and better. According to Amazon, “The AWS Deep Learning AMIs provide machine learning practitioners and researchers with the infrastructure and tools to accelerate deep learning in the cloud, at any scale.” At the time of this writing, the AWS DL environments come pre-configured with TensorFlow, PyTorch, Apache MXNet, Chainer, Microsoft Cognitive Toolkit, Gluon, Horovod, and Keras.
Learn more about AWS Deep Learning AMI at https://aws.amazon.com/machine-learning/amis/.
3. Bokeh
Bokeh is an open-source Python data visualization library used to create interactive, scalable visualizations inside browsers. With Bokeh, the interactivity is the important part, and the reason data scientists love using it for visualizations. Bokeh is built in layers, starting with figures, then elements, and finally glyphs. After that, “inspectors” can be added to enable user interaction.
Learn more about Bokeh at https://bokeh.org/.
4. Caffe
Caffe (now Caffe2, part of PyTorch) is “a deep learning framework made with expression, speed, and modularity in mind” that’s written in C++. Caffe comes with pre-configured training modules, making it a great framework for beginners new to machine learning. Caffe stores and manipulates data in “blobs”: a blob is a standard array and unified memory interface whose properties describe how information is stored and communicated across the layers of a neural network. Data scientists exploring Caffe often also evaluate TensorFlow, Theano, Veles, and the Microsoft Cognitive Toolkit.
Learn more about Caffe at https://caffe2.ai/.
5. Chainer
Chainer is an open-source neural network Python framework created by a machine learning and robotics startup in Tokyo. Chainer is known for its speed, especially compared to more “sophisticated” frameworks like TensorFlow. Chainer was the first to provide the “define-by-run” neural network definition, which allows for dynamic changes in the neural network (a benefit when it comes to debugging). It also supports CUDA computation and is inspectable using standard Python tools.
Learn more about Chainer at https://chainer.org/.
6. Eclipse DeepLearning4j
Eclipse DeepLearning4j is “the first commercial-grade, open-source, distributed deep-learning library written for Java and Scala”. Because it’s distributed, it can take advantage of multiple CPUs and GPUs to accelerate training. It’s compatible with any JVM language, such as Scala, Clojure, and Kotlin, and works with Spark and Hadoop. With DeepLearning4j, you can compose deep neural nets from shallow nets, each of which forms a layer.
Learn more about DeepLearning4j at https://deeplearning4j.org/.
7. Fastai
Fastai is a deep learning library developed by Jeremy Howard and Rachel Thomas using Python. According to the documentation, Fastai is “a deep learning library which provides practitioners with high-level components that can quickly and easily provide state-of-the-art results in standard deep learning domains, and provides researchers with low-level components that can be mixed and matched to build new approaches.” The Fastai team aims to democratize artificial intelligence and deep learning, and has thus made training for the framework free and open-source.
Learn more at https://www.fast.ai/.
8. Gluon
Gluon is an open-source deep learning interface from Microsoft and Amazon. The interface allows machine learning developers to quickly develop models without compromising on performance by using pre-built neural network components. This means faster prototyping and training.
Learn more about Gluon at https://gluon.mxnet.io/.
9. H2O
H2O is an open-source, enterprise-ready platform (one of many by the same group) which serves business use cases in over 20,000 organizations globally. H2O models can be built using commonly used languages like Python and R. It also has “AutoML”, which can automate the machine learning process within user-specified limits. And because it’s distributed, it can support extremely large datasets while maintaining speed, making it well suited for enterprise applications.
Learn more about H2O at https://www.h2o.ai/.
10. Horovod
Horovod is a free and open-source software framework for distributed deep learning training using TensorFlow, Keras, PyTorch, and Apache MXNet. It was developed by the machine learning engineering team at Uber as part of its Michelangelo platform as a better way to train their distributed TensorFlow models.
Learn more about Horovod at https://github.com/horovod/horovod.
11. Jupyter Notebook
Jupyter Notebook is an open-source, web-based interface for data science, scientific computing, and machine learning workflows. In it, you can create and share documents that contain live code, equations, visualizations and narrative text. Jupyter Notebook supports over 40 programming languages, including Python, R, Julia, and Scala.
Learn more about Jupyter Notebook at https://jupyter.org/.
12. Keras
Keras is an open-source data science library that provides a Python interface for artificial neural networks. As of version 2.4, it serves as an interface for the TensorFlow library; previous versions supported TensorFlow, Microsoft Cognitive Toolkit, R, Theano, PlaidML, and more. It supports neural-network building blocks such as layers, objectives, activation functions, and optimizers.
Learn more about Keras at https://keras.io.
13. LightGBM
LightGBM is a “gradient-boosting framework” that uses tree-based machine learning algorithms. Its histogram-based algorithm places continuous values into discrete bins, which leads to faster training and more efficient memory usage. According to the docs, LightGBM gives data scientists faster training speed and higher efficiency, lower memory usage, better accuracy, and support for parallel and GPU learning. It also supports the handling of large-scale data. It’s used for ranking, classification, and other machine learning tasks.
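The histogram trick described above can be sketched in plain NumPy. This is an illustration of the binning concept only, not the LightGBM API: continuous feature values are mapped to a small number of integer bin IDs, so a tree's split search scans a fixed number of bins instead of every distinct value.

```python
import numpy as np

# A synthetic continuous feature (10,000 samples, fixed seed).
rng = np.random.default_rng(0)
feature = rng.normal(size=10_000)

# Build 16 quantile-based bins: 15 interior edges split the data
# into roughly equal-sized buckets, as a histogram-based learner would.
n_bins = 16
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])

# Each raw value is replaced by a small integer bin index (0..15).
bin_ids = np.digitize(feature, edges)

print(bin_ids.min(), bin_ids.max())  # 0 15
```

A split search over this feature now only needs to evaluate 15 candidate thresholds (the bin boundaries), regardless of how many distinct raw values exist.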
Learn more about LightGBM at https://github.com/microsoft/LightGBM.
14. Matplotlib
Matplotlib is a comprehensive, popular, and open-source Python library for creating “publication quality” visualizations. Visualizations can be static, animated, or interactive. Its pyplot interface was modeled on MATLAB, so it offers MATLAB-style global plotting functions alongside a full object hierarchy of figures and axes.
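A minimal sketch of Matplotlib's object-oriented style (the file name, labels, and data are arbitrary choices for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; renders to files
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

# Object-oriented style: create the figure and axes explicitly,
# then call plotting methods on the axes object.
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("amplitude")
ax.legend()

fig.savefig("sine.png", dpi=100)  # write a publication-quality PNG
```

The same chart could be produced with the MATLAB-style `plt.plot(...)` global calls; the explicit figure/axes objects simply make multi-panel layouts easier to manage.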
Learn more about Matplotlib at https://matplotlib.org/.
15. Microsoft Cognitive Toolkit (formerly known as CNTK)
Microsoft Cognitive Toolkit is an open-source toolkit for commercial-grade distributed deep learning. It was one of the first to support ONNX, an open-source shared model representation for “framework interoperability and shared optimization”. It also works with common data science languages, including Python and C++, to create commercial-grade AI.
Learn more about Microsoft Cognitive Toolkit at https://docs.microsoft.com/en-us/cognitive-toolkit/.
16. Apache MXNet
Apache MXNet is another open-source framework, this time for deep learning. MXNet has deep integration with Python and support for Scala, Julia, Clojure, Java, C++, R, and Perl. One of the main draws of MXNet is the ability to alternate between symbolic programming and imperative programming for maximum productivity. Another draw is the ability to scale and distribute training.
Learn more about MXNet at https://mxnet.apache.org/.
17. NumPy
NumPy (“Numerical Python”) is another Python programming library, this time an array-processing package for numerical and scientific computing. NumPy’s speed-optimized C code provides array objects and vectorized operations that are dramatically faster than equivalent pure-Python lists and loops, making them ideal for data science purposes.
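A small sketch of the vectorized style that makes NumPy fast: one array expression replaces an explicit Python loop, and the work happens in optimized C.

```python
import numpy as np

# A million-element array, built without a Python loop.
values = np.arange(1_000_000)

# Element-wise square of every value in one vectorized call.
squares = values ** 2

print(squares[:4])  # [0 1 4 9]
```

The equivalent list comprehension (`[v * v for v in range(1_000_000)]`) produces the same numbers but executes each multiplication in the Python interpreter, which is where the speed difference comes from.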
Learn more about NumPy at https://numpy.org/.
18. TensorFlow
TensorFlow is an “end-to-end open source machine learning platform” that helps data scientists develop and train machine learning (ML) models. It’s especially useful for efficiently building fast prototypes. Data scientists can work in languages already familiar to them to train and deploy models in the cloud or on-premises.
Learn more about TensorFlow at https://www.tensorflow.org/.
19. Scikit-learn
Scikit-learn is an easy-to-learn, open-source Python library for machine learning built on NumPy, SciPy, and Matplotlib. It can be used for data classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
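A minimal classification sketch using a dataset bundled with scikit-learn (the model choice and split parameters here are arbitrary illustrative choices, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the classic Iris dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Chain preprocessing and a classifier into one estimator object.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

Swapping in a different algorithm (say, a random forest or SVM) means changing one line, which is much of scikit-learn's appeal: every estimator shares the same `fit`/`predict`/`score` interface.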
Learn more about Scikit-learn at https://scikit-learn.org/stable/.
20. ONNX
Not a framework but a valuable tool nonetheless, ONNX stands for “Open Neural Network Exchange.” It’s an open-source format designed to represent machine learning models. ONNX gives data scientists a common set of operators and a common file format to use between frameworks, tools, runtimes, and compilers. Existing models can be exported to and from the ONNX format.
Learn more about ONNX at https://onnx.ai/.
21. Pandas
Pandas (the name derives from “panel data”) is a machine learning tool used for exploring, cleaning, transforming, and visualizing data so it can be used in machine learning models and training. It’s an open-source Python library built on top of NumPy. Pandas historically offered three data structures: Series, DataFrame, and Panel, though Panel has been deprecated and removed in recent versions.
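A small sketch of a typical clean-then-aggregate step (the column names and values are invented for illustration):

```python
import pandas as pd

# Raw data with a missing value, as often arrives from the real world.
df = pd.DataFrame({
    "city": ["Chicago", "Austin", "Chicago", "Austin"],
    "sales": [120.0, None, 80.0, 100.0],
})

# Clean: fill the gap with the column mean (here, 100.0) ...
df["sales"] = df["sales"].fillna(df["sales"].mean())

# ... then aggregate: total sales per city.
totals = df.groupby("city")["sales"].sum()

print(totals)
```

Both `fillna` and `groupby` operate on whole columns at once, which is the pandas idiom: describe the transformation, and let the library apply it row by row internally.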
Learn more about Pandas at https://pandas.pydata.org/.
22. Plotly
The Python Plotly library offers over 40 different chart types and visualizations, which can be displayed in Jupyter notebooks, embedded in HTML, or served as part of applications built on Dash.
Learn more about Plotly at https://plotly.com/.
23. Pydot
Pydot is a Python interface for Graphviz’s Dot that can parse and dump files in the DOT language. Pydot lets data scientists handle, modify, and process graphs, as well as visualize the structure of graphs such as neural networks.
Learn more about Pydot at https://pypi.org/project/pydot/.
24. PyTorch
PyTorch is another open-source Python framework that allows data scientists to quickly perform deep learning tasks. PyTorch is used by Salesforce, Stanford University, Udacity, and more to perform tensor computations and build dynamic neural networks. PyTorch is based on Torch, an open-source deep learning library written in C with a Lua scripting interface.
Learn more about PyTorch at https://pytorch.org/.
25. SciPy
SciPy is an open-source ecosystem for mathematics and scientific computing, covering tasks such as linear algebra, integration, differential equation solving, and signal processing. It contains several useful core packages, including NumPy, IPython, the SciPy library, Matplotlib, SymPy, and pandas.
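A quick sketch of two of those capabilities, numerical integration and linear algebra, using the SciPy library itself:

```python
import numpy as np
from scipy import integrate, linalg

# Numerical integration: the area under sin(x) from 0 to pi is exactly 2.
area, _error = integrate.quad(np.sin, 0, np.pi)

# Linear algebra: solve the system  3a + b = 9,  a + 2b = 8.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = linalg.solve(A, b)

print(round(area, 6))  # 2.0
print(solution)        # [2. 3.]
```

`quad` also returns an error estimate (captured here as `_error`), a common SciPy pattern: routines report not just an answer but how much to trust it.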
Learn more about SciPy at https://www.scipy.org/.
26. Shogun
Shogun is an open-source machine learning library which supports many data science programming languages, including Python, Octave, R, Java/Scala, Lua, C#, and Ruby. It supports many algorithms, such as dimensionality reduction algorithms, clustering algorithms, and support vector machines. It’s capable of processing huge datasets, making it a solid choice for enterprise applications.
Learn more about Shogun at https://www.shogun-toolbox.org/.
27. Spark MLlib
MLlib is Apache Spark’s machine learning library; Spark was developed at UC Berkeley and is capable of processing very large amounts of data at high speed. It can be up to 100 times faster than Hadoop for large-scale data processing thanks to its query optimizer and physical execution engine. Data scientists can write applications in Java, Scala, Python, R, and SQL.
Learn more about Spark MLlib at https://spark.apache.org/.
28. Seaborn
Seaborn is a Python data visualization library for drawing “attractive and informative” statistical graphs. Seaborn is based on Matplotlib. It includes a variety of visualizations to choose from, including time series and joint plots.
Learn more about Seaborn at https://seaborn.pydata.org/.
29. Theano
Theano is a “Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.” It has tight integration with NumPy, can execute computations on a GPU far faster than on a CPU, evaluates expressions quickly, and contains built-in unit-testing and self-verification. Unfortunately, Theano was last updated in 2017 and is slowly being replaced by other tools.
Learn more about Theano at http://deeplearning.net/software/theano/.
30. Veles
Veles is an open-source tool for binary data analysis. Veles allows data scientists to transform binary code into human-understandable visualizations, so they can reverse engineer binaries, explore file system images, or engage in steganography with ease.
Learn more about Veles at https://codisec.com/veles/.
31. XGBoost
XGBoost, which stands for eXtreme Gradient Boosting, is an open-source tool developed by Tianqi Chen and now part of the Distributed Machine Learning Community (DMLC). XGBoost is a widely popular tool for regression, classification, ranking, model tuning, and algorithm enhancement, and has been tested in enterprise-level projects. According to its creator, “…xgboost used a more regularized model formalization to control over-fitting, which gives it better performance.”
Learn more about XGBoost at https://xgboost.readthedocs.io/en/latest/.
Keep Moving Forward with Aptude
Aptude is your own personal IT professional services firm. We provide our clients with first class resources in a continuous, cost-containment fashion.
Our support services will free your senior IT staff from the overwhelming burden of day-to-day maintenance issues. They’ll have time to launch those new projects and applications you’ve been waiting for. Simply put, we can free up your resources and contain your costs. Let’s have a quick chat to discuss our exclusive services.