Apache MXNet is another open-source framework, this time for deep learning. MXNet integrates deeply with Python and also supports Scala, Julia, Clojure, Java, C++, R, and Perl. One of its main draws is the ability to mix symbolic and imperative programming for maximum productivity. Another is the ability to scale and distribute training.
Learn more about MXNet at https://mxnet.apache.org/.
NumPy (“numerical Python”) is another Python library, this time an array-processing package for numerical and scientific computing. Its array operations run in speed-optimized C code and can be up to 50x faster than the equivalent operations on Python lists, making them ideal for data science purposes.
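As a rough illustration of the array-versus-list difference described above, the sketch below squares the same numbers once with a Python list comprehension and once with a single vectorized NumPy operation. The sample data and variable names are made up for this example, and no timing figures are claimed here; actual speedups depend on the workload and machine.

```python
import numpy as np

# Hypothetical sample data (illustrative only): 100,000 integers.
values = list(range(100_000))

# Pure-Python approach: the interpreter handles one element at a time.
squared_list = [v * v for v in values]

# NumPy approach: one vectorized operation that runs in compiled code.
arr = np.array(values, dtype=np.int64)
squared_arr = arr * arr

# Both approaches produce the same numbers.
same = np.array_equal(squared_arr, np.array(squared_list, dtype=np.int64))
print(same)
```

To compare speeds on your own machine, wrap each approach in `timeit` from the standard library.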
Learn more about NumPy at https://numpy.org/.
TensorFlow is an “end-to-end open source machine learning platform” that helps data scientists develop and train machine learning (ML) models. It’s especially useful for building prototypes quickly. Data scientists can work in a language they already know, such as Python, JavaScript, C++, or Java, to train and deploy models in the cloud or on-premises.
Learn more about TensorFlow at https://www.tensorflow.org/.
Scikit-learn is an easy-to-learn, open-source Python library for machine learning built on NumPy, SciPy, and Matplotlib. It can be used for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
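To make a few of those tasks concrete, here is a minimal sketch that combines preprocessing, classification, and a simple hold-out evaluation on the iris dataset bundled with scikit-learn. The choice of logistic regression, the 25% test split, and the random seed are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the small iris dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for evaluation (arbitrary choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing (scaling) and classification chained in one pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Fraction of held-out samples classified correctly.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Swapping `LogisticRegression` for another estimator leaves the rest of the code unchanged, which is much of what makes the library easy to learn.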
Learn more about Scikit-learn at https://scikit-learn.org/stable/.
Not a framework but a valuable tool nonetheless, ONNX stands for “Open Neural Network Exchange.” It’s an open-source format designed to represent machine learning models. ONNX gives data scientists a common set of operators and a common file format to use between frameworks, tools, runtimes, and compilers. Existing models can be exported to and from the ONNX format.
Learn more about ONNX at https://onnx.ai/.
Pandas (a name derived from “panel data”) is a data analysis tool used for exploring, cleaning, transforming, and visualizing data so it can be used to train machine learning models. It’s an open-source Python library built on top of NumPy. Its two main data structures are the Series and the DataFrame; a third, the Panel, has been removed from recent releases.
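The short sketch below shows a Series and a DataFrame, followed by one cleaning step, one transformation, and one exploratory summary. The city names and temperature readings are invented for this example.

```python
import pandas as pd

# A Series is a one-dimensional labeled array (made-up readings).
temps = pd.Series([21.5, None, 19.0], index=["mon", "tue", "wed"], name="temp_c")

# A DataFrame is a two-dimensional table of labeled columns.
df = pd.DataFrame(
    {"city": ["Oslo", "Oslo", "Rome"], "temp_c": [4.0, None, 18.0]}
)

# Cleaning: drop rows where the temperature is missing.
clean = df.dropna(subset=["temp_c"])

# Transforming: add a derived Fahrenheit column.
clean = clean.assign(temp_f=clean["temp_c"] * 9 / 5 + 32)

# Exploring: mean Fahrenheit temperature per city.
summary = clean.groupby("city")["temp_f"].mean()
print(summary)
```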
Learn more about Pandas at https://pandas.pydata.org/.
The Python Plotly library is a plotting library with over 40 chart types and visualizations that can be displayed in Jupyter notebooks, in HTML, or as part of applications built on Dash.
Learn more about Plotly at https://plotly.com/.
Pydot is a Python interface for Graphviz’s Dot that can parse and dump into the DOT language. Pydot lets data scientists handle, modify, and process graphs, as well as show the structure of graphs so they can be used in neural networks.
Learn more about Pydot at https://pypi.org/project/pydot/.
PyTorch is another open-source Python framework that allows data scientists to quickly perform deep learning tasks. PyTorch is used by Salesforce, Stanford University, Udacity, and more to perform tensor computations and build dynamic neural networks. PyTorch is based on Torch, an open-source machine learning library with a Lua scripting interface over a C backend.
Learn more about PyTorch at https://pytorch.org/.
SciPy is an open-source ecosystem for mathematics and scientific computing, covering areas such as linear algebra, integration, differential equation solving, and signal processing. It contains several useful core packages, including NumPy, IPython, the SciPy library, Matplotlib, SymPy, and pandas.
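As a small taste of two of those areas, the sketch below uses the SciPy library to evaluate a definite integral numerically and to solve a small linear system. The integrand and the matrix are arbitrary examples chosen because their exact answers are easy to check by hand.

```python
import numpy as np
from scipy import integrate, linalg

# Integration: integrate sin(x) from 0 to pi (the exact value is 2).
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)

# Linear algebra: solve A x = b for a 2x2 system.
# 3x + y = 9 and x + 2y = 8 have the solution x = 2, y = 3.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)

print(area)
print(x)
```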
Learn more about SciPy at https://www.scipy.org/.
Shogun is an open-source machine learning library that supports many languages used in data science, including Python, Octave, R, Java/Scala, Lua, C#, and Ruby. It supports many algorithms, such as dimensionality reduction, clustering, and support vector machines. It’s capable of processing huge datasets, making it a viable choice for enterprise applications.
Learn more about Shogun at https://www.shogun-toolbox.org/.
27. Spark MLlib
MLlib is Apache Spark’s machine learning library. Spark originated at UC Berkeley and can process very large amounts of data at high speed. The project describes it as up to 100 times faster than Hadoop MapReduce for some large-scale workloads, thanks in part to its query optimizer and physical execution engine. Data scientists can write applications in Java, Scala, Python, R, and SQL.
Learn more about Spark MLlib at https://spark.apache.org/.
Seaborn is a Python data visualization library for drawing “attractive and informative” statistical graphs. Seaborn is based on Matplotlib. It includes a variety of visualizations to choose from, including time series and joint plots.
Learn more about Seaborn at https://seaborn.pydata.org/.
Theano is a “Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.” It integrates tightly with NumPy, can use a GPU to perform data-intensive computations faster than on a CPU, evaluates expressions quickly, and contains built-in unit-testing and self-verification. Unfortunately, major development of Theano ended in 2017, and it is gradually being replaced by other tools.
Learn more about Theano at http://deeplearning.net/software/theano/.
Veles is an open-source tool for binary data analysis. Veles allows data scientists to transform binary code into human-understandable visualizations, so they can reverse engineer binaries, explore file system images, or engage in steganography with ease.
Learn more about Veles at https://codisec.com/veles/.
XGBoost, which stands for eXtreme Gradient Boosting, is an open-source tool developed by Tianqi Chen and now part of the Distributed Machine Learning Community (DMLC). XGBoost is a widely used tool for regression, classification, and ranking, and it has been tested in enterprise-level projects. According to its creator, “…xgboost used a more regularized model formalization to control over-fitting, which gives it better performance.”
Learn more about XGBoost at https://xgboost.readthedocs.io/en/latest/.