Accelerating Data Analytics using Hadoop, Spark, and TensorFlow
Substantial data growth, heterogeneous data warehouses, and inconsistent data formats hinder organizations from gaining a better understanding of their business domains, because substantial effort is consumed in cleansing and transforming the data into a standard, usable data model for processing. Big Data analytics tools and deep learning libraries are widely used in operations research, recommendation systems, healthcare systems, personalized health-outcome improvement, and similar areas, and they are becoming increasingly operational in nature. However, this means more data marts, more preprocessing steps, and a more comprehensive reach throughout organizations. Keeping pace with this evolution requires designing predictive analytics models that provide quantifiable, actionable insights to improve a specific business domain. To this end, organizations start by targeting an innovative answer to a business question. Convolutional neural networks have developed massively over the last few years, and they play a significant role in image recognition and automated translation. TensorFlow is a framework released by Google (almost 2 years ago) for graph-based numerical computation and the development of deep learning neural networks.

In this tutorial, we demonstrate how to install and configure an environment for Big Data and Deep Learning. Furthermore, we demonstrate how to use TensorFlow and Spark together to train and apply deep learning models in a data science project.

The data science initiative at the University of Calgary is one of its research priority pillars. During Winter 2015, the first undergraduate course on this topic at the University of Calgary, "Engineering Large-Scale Analytics Systems", was offered by the Department of Electrical and Computer Engineering (ECE); it was designed and delivered by the presenters. Recently, we had the opportunity to build the Multi-Modal Data Fusion (MMDF) lab in the ECE department.
The lab is equipped with Hadoop and Spark clusters on top of commodity hardware. The cluster is used to study various data science projects involving significant volumes of data and to design proof-of-concept prototypes for different business domains, e.g., autonomous vehicles, healthcare analytics, and software development.