'Big Data Analytics Beyond Hadoop:' Foreword of the book One point that I attempt to impress upon people learning about Big Data is that while Apache Hadoop is quite useful, and most certainly quite successful as a technology, the underlying premise has become dated. Consider the timeline: MapReduce implementation by Google came from work that dates back to 2002, published in 2004. Yahoo! began to sponsor the Hadoop project in 2006. MR is based on the economics of data centers from a decade ago. Since that time, so much has changed: multi-core processors, large memory spaces, 10G networks, SSDs, and such, have become cost-effective in the years since. These dramatically alter the trade-offs for building fault-tolerant distributed systems at scale on commodity hardware. Moreover, even our notions of what can be accomplished with data at scale have also changed. Successes of firms such as Amazon, eBay, and Google raised the bar, bringing subsequent business leaders to rethink, “What can be performed with data?” For example, would there have been a use case for large-scale graph queries to optimize business for a large book publisher a decade ago? No, not particularly. It is unlikely that senior executives in publishing would have bothered to read such an outlandish engineering proposal. The marketing of this book itself will be based on a large-scale, open source, graph query engine described in subsequent chapters. Similarly, the ad-tech and social network use cases that drove the development and adoption of Apache Hadoop are now dwarfed by data rates from the Industrial Internet, the so-called “Internet of Things” (IoT)—in some cases, by several orders of magnitude. The shape of the underlying systems has changed so much since MR at scale on commodity hardware was first formulated. The shape of our business needs and expectations has also changed x BIG DATA ANALYTICS BEYOND HADOOP dramatically because many people have begun to realize what is possible. Furthermore, the applications of math for data at scale are quite different than what would have been conceived a decade ago. Popular programming languages have evolved along with that to support better software engineering practices for parallel processing. Dr. Agneeswaran considers these topics and more in a careful, methodical approach, presenting a thorough view of the contemporary Big Data environment and beyond. He brings the read to look past the preceding decade’s fixation on batch analytics via MapReduce. The chapters include historical context, which is crucial for key understandings, and they provide clear business use cases that are crucial for applying this technology to what matters. The arguments provide analyses, per use case, to indicate why Hadoop does not particularly fit—thoroughly researched with citations, for an excellent survey of available open source technologies, along with a review of the published literature for that which is not open source. This book explores the best practices and available technologies for data access patterns that are required in business today beyond Hadoop: iterative, streaming, graphs, and more. For example, in some businesses revenue loss can be measured in milliseconds, such that the notion of a “batch window” has no bearing. Real-time analytics are the only conceivable solutions in those cases. Open source frameworks such as Apache Spark, Storm, Titan, GraphLab, and Apache Mesos address these needs. Dr. Agneeswaran guides the reader through the architectures and computational models for each, exploring common design patterns. He includes both the scope of business implications as well as the details of specific implementations and code examples. Along with these frameworks, this book also presents a compelling case for the open standard PMML, allowing predictive models to be migrated consistently between different platforms and environments. It also leads up to YARN and the next generation beyond MapReduce. FOREWORD xi This is precisely the focus that is needed in industry today—given that Hadoop was based on IT economics from 2002, while the newer frameworks address contemporary industry use cases much more closely. Moreover, this book provides both an expert guide and a warm welcome into a world of possibilities enabled by Big Data analytics. Contents: Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix About the Author. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii Chapter 1 Introduction: Why Look Beyond Hadoop Map-Reduce? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Hadoop Suitability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Big Data Analytics: Evolution of Machine Learning Realizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Closing Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 Chapter 2 What Is the Berkeley Data Analytics Stack (BDAS)? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .21 Motivation for BDAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 BDAS Design and Architecture. . . . . . . . . . . . . . . . . . . . . 26 Spark: Paradigm for Efficient Data Processing on a Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Shark: SQL Interface over a Distributed System . . . . . . . 42 Mesos: Cluster Scheduling and Management System . . . 46 Closing Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Chapter 3 Realizing Machine Learning Algorithms with Spark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .61 Basics of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . 61 Logistic Regression: An Overview . . . . . . . . . . . . . . . . . . . 67 Logistic Regression Algorithm in Spark. . . . . . . . . . . . . . . 70 Support Vector Machine (SVM) . . . . . . . . . . . . . . . . . . . . 74 PMML Support in Spark . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Machine Learning on Spark with MLbase . . . . . . . . . . . . 90 References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 viii BIG DATA ANALYTICS BEYOND HADOOP Chapter 4 Realizing Machine Learning Algorithms in Real Time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .93 Introduction to Storm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Design Patterns in Storm . . . . . . . . . . . . . . . . . . . . . . . . . 102 Implementing Logistic Regression Algorithm in Storm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Implementing Support Vector Machine Algorithm in Storm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Naive Bayes PMML Support in Storm . . . . . . . . . . . . . . 113 Real-Time Analytic Applications . . . . . . . . . . . . . . . . . . . 116 Spark Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 Chapter 5 Graph Processing Paradigms. . . . . . . . . . . . . . . . . . . . .129 Pregel: Graph-Processing Framework Based on BSP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 Open Source Pregel Implementations. . . . . . . . . . . . . . . 134 GraphLab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 Chapter 6 onclusions: Big Data Analytics Beyond Hadoop Map-Reduce. . . . . . . . . . . . . . . . . . . . . . . . . . .161 Overview of Hadoop YARN . . . . . . . . . . . . . . . . . . . . . . . 162 Other Frameworks over YARN . . . . . . . . . . . . . . . . . . . . 165 What Does the Future Hold for Big Data Analytics? . . . 166 References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 Appendix A Code Sketches . . . . . . . . . . . . . . . . . . . . . . . . .171 Code for Naive Bayes PMML Scoring in Spark . . . . . . . 171 Code for Linear Regression PMML Support in Spark . . . . . . . . . . . . . . . . . . . . . . . . 182 Page Rank in GraphLab . . . . . . . . . . . . . . . . . . . . . . . . . . 186 SGD in GraphLab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Title
Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives
Vijay Srinivas Agneeswaran, Ph.D., has a Bachelor’s degree in Computer Science & Engineering from SVCE, Madras University (1998), an MS (By Research) from IIT Madras in 2001, and a PhD from IIT Madras (2008). He was a post-doctoral research fellow in the Distributed Information Systems Laboratory (LSIR), Swiss Federal Institute of Technology, Lausanne (EPFL) for a year. He has spent the last seven years with Oracle, Cognizant, and Impetus, contributing significantly to Industrial R&D in the big data and cloud areas. He is currently Director of Big Data Labs at Impetus. The R&D group provides thoughtleadership through patents, publications, invited talks at conferences, and next generation product innovations. The main focus areas for his R&D include big data governance, batch and real-time analytics, as well as paradigms for implementing machine learning algorithms for big data. He is a professional member of the Association of Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE) for the last eight+ years and was elevated to Senior Member of the IEEE in December 2012. He has filed patents with U.S., European, and Indian patent offices (with two issued U.S. patents). He has published in leading journalsand conferences, including IEEE transactions. He has been an invited speaker in several national and international conferences such as O’Reilly’s Strata Big-Data conference series. His recent publications have appeared in the Big Data journal of Liebertpub. He lives in Bangalore with his wife, son, and daughter, and enjoys researching ancient Indian, Egyptian, Babylonian, and Greek culture and philosophy.