- File Size: 9229 KB
- Print Length: 606 pages
- Simultaneous Device Usage: Unlimited
- Publisher: O'Reilly Media; 1 edition (February 8, 2018)
- Publication Date: February 8, 2018
- Sold by: Amazon.com Services LLC
- Language: English
- ASIN: B079P71JHY
- Text-to-Speech: Enabled
- Word Wise: Not Enabled
- Lending: Not Enabled
- Amazon Best Sellers Rank: #108,742 Paid in Kindle Store
Spark: The Definitive Guide: Big Data Processing Made Simple 1st Edition, Kindle Edition
From the Publisher
Spark’s toolkit: illustrates all the components and libraries Spark offers to end users.
What Is Apache Spark?
Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. As of this writing, Spark is the most actively developed open source engine for this task, making it a standard tool for any developer or data scientist interested in big data. Spark supports multiple widely used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers. This makes it an easy system to start with and to scale up to big data processing at an incredibly large scale.
Although the project has existed for multiple years (first as a research project started at UC Berkeley in 2009, then at the Apache Software Foundation since 2013), the open source community is continuing to build more powerful APIs and high-level libraries over Spark, so there is still a lot to write about the project. We decided to write this book for two reasons. First, we wanted to present the most comprehensive book on Apache Spark, covering all of the fundamental use cases with easy-to-run examples. Second, we especially wanted to explore the higher-level 'structured' APIs that were finalized in Apache Spark 2.0 (namely DataFrames, Datasets, Spark SQL, and Structured Streaming), which older books on Spark don’t always include. We hope this book gives you a solid foundation to write modern Apache Spark applications using all the available tools in the project.
Who This Book Is For
We designed this book mainly for data scientists and data engineers looking to use Apache Spark. The two roles have slightly different needs, but in reality, most application development covers a bit of both, so we think the material will be useful in both cases. Specifically, in our minds, the data scientist workload focuses more on interactively querying data to answer questions and build statistical models, while the data engineer job focuses on writing maintainable, repeatable production applications, either to use the data scientist’s models in practice or just to prepare data for further analysis (e.g., building a data ingest pipeline). However, we often see with Spark that these roles blur. For instance, data scientists are able to package production applications without too much hassle, and data engineers use interactive analysis to understand and inspect their data to build and maintain pipelines.
While we tried to provide everything data scientists and engineers need to get started, there are some things we didn’t have space to focus on in this book. First, this book does not include in-depth introductions to some of the analytics techniques you can use in Apache Spark, such as machine learning. Instead, we show you how to invoke these techniques using libraries in Spark, assuming you already have a basic background in machine learning. Many full, standalone books exist to cover these techniques in formal detail, so we recommend starting with those if you want to learn about these areas. Second, this book focuses more on application development than on operations and administration (e.g., how to manage an Apache Spark cluster with dozens of users). Nonetheless, we have tried to include comprehensive material on monitoring, debugging, and configuration in Parts V and VI of the book to help engineers get their applications running efficiently and tackle day-to-day maintenance. Finally, this book places less emphasis on the older, lower-level APIs in Spark (specifically RDDs and DStreams) in order to introduce most of the concepts using the newer, higher-level structured APIs. Thus, the book may not be the best fit if you need to maintain an old RDD or DStream application, but it should be a great introduction to writing new applications.
About the Authors
Bill Chambers is a Product Manager at Databricks focusing on large-scale analytics, strong documentation, and collaboration across the organization to help customers succeed with Spark and Databricks. He has a Master's degree in Information Systems from the UC Berkeley School of Information, where he focused on data science.
Matei Zaharia is an assistant professor of computer science at Stanford University and Chief Technologist at Databricks. He started the Spark project at UC Berkeley in 2009, where he was a PhD student, and he continues to serve as its vice president at Apache. Matei also co-started the Apache Mesos project and is a committer on Apache Hadoop. Matei’s research work was recognized through the 2014 ACM Doctoral Dissertation Award and the VMware Systems Research Award.
My only complaint is that you can't use Kindle Cloud Reader. For a normal book it might not be an issue, but for a programming book, you'd probably want to read it on your computer so you can take notes, type in examples, and search. I've bought other O'Reilly books and haven't had this issue in the past (this book seems to be the exception). Right now you're limited to Kindle apps, so a table might look like this on your phone or tablet:
The more I reference this book, the more I think it's a big disadvantage.
After presenting how Spark works and the Structured and low-level RDD APIs, the book helps you deploy, monitor, and tune your application to run on a cluster. There is a detailed section on Structured Streaming explaining windowing and event-time processing, plus a section on advanced machine learning analytics.
+ Great intro text.
+ Very detailed with lots of code samples.
+ ML section is thorough (if limited in depth)
+ all code is on GitHub :)
+ tuning and optimizations sections
- Organization is a little choppy: understanding Structured Streaming aggregations, for example, requires jumping back and forth to the aggregations section.
- Copy-pasting code samples is annoying.
- Kindle for Mac is sucky: resizing windows and adjusting text size breaks the flow, sometimes requiring a restart. Indexing is weird and it "depaginates".
- Could use a few sections in wide vs narrow...
The book describes the Spark architecture clearly and systematically and has a lot of outstanding examples that help the reader become familiar with the rather brilliant Spark programming models.
The presentation of the material is excellent, and the explanations are supportive and aid understanding.
It is a very nice book on the very admirable Spark system!
As somebody interested in ML applications I was rather disappointed.
Top international reviews
I’m bookmarking virtually every 3rd page because there are such good examples.
Some spelling errors here and there, but well worth the money.
On the other hand a great collection of examples and a great start into Apache Spark!