Book Review - Hadoop Beginner's Guide by Garry Turkington (Packt Publishing)
This is my second attempt to get familiar with Hadoop and the ecosystem of technologies that accompany the project. I belong to the group of Java developers who unfortunately do not (yet) work on projects that need Hadoop or similar technologies. Nevertheless, we all consume services every day that are based on the special magic offered by Hadoop, and we all produce vast amounts of data that pile up in remote data centers, waiting to be 'processed'. That is why I consider it worthwhile to invest at least some of my free tech reading time in learning more about this technology, as it becomes more relevant for enterprises, software suppliers, developers and, for sure, end users. Hadoop Beginner's Guide was a good fit for beginners like me. I had previous experience with a similar book from O'Reilly.
I don't want to be unfair in comparing these two books. The first one I read was published at least a year ago, while this one is fairly new, which means up to date. At the same time, Hadoop is not an easy subject at all, at least in my personal experience. The very first time you seriously start experimenting with it, you need to overcome many beginner's obstacles: setting up the environment, finding the appropriate version of the API to start hacking examples and, of course, battling your own RDBMS developer nature, built upon years and years of working with Oracle, MySQL, SQL Server or other relational databases. The other way, of course, is to meet Hadoop in your daily work, in which case a different kind of learning process takes place. But that was not my case. What I am trying to say is that the very first time you try to enter this world, especially when there is no daily work incentive, the learning curve is steep. You can have an initial read, try the examples, and still be asking yourself: OK, how could I put all of this together in my day-to-day tasks?
The next time you go through the very same process, a year later, some of the basic ideas are already there, so you are more ready to accept the very same information, absorb it, or take a step further when trying to use it. This time I was no longer a beginner when it came to the basic ideas, the basic setup, or writing simple map-reduce examples, but I still consider myself a beginner overall.
One of the things that I really liked in this book was the small but very helpful "What just happened?" sections. After each code hack or configuration fragment illustrated in a topic, these sections explain, step by step, what exactly we just did. In the end this is the question every beginner asks when they start to experiment with an exotic configuration or API. So it is a really nice idea from the author to follow the beginner's mentality.
Another good addition to many examples across the chapters is the separate section, after each exercise or topic, that illustrates the same principles applied to an environment very different from the local Hadoop installation: the Amazon cloud. The book runs many of its examples on Amazon's cloud services (EMR). So along with the local Hadoop environment you can experiment on the Amazon infrastructure. Great idea, especially for beginners who want a crash course on how to make use of Hadoop and related services on a real cloud.
The first 5 chapters cover basic stuff around Hadoop, from theory to action (writing simple map-reduce programs, making use of streaming, graphs, joins and alternative scripting tools along with the Java API). Chapters 6 and 7 go a bit deeper into the Hadoop lifecycle, with an emphasis on what to consider when things don't work as expected. Configuration is also covered in more detail in this section, so it can be a good reference for more experienced users.
Next, there are chapters that dive deep into tools and technologies like Hive and Flume. I found Chapter 9 really interesting; it covers cases where you combine Hadoop with traditional RDBMS systems like MySQL, and how you could import data from an RDBMS into Hadoop. Chapter 11 was also interesting, as it references all the related projects in the Hadoop tech stack and elaborates on their strengths and functionality.
My only (small) problem while reading the book was the errors in some of the examples. I am sure that all of these are being corrected in the book's errata. I have to admit I was sort of expecting such a thing, since I had noticed the very same problem in the first Hadoop book I read a year ago. Maybe it is the technology itself that keeps changing these recent years, with new API versions, tools and so on, so I can understand that it is really hard to write a book whose examples for setting up or configuring Hadoop version X keep working, or keep matching other reference material, over time.
Overall, a very good book and worth buying if you are a beginner like me. I think I made a small step ahead in my Hadoop learning experience.
You may find the book here.