Posts

Showing posts from 2013

Centralize logs with logstash

Nowadays logging is an essential part of any application. Logging useful pieces of information can help find errors, fix bugs and much more. Modern applications now scale up to hundreds of servers in the cloud. Managing and monitoring the logs of heterogeneous systems is very challenging for any system administrator, and even more challenging for developers trying to fix bugs. Last Friday evening we started our testing with 3rd-party products and got stuck on a few bugs. As usual, we first looked at the logs and tried to find some hints to reproduce the bugs. Here we ran into a serious problem: our application was scaled across a few application servers such as Oracle GlassFish and Apache Tomcat. It was not pleasant to search all over the servers to find a few pieces of information. Here we understood that we have to use some tool to manage, monitor and search logs. From my experience we had a few options: 1) Flume, Hadoop HDFS and ElasticSearch. 2) Kafka, Storm and Solr. 3) Logstash and Grayl

Real time data processing with Cassandra, Part 1

This is the first part of getting started with real-time data processing with Cassandra. In this first part I am going to describe how to configure Hadoop, Hive and Cassandra, along with some ad-hoc queries using the new CqlStorageHandler. In the second part I will show how to use Shark and Spark for fast real-time data processing with Cassandra. I was encouraged by the blog post from DataStax; you can find it here . Also, all the credit goes to the author of the library cassandra-handler and to Alex Lui for developing the CQLCassandraStorage. Of course you can use the DataStax Enterprise version for the first part; DataStax Enterprise has built-in support for Hive and Hadoop. In this blog post I will use only native Apache products. If you are interested in real-time data processing, please check this blog . In the first part I will use the following products: 1) Hadoop 1.2.1 (single-node cluster) 2) Hive 0.9.0 3) Cassandra 1.2.6 (single-node cluster) 4) cassandra-handler 1.2.6 (depends on Hive

An impatient start with Cascading

Over the last couple of years working with Hadoop, I have heard in many blogs and at many conferences about the Cascading framework on top of Hadoop for implementing ETL. But in our lean startup project we decided to use Pig, and we implemented our data flow based on Pig. Recently I got a new book from O'Reilly Media, "Enterprise Data Workflows with Cascading", and finished two interesting chapters in one breath. This weekend I managed a couple of hours to try some of the examples from the book. Author Paco Nathan very nicely explains why and when you should use Cascading instead of Pig or Hive, and he even gives examples to try at home. Today's blog post is my first impression of Cascading. All the examples from the book can be found on GitHub . I cloned the project from GitHub and was ready to run the examples. The project Impatient compiles and builds with Gradle. I ran gradle clean jar and got stuck with the following errors: Could not resolve all

JVM multi-tenancy with ElastiCat

One of the speakers at the last JavaOne Moscow mentioned JVM multi-tenancy and declared its availability in Java 9. At the end of last year IBM also announced that their IBM J9 JVM could host multiple applications. Out of curiosity I became interested in this topic and from time to time read a few articles and watched presentations. For anyone not familiar with JVM multi-tenancy yet: JVM multi-tenancy means hosting more than one application on a single JVM. Here you will find a very good article about JVM multi-tenancy. Last Friday I got an e-mail from the DZone mailing list about ElastiCat , which provides JVM multi-tenancy and much more. After spending some time, I downloaded the product, deployed a simple web service on it and played with the JVM. Today I found a few spare hours after lunch and decided to share my experience with ElastiCat. ElastiCat, as they declare, provides Multitenancy, Isolation, Density and Elasticity through Tomcat. You can deploy more than one application on a single Tomcat and sha

Hadoop Map reduce with Cassandra Cql through Pig

One of the main disadvantages of using Pig is that Pig always loads all the data from Cassandra storage, and only after that can it filter by your criteria. It's easy to imagine the workload if you have hundreds of millions of rows in your CF. For example, in our production environment we always have more than 300 million rows, of which only 20-25 million are unprocessed. When we execute a Pig script, we get more than 5000 map tasks covering all 300 million rows. It's the kind of time-consuming, high-load batch processing we always tried to avoid, but in vain. It would be very nice if we could use a CQL query with a where clause in Pig scripts to select and filter our data. The benefit is clear: less data consumed, fewer map tasks and a lighter workload. In the latest version of Cassandra (1.2.6) this feature is still not available. This feature is planned for the next version, Cassandra 1.2.7. However, a patch is already available for this feature, and with a few efforts we
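A rough back-of-the-envelope sketch shows why pushing the filter down to Cassandra matters. The numbers below are illustrative assumptions (roughly 60,000 rows per input split gives the ~5000 map tasks mentioned above; the 25 million figure is the unprocessed subset):

```java
// Rough illustration of why server-side filtering shrinks a Hadoop job:
// each input split becomes one map task, so fewer rows read means fewer tasks.
// The rows-per-split value is an assumption for illustration only.
public class MapTaskEstimate {

    // ceiling division: number of splits (= map tasks) needed to cover the rows
    static long mapTasks(long rows, long rowsPerSplit) {
        return (rows + rowsPerSplit - 1) / rowsPerSplit;
    }

    public static void main(String[] args) {
        long rowsPerSplit = 60_000L; // assumed split size
        System.out.println("full scan map tasks: "
                + mapTasks(300_000_000L, rowsPerSplit)); // all rows
        System.out.println("filtered map tasks:  "
                + mapTasks(25_000_000L, rowsPerSplit));  // unprocessed rows only
    }
}
```

With these assumed numbers the full scan needs 5000 map tasks while the filtered scan needs 417, which is the whole argument for a CQL where clause in one line of arithmetic.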

Leader election of Camel router through Zookeeper

Most modern distributed systems at some point come across the problem where only one instance of a process or job should run and the other instances of the same job or process should stand by. The Apache ZooKeeper project provides this functionality out of the box. I am not going to describe all the use cases or solutions that can be addressed with ZooKeeper; you can read the following great article for a quick start. Let me describe our use case in more detail: we have an Apache Camel component which polls some resources, and only one instance of the component should run at any time. This means there should be one leader and a few followers, and when the master or leader goes down, a follower should replace it. You can also use doozerd instead of ZooKeeper for leader election, but for me ZooKeeper is easier to maintain. It's very easy to install a multi-server ZooKeeper cluster; I recommend you follow the steps from this blog to set up a cluster. Now when we have done our cl
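The core of the standard ZooKeeper leader-election recipe can be sketched without the client calls themselves (which in a real Camel setup you would make through the ZooKeeper or Curator API): every candidate creates an EPHEMERAL_SEQUENTIAL znode under an election path, the lowest sequence number is the leader, and each follower watches only the znode immediately ahead of its own. The node names below are examples of the format ZooKeeper assigns; this is an illustration of the ordering logic only, not a working client:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the ZooKeeper leader-election recipe's ordering logic.
// Candidates are represented by their sequential znode names (e.g. "n_0000000003").
public class LeaderElection {

    // The candidate with the lowest sequence number is the leader.
    static String electLeader(List<String> znodes) {
        return Collections.min(znodes);
    }

    // A follower watches the znode immediately preceding its own, so when the
    // leader dies only one follower is notified (this avoids the herd effect).
    static String nodeToWatch(List<String> znodes, String own) {
        List<String> sorted = new ArrayList<>(znodes);
        Collections.sort(sorted);
        int idx = sorted.indexOf(own);
        return idx <= 0 ? null : sorted.get(idx - 1); // the leader watches nothing
    }

    public static void main(String[] args) {
        List<String> znodes = List.of("n_0000000007", "n_0000000003", "n_0000000005");
        System.out.println("leader: " + electLeader(znodes));
        System.out.println("watch:  " + nodeToWatch(znodes, "n_0000000007"));
    }
}
```

When the watched znode disappears, the follower re-reads the children and re-runs the same logic; if it now holds the lowest sequence number, it starts the Camel route and becomes the leader.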

Lesson learned : Hadoop + Cassandra integration

After a few weeks' break we at last completed tuning and configuring our Cassandra-Hadoop stack in production. It was exciting, and I decided to share our experience with everyone. 1) Cassandra versions >= 1.2 have some problems and don't integrate with Hadoop very well. The problem is with MapReduce: when we run any MapReduce job, it always assigns only one mapper regardless of the amount of data. See here for more detail. 2) If you are going to use Pig for your data analysis, think twice, because Pig always picks up all the data from Cassandra storage and only after that can it filter. If you have billions of rows and only a few million of them to aggregate, Pig will still pick up the billions of rows. Here you can find a comparison between Hadoop frameworks for executing MapReduce. 3) If you are using Pig, filter rows as early as possible; filter out fields that are null or empty. 4) When using Pig, try to model your CF slightly differently. Use the bucket pattern: sto
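The excerpt cuts off at the bucket pattern, but the idea behind it is to partition data into time-bounded buckets encoded in the row key, so a batch job can read only the recent buckets instead of scanning the whole CF. A minimal sketch of deriving such a key (the key format and the one-day bucket size are assumptions for illustration):

```java
// Sketch of the bucket pattern: derive a bucket id from the event timestamp
// so rows land in day-sized buckets; a batch job then reads only the buckets
// it cares about instead of scanning the entire column family.
public class BucketKey {
    static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    // e.g. entity "sensor42" + timestamp -> "sensor42:15887" (illustrative format)
    static String rowKey(String entity, long timestampMillis) {
        long bucket = timestampMillis / DAY_MILLIS; // days since the epoch
        return entity + ":" + bucket;
    }

    public static void main(String[] args) {
        long ts = 1_372_636_800_000L; // 2013-07-01T00:00:00Z
        System.out.println(rowKey("sensor42", ts));
    }
}
```

A Pig or MapReduce job that only needs yesterday's data then targets a handful of bucket keys rather than the billions of rows in the full CF, which lines up with points 2 and 3 above.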