Ruchira A. Kulkarni
The progressive transition in both scientific and industrial datasets has been the driving force behind development and research interest in the NoSQL model. Loosely structured data poses a challenge to traditional data store systems, and when working with the NoSQL model, these systems are often considered impractical and costly. As the quantity and quality of less structured data grow, so does the demand for a processing pipeline capable of seamlessly binding the NoSQL storage model with MapReduce, a “Big Data” processing platform. Although MapReduce is the paradigm of choice for data-intensive computing, Java-based frameworks such as Hadoop require users to write MapReduce code in Java, while the Hadoop Streaming module lets users define non-Java executables as map and reduce operations. Legacy C/C++ applications and other non-Java executables create a further need to give NoSQL data stores access to the features of Hadoop Streaming. We present approaches to integrating NoSQL data stores with MapReduce in non-Java application scenarios, along with the benefits and drawbacks of each approach. We compare Hadoop Streaming with our own streaming framework, MARISSA, to assess the performance implications of coupling NoSQL data stores such as Cassandra with MapReduce frameworks that normally rely on file-system-based data stores. Our experiments also include Hadoop-C*, a configuration in which a Hadoop cluster is co-located with a Cassandra cluster in order to process data using Hadoop with non-Java executables.
Hadoop, Cassandra, NoSQL, Pipelines, MapReduce