Big data technologies allow you to deal with massively large data sets. Common applications include meteorology, genomics, connectomics, complex physics simulations, biological and environmental research, Internet search, finance and business informatics. Technologies such as Hadoop, Pig, Hive, R and many others help you overcome the limitations of capturing, storing, searching, sharing, analysing and visualizing large datasets. We at sysmoth help you configure and deploy these technologies to suit your particular needs.
Apache Hadoop is an open-source initiative, backed by a global community, for reliable, scalable, distributed computing that supports data-intensive distributed applications. It enables the distributed processing of large data sets (petabytes of data) across clusters of computers using the MapReduce framework, with each node contributing local computation and storage regardless of the data's structure. Failures are handled at the application layer, and the same code scales from a single machine to thousands. More value can be added to a Hadoop infrastructure through related Apache projects such as Pig, Hive and ZooKeeper. Google and Yahoo have played significant roles in the Hadoop project from its initiation to its maturity, and Hadoop is now widely considered the de facto standard for big data processing.
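To make the MapReduce idea concrete, here is a minimal single-machine sketch in Python (not Hadoop itself, just an illustration of the model) of the classic word-count job: the map step emits a (word, 1) pair for every word, the shuffle step groups pairs by key, and the reduce step sums the counts. The sample documents are invented for the example.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big clusters", "data everywhere"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In a real Hadoop job the map and reduce functions run in parallel on many machines, with the framework handling the shuffle, storage and failure recovery between them.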
Apache Pig is a platform for analysing large data sets that provides an easy way to create the MapReduce programs used with Hadoop. Pig was initially developed at Yahoo Research for its own projects and later donated to the Apache Software Foundation. Pig is built around a high-level language, "Pig Latin", which is easy to learn and maintain. Apache Pig and Hive can be used with Hadoop to achieve similar solutions, but the choice between them depends on the user's requirements.
The R Project
R is an open-source software environment for statistical computing and graphics. Its intended users are mostly statisticians, and it is used primarily for analysing data and developing statistical software. The environment is written in a combination of C, Fortran and R. Various GUIs are available in addition to the command line interface. R is compiled and supported on UNIX, Windows and macOS.
To overcome the restriction of Hadoop usage to Java programmers and its error-prone Java APIs, "Hive" was developed to improve programmability and enable Hadoop to operate as a data warehouse infrastructure. Hive allows you to impose structure on data and to query it with its own SQL-like language, "HiveQL". Apache states that Hive "facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems". Because Hadoop cannot deliver low-latency query responses, Hive jobs are submitted as batches and the user is notified when they complete; response times average 5-10 minutes for small jobs and can stretch to hours for larger ones. Compared to Pig, Hive is better suited to data with a static structure that requires frequent analysis.
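To illustrate the kind of declarative querying Hive enables, here is a small Python sketch using the standard library's sqlite3 module as a stand-in: HiveQL queries read much like this SQL, letting you express a summarization in one statement instead of hand-writing a MapReduce job. The `page_views` table and its rows are invented for the example; real HiveQL syntax and semantics differ in places from SQLite's SQL.

```python
import sqlite3

# A stand-in for a Hive table holding raw log records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("alice", "/home"), ("bob", "/home"), ("alice", "/about")],
)

# An ad-hoc summarization query: views per URL, most viewed first.
# On Hive, a query like this would be compiled into MapReduce jobs.
rows = conn.execute(
    "SELECT url, COUNT(*) AS views FROM page_views "
    "GROUP BY url ORDER BY views DESC"
).fetchall()
print(rows)  # [('/home', 2), ('/about', 1)]
```

The point of the comparison: the same aggregation written against Hadoop's raw Java API would require a custom mapper, reducer and driver class, which is exactly the programmability gap Hive was built to close.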