
1. Hadoop and MapReduce

Hadoop is an open source software framework for storing and processing big data across large clusters of commodity hardware. MapReduce is the programming paradigm that allows that processing to scale across hundreds or thousands of servers in a Hadoop cluster. Popular Hadoop offerings include Cloudera, Hortonworks, and MapR. Image: Hadoop
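To make the paradigm concrete, here is the classic word-count job written against the Hadoop Java MapReduce API: the map step emits a (word, 1) pair for every token, and the reduce step sums those counts per word. Input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word count: map emits (word, 1), reduce sums the counts per word.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum)); // emit (word, total)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```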

2. Database/File System

The Hadoop Distributed File System (HDFS) manages the storage and retrieval of the data and metadata required for computation. Other popular file system and database approaches include HBase and Cassandra, two NoSQL databases designed to manage extremely large data sets. Image: iStock
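A minimal sketch of writing and reading a file through the HDFS Java API follows; the NameNode address and paths here are placeholders for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS read/write sketch. Adjust fs.defaultFS for a real cluster.
public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical host
    FileSystem fs = FileSystem.get(conf);

    // Write a small file; HDFS splits large files into blocks and
    // replicates each block across DataNodes.
    Path file = new Path("/user/demo/hello.txt");
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}
```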

3. Pig High-Level Programming

Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. The language abstracts the programming from the Java MapReduce idiom, raising MapReduce programming to a higher level, much as SQL does for relational database management systems. Pig was originally developed at Yahoo Research around 2006. In 2007, it moved into the Apache Software Foundation. Image: Pig/Hadoop
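For a sense of what Pig Latin looks like, here is a minimal sketch of the same word count run from Java through Pig's PigServer API, assuming a local Pig installation; the input and output paths are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Run a small Pig Latin script from Java. Pig compiles these
// statements into one or more MapReduce jobs under the covers.
public class PigWordCount {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.LOCAL); // or MAPREDUCE on a cluster

    pig.registerQuery("lines = LOAD 'input/words.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

    pig.store("counts", "output/wordcounts"); // triggers job execution
  }
}
```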

4. Hive Data Warehousing

Apache Hive is a data warehouse platform built on top of Hadoop that supports querying and managing large datasets residing in distributed storage through a SQL-like language called HiveQL. The language also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL. Image: Edureka.com
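A small sketch of querying Hive from Java over JDBC (via HiveServer2); the host, table, and column names below are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Query Hive through its JDBC interface (HiveServer2).
public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver"); // register the driver

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver:10000/default", "user", "");
         Statement stmt = conn.createStatement()) {

      // HiveQL looks like SQL but is compiled into MapReduce jobs.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits " +
          "FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```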

5. Cascading

Cascading is a Java application development framework for rich data analytics and data management apps running across “a variety of computing environments,” with an emphasis on Hadoop and API-compatible distributions, according to Concurrent, the company behind Cascading. Image: iStock
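Cascading expresses work as pipes connecting source and sink taps. Below is a sketch of the canonical "copy" flow, assuming the Cascading 2.x Hadoop API: it reads text from one HDFS path and writes it to another, with Cascading planning the flow into MapReduce jobs.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

// Simplest possible Cascading app: a "copy" flow from one HDFS
// location to another. Paths come from the command line.
public class CopyApp {
  public static void main(String[] args) {
    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, CopyApp.class);

    Tap inTap = new Hfs(new TextLine(), args[0]);                    // source tap
    Tap outTap = new Hfs(new TextLine(), args[1], SinkMode.REPLACE); // sink tap

    Pipe copyPipe = new Pipe("copy"); // pass records straight through

    FlowDef flowDef = FlowDef.flowDef()
        .addSource(copyPipe, inTap)
        .addTailSink(copyPipe, outTap);

    Flow flow = new HadoopFlowConnector(properties).connect(flowDef);
    flow.complete(); // run the flow
  }
}
```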

6. Big Data Integration Tools

Semi-automated modeling tools such as CR-X allow models to be developed interactively at rapid speed, and the tools can help set up the database that will run the analytics. CR-X is a real-time ETL (Extract, Transform, Load) big data integration tool and transformation engine. Image: iStock
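CR-X itself is a commercial product, so the following is only a generic illustration of the extract-transform-load pattern in plain Java, not CR-X's API: raw rows are read, cleaned, and written out in a load-ready form. Real ETL engines add parallelism, schema handling, and error recovery on top of this basic shape.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;

// A toy extract-transform-load pass (NOT CR-X's API).
public class TinyEtl {
  public static void main(String[] args) throws Exception {
    try (BufferedReader extract = new BufferedReader(new FileReader("raw_events.csv")); // hypothetical input
         PrintWriter load = new PrintWriter("clean_events.csv")) {                      // hypothetical output
      String row;
      while ((row = extract.readLine()) != null) {
        String[] fields = row.split(",");
        if (fields.length < 3) continue;    // transform: drop malformed rows
        String cleaned = String.join(",",
            fields[0].trim(),
            fields[1].trim().toLowerCase(), // transform: normalize case
            fields[2].trim());
        load.println(cleaned);              // load: write staging record
      }
    }
  }
}
```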

7. Analytic Databases

Specialized scale-out analytic databases such as Pivotal Greenplum or IBM Netezza offer very fast loading and reloading of data for the analytic models. Image: iStock
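As a rough illustration of programmatic loading, here is a batched JDBC insert sketch. The connection URL (Greenplum speaks the PostgreSQL wire protocol), table, and credentials are assumptions; production-scale loads would normally use the databases' native bulk utilities (such as COPY, gpload, or nzload) rather than row-at-a-time JDBC.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Batched JDBC insert into an analytic database (hypothetical host/table).
public class BatchLoad {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://greenplum-master:5432/analytics", "etl", "secret");
         PreparedStatement ps = conn.prepareStatement(
             "INSERT INTO fact_sales (sku, qty) VALUES (?, ?)")) {
      conn.setAutoCommit(false);
      for (int i = 0; i < 10_000; i++) {
        ps.setString(1, "sku-" + i);
        ps.setInt(2, i % 10);
        ps.addBatch();
        if (i % 1_000 == 0) ps.executeBatch(); // flush in chunks
      }
      ps.executeBatch();
      conn.commit();
    }
  }
}
```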

8. Customer Satisfaction Considerations

Big data analytical packages from ISVs (such as ClickFox) run against the database to address business issues such as customer satisfaction. Image: Pixabay

9. Transactional Approaches

Transactional big data projects can't use Hadoop, since it is a batch-oriented rather than real-time system. For transactional systems that do not require a database with ACID (Atomicity, Consistency, Isolation, Durability) guarantees, NoSQL databases can be used, though their consistency guarantees can be weak. Scale-out SQL databases, a new breed of offering, are also worth watching in this area; new entrants are emerging all the time. Image: Pixabay
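To illustrate the consistency trade-off, here is a sketch using Cassandra through the DataStax Java driver (3.x API): a weakly consistent write at level ONE versus a stronger QUORUM read. The contact point, keyspace, and table are hypothetical.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

// Tunable consistency in Cassandra: ONE is fast but weakly consistent;
// QUORUM trades latency for a stronger guarantee.
public class ConsistencyDemo {
  public static void main(String[] args) {
    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
         Session session = cluster.connect("shop")) {

      SimpleStatement fastWrite = new SimpleStatement(
          "INSERT INTO orders (id, status) VALUES (42, 'placed')");
      fastWrite.setConsistencyLevel(ConsistencyLevel.ONE);   // weak, low latency
      session.execute(fastWrite);

      SimpleStatement safeRead = new SimpleStatement(
          "SELECT status FROM orders WHERE id = 42");
      safeRead.setConsistencyLevel(ConsistencyLevel.QUORUM); // stronger guarantee
      System.out.println(session.execute(safeRead).one().getString("status"));
    }
  }
}
```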

10. Piecing It All Together

The image above shows the major components pieced together into a complete big data solution. Image: Wikibon

Bonus Content

Check out additional Information Management galleries. Special thanks to Wikibon for many of the perspectives shared in this slide show. Image: Pixabay