
1. Hadoop and MapReduce

Hadoop is an open source software framework for storing and processing big data across large clusters of commodity hardware. MapReduce is the programming paradigm that allows that processing to scale across hundreds or thousands of servers in a Hadoop cluster. Popular Hadoop offerings include Cloudera, Hortonworks, and MapR. Image: Hadoop
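To make the paradigm concrete, here is the classic word-count job written against the Hadoop Java MapReduce API: the map step emits a (word, 1) pair for every token, and the reduce step sums those counts per word. Input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word count: map emits (word, 1), reduce sums the counts per word.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum)); // emit (word, total)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```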

2. Database/File System

The Hadoop Distributed File System (HDFS) manages the storage and retrieval of the data and metadata required for computation. Other popular file system and database approaches include HBase and Cassandra, two NoSQL databases designed to manage extremely large data sets. Image: iStock
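A minimal sketch of writing and reading a file through the HDFS Java API follows; the NameNode address and paths here are placeholders for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS read/write sketch. Adjust fs.defaultFS for a real cluster.
public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical host
    FileSystem fs = FileSystem.get(conf);

    // Write a small file; HDFS splits large files into blocks and
    // replicates each block across DataNodes.
    Path file = new Path("/user/demo/hello.txt");
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}
```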

3. Pig High-Level Programming

Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. The language abstracts the programming from the Java MapReduce idiom, raising MapReduce programming to a higher level, much as SQL does for relational database management systems. Pig was originally developed at Yahoo Research around 2006. In 2007, it moved into the Apache Software Foundation. Image: Pig/Hadoop
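For a sense of what Pig Latin looks like, here is a minimal sketch of the same word count run from Java through Pig's PigServer API, assuming a local Pig installation; the input and output paths are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Run a small Pig Latin script from Java. Pig compiles these
// statements into one or more MapReduce jobs under the covers.
public class PigWordCount {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.LOCAL); // or MAPREDUCE on a cluster

    pig.registerQuery("lines = LOAD 'input/words.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

    pig.store("counts", "output/wordcounts"); // triggers job execution
  }
}
```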

4. Hive Data Warehousing

Apache Hive is a data warehouse platform built on top of Hadoop that supports querying and managing large datasets residing in distributed storage through a SQL-like language called HiveQL. The language also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL. Image: Edureka.com
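A small sketch of querying Hive from Java over JDBC (via HiveServer2); the host, table, and column names below are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Query Hive through its JDBC interface (HiveServer2).
public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver"); // register the driver

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver:10000/default", "user", "");
         Statement stmt = conn.createStatement()) {

      // HiveQL looks like SQL but is compiled into MapReduce jobs.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits " +
          "FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```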

5. Cascading

Cascading is a Java application development framework for rich data analytics and data management apps running across “a variety of computing environments,” with an emphasis on Hadoop and API-compatible distributions, according to Concurrent, the company behind Cascading. Image: iStock
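Cascading expresses work as pipes connecting source and sink taps. Below is a sketch of the canonical "copy" flow, assuming the Cascading 2.x Hadoop API: it reads text from one HDFS path and writes it to another, with Cascading planning the flow into MapReduce jobs.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

// Simplest possible Cascading app: a "copy" flow from one HDFS
// location to another. Paths come from the command line.
public class CopyApp {
  public static void main(String[] args) {
    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, CopyApp.class);

    Tap inTap = new Hfs(new TextLine(), args[0]);                    // source tap
    Tap outTap = new Hfs(new TextLine(), args[1], SinkMode.REPLACE); // sink tap

    Pipe copyPipe = new Pipe("copy"); // pass records straight through

    FlowDef flowDef = FlowDef.flowDef()
        .addSource(copyPipe, inTap)
        .addTailSink(copyPipe, outTap);

    Flow flow = new HadoopFlowConnector(properties).connect(flowDef);
    flow.complete(); // run the flow
  }
}
```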

6. Big Data Integration Tools

Semi-automated modeling tools such as CR-X allow models to be developed interactively at rapid speed, and the tools can help set up the database that will run the analytics. CR-X is a real-time ETL (Extract, Transform, Load) big data integration tool and transformation engine. Image: iStock
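CR-X itself is a commercial product, so the following is only a generic illustration of the extract-transform-load pattern in plain Java, not CR-X's API: raw rows are read, cleaned, and written out in a load-ready form. Real ETL engines add parallelism, schema handling, and error recovery on top of this basic shape.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;

// A toy extract-transform-load pass (NOT CR-X's API).
public class TinyEtl {
  public static void main(String[] args) throws Exception {
    try (BufferedReader extract = new BufferedReader(new FileReader("raw_events.csv")); // hypothetical input
         PrintWriter load = new PrintWriter("clean_events.csv")) {                      // hypothetical output
      String row;
      while ((row = extract.readLine()) != null) {
        String[] fields = row.split(",");
        if (fields.length < 3) continue;    // transform: drop malformed rows
        String cleaned = String.join(",",
            fields[0].trim(),
            fields[1].trim().toLowerCase(), // transform: normalize case
            fields[2].trim());
        load.println(cleaned);              // load: write staging record
      }
    }
  }
}
```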

7. Analytic Databases

Specialized scale-out analytic databases such as Pivotal Greenplum or IBM Netezza offer very fast loading and reloading of data for the analytic models. Image: iStock
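As a rough illustration of programmatic loading, here is a batched JDBC insert sketch. The connection URL (Greenplum speaks the PostgreSQL wire protocol), table, and credentials are assumptions; production-scale loads would normally use the databases' native bulk utilities (such as COPY, gpload, or nzload) rather than row-at-a-time JDBC.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Batched JDBC insert into an analytic database (hypothetical host/table).
public class BatchLoad {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://greenplum-master:5432/analytics", "etl", "secret");
         PreparedStatement ps = conn.prepareStatement(
             "INSERT INTO fact_sales (sku, qty) VALUES (?, ?)")) {
      conn.setAutoCommit(false);
      for (int i = 0; i < 10_000; i++) {
        ps.setString(1, "sku-" + i);
        ps.setInt(2, i % 10);
        ps.addBatch();
        if (i % 1_000 == 0) ps.executeBatch(); // flush in chunks
      }
      ps.executeBatch();
      conn.commit();
    }
  }
}
```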

8. Customer Satisfaction Considerations

Big data analytical packages from ISVs (such as ClickFox) run against the database to address business issues such as customer satisfaction. Image: Pixabay

9. Transactional Approaches

Transactional big data projects can't use Hadoop, since it is a batch-oriented rather than real-time system. For transactional systems that do not require a database with ACID (Atomicity, Consistency, Isolation, Durability) guarantees, NoSQL databases can be used, though their consistency guarantees can be weak. Scale-out SQL databases, a new breed of offering, are also worth watching in this area; new entrants are emerging all the time. Image: Pixabay
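To illustrate the consistency trade-off, here is a sketch using Cassandra through the DataStax Java driver (3.x API): a weakly consistent write at level ONE versus a stronger QUORUM read. The contact point, keyspace, and table are hypothetical.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

// Tunable consistency in Cassandra: ONE is fast but weakly consistent;
// QUORUM trades latency for a stronger guarantee.
public class ConsistencyDemo {
  public static void main(String[] args) {
    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
         Session session = cluster.connect("shop")) {

      SimpleStatement fastWrite = new SimpleStatement(
          "INSERT INTO orders (id, status) VALUES (42, 'placed')");
      fastWrite.setConsistencyLevel(ConsistencyLevel.ONE);   // weak, low latency
      session.execute(fastWrite);

      SimpleStatement safeRead = new SimpleStatement(
          "SELECT status FROM orders WHERE id = 42");
      safeRead.setConsistencyLevel(ConsistencyLevel.QUORUM); // stronger guarantee
      System.out.println(session.execute(safeRead).one().getString("status"));
    }
  }
}
```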

10. Piecing It All Together

The image above shows the major components pieced together into a complete big data solution. Image: Wikibon

Bonus Content

Check out additional Information Management galleries. Special thanks to Wikibon for many of the perspectives shared in this slide show. Image: Pixabay