by Abay Radhakrishnan
Big data solutions are typically associated with using the Apache Hadoop framework and supporting tools in both on-premises and cloud infrastructures. This article aims to create awareness of the holistic role that Amazon Web Services (AWS) plays in big data processing and to provide a high-level reference architecture on how AWS services can come together to create Big data solutions. It will also highlight the advantages of complimentary AWS services that enhance the overall Hadoop experience.
By using AWS, many enterprises from diverse industries have successfully implemented big data solutions.
So, what’s the motivation for employing AWS big data services?
On-demand service, scalability, elasticity, pay-per-use (for CapEx reduction), server-less technologies, security, the overall cost of processing and the ability to support the volume and velocity to infinite (almost) capacity are what attract many companies to adapt or try out the big data solutions on this platform.
Additionally, there are other analytics services like Amazon QuickSight, CloudSearch, Machine Learning, Rekognition (deep learning-based image recognition) AWS IoT (Internet of Things) that can be leveraged in the solutions.
Amazon EMR supports different versions of the Apache Hadoop framework, plus a few versions of the extended Hadoop framework offered by the MapR framework and Cloudera CDH for big data processing. Amazon EMR also runs and scales popular distributed frameworks such as Apache Spark, HBase, Presto, Flink and other open source projects.
There are a few perspectives regarding the definition of EMR besides the popular notion that “Amazon EMR is Hadoop in the Cloud.”
Amazon EMR is:
- An integrated solution with both infrastructure and software to implement big data use cases.
- A “managed service” with the set of tools that makes Hadoop easily implementable including different versions of Hadoop and MapR.
- A web service that makes it easy to quickly and cost effectively process vast amounts of data in a distributed way.
- Integration of over 30 open source projects to support the Big data needs of storage, cluster management(Yarn), processing - Batch(MapReduce), Interactive(Tez), In memory(Spark), Streaming(Flink) and application frameworks (Hive, Pig, Spark SQL/ML,...)
When creating and launching a cluster with Amazon EMR, Amazon Elastic Compute Cloud (EC2) instances will launch under Master, Core and Task instance groups that are preloaded from the Amazon Machines Image (AMI) with Hadoop and other big data applications. There will always be one Master EC2 instance per cluster that assigns Hadoop tasks to the Core and Task nodes and monitors their status. Core instance groups have one or more EC2 instances that execute Hadoop map and reduce tasks and store data in Hadoop Distributed File System (HDFS). Task instance groups are optional and can contain one or more EC2 instances with no local storage that can handle map, reduce or shuffle tasks and can be scaled as required. Spot Instances are often used for this function.
The pictures below illustrate what an Amazon Hadoop distribution may look like. The standard Apache Hadoop framework will be installed with Hive, Pig and Hue. Additional applications already integrated with Amazon EMR such as Spark, Impala, HBase, and Ganglia can be installed by selecting them from the drop down. The environment is also compatible with any Open-source projects that run on top of the Hadoop architecture.
An enterprise implementing a big data solution should consider the following attributes: cost, the flexibility of deployment, ease of use and operations, data center location, performance, and security. Let’s take a look at the advantages and the tooling associated with AWS big data implementations.
Compute and processing advantages
- On-Demand virtual instances and instance types to meet the CPU, memory and disk I/O needs
- High-availability clusters
- Ease of launching Hadoop, MapR and big data framework clusters
- Support for open-source frameworks
- Dynamic creation of clusters for processing jobs
- Low end-to-end latency
- Highly scalable and elastic
- The EMR File System (EMRFS) and the Hadoop Distributed File System (HDFS) are both installed on the EMR cluster. HDFS stores data on core nodes of EMR cluster, while EMRFS allows clusters to store data on S3. For the Apache Hadoop framework, EMRFS is the suitable option for batch workloads and transient storage data on S3; for interactive workloads and long running clusters, HDFS is suitable for storing data
Amazon S3, DynamoDB, RDS Aurora, Kinesis (Real-time streaming) are the suggested storage solution for many frameworks like Hive, Pig, Spark in the AWS big data platform offering scalability and availability.
- The S3 object storage service provides unlimited storage, offers permanent storage for data, automated storage replication within and across regions, supports versioning and is engineered for high durability, availability and scalability.
- Backup/Disaster Recovery (DR) in a traditional on-premises big data environment can be expensive with active-active replication with all systems and components configured and waiting. By comparison, AWS supports cost effective options such as Pilot Light, Warm standby and Backup & Restore in addition to the above mentioned multi-site active-active cluster option. More details can be found in the AWS white paper “Using Amazon Web Services for Disaster Recovery.”
- AWS DynamoDB is a managed NoSQL data service providing unlimited database, table sizes, unlimited concurrency, 3 Availability Zones (AZ) replication, no throughput limits, and supporting read-write-update intensive I/O operations and stream APIs.
- Amazon Aurora is a managed Relational DB service which is MySQL compatible, can support thousands of concurrent users, 3 AZ replication,15 read replicas and read-write-update intensive I/O operations.
- Amazon Kinesis is used to solve a variety of streaming data problems. A common use is the real-time aggregation of data followed by loading the aggregate data into a data warehouse or map-reduce cluster.
- With AWS Virtual Private Cloud (VPC), solutions can be hosted in a private network within AWS. VPC supports a variety of connectivity options between on-premises and AWS for data transfer.
- Amazon provides high-speed virtual machines that are proven to have low latency and high bandwidth between the same AZ, across AZs and between regions.
EMR Advantages and Unique Features
- Managed Hadoop service with version integrity between Hadoop, Spark, Hive, Pig and Impala
- Pay-per-use and On-demand managed service that takes care of node provisioning, cluster setup, Hadoop configuration, versioning and cluster tuning in a matter of few minutes
- Ability to configure the Hadoop environment with XML files or by changing the key value pairs and configuration values
- On-demand EC2 instances for core nodes and spot instances for task nodes to reduce the processing costs
- EMRFS provides better performance and error handling, faster listing and consistency of S3 objects including client/server-side encryptions
- Bootstrapping: Install additional software to change the configuration of applications on the cluster and to change the Hadoop settings
- Cluster resizing by adding nodes to the core group or to add or remove nodes from task groups
- Real-time streaming data processing
- AWS EMR versions get released almost on a 6 to 8 week schedule to keep up with the advancements and enhancements.
Simple Reference Architecture
This diagram depicts a simple use case of log file analysis in which the data text files are available in an S3 bucket as input and processed by an Amazon EMR cluster, with the processed data being written to another S3 bucket. Subsequently, the results can be stored in Redshift for supporting analytics dashboards and operational reports.
Big data Reference architecture with different services
This diagram represents a number of workflows and specific combinations that can come together as a reference architecture for implementing Big data solutions. Typically, the solution can be divided into four stages--collect, store, process and analyze--with different AWS services and flows joining them together.
The source of data collection can be files (logs, media), streams (device, social media) or transactional data (DB I/O). The recommended storage choice is S3 for files, Kinesis for streams and RDS Aurora and DynamoDB for transactional data. Amazon EMR is the main processing engine.
Event driven programming-processing can be accomplished by using the framework’s Kinesis applications (KCL enabled apps), Spark streaming, Storm and AWS Lambda. Kinesis can get stream inputs from the Kinesis producer on Amazon EC2 or from other sources using APIs. The shards will be subsequently processed by the Kinesis-enabled applications or by the EMR clusters (Hive/Pig/cascading Spark streaming) and then the data will be directed to DynamoDB.
Athena is interactive query service, fast, cost effective and architected to easily analyze petabytes of data in S3 with no data warehouses or clusters to manage.
Amazon QuickSight is a fast, cloud-powered business analytics service that integrates with RDS Aurora, DynamoDB, Redshift and few other data sources to easily build visualizations, perform ad-hoc analysis, and quickly get business insights.
ElasticSearch is a fully managed service that delivers scalable and easy to use Elasticsearch service for log analytics, full-text search, real-time processing and application monitoring, The service offers actionable insights with built-in integrations with Kibana, Logstash, and AWS services including Amazon Kinesis Firehose, AWS Lambda, and Amazon CloudWatch.
AWS Lambda provides stateless request driven execution of code in response to events associated with the services such as S3, DynamoDB and Kinesis stream. Lambda functions can also be invoked on demand and all the compute capacity and resources needed is automatically managed by AWS by spinning the necessary infrastructure, managing the code and execution.
Amazon Redshift is a fast, fully managed, cost-effective petabyte-scale data warehouse service to efficiently analyze data using existing business intelligence tools. Redshift is well integrated with S3, DynamoDB, EMR, Kinesis and SSH-enabled host. Different analytics tools can interact with Redshift and applications such as Hive and Impala to produce visuals like reports and/or dashboards.
AWS Data Pipeline is a web service that helps customers reliably process and move data between different AWS compute and storage services (including Amazon EMR) as well as on-premises data sources at specified intervals.
Here is a simple illustration of a selected workflow from the reference diagram above in which a log stream from a web application is pushed into Amazon Kinesis shard, the standard Hive-Pig-Cascading MapReduce cluster in EMR pulls the data into Hive for processing, optionally stores the output and allows an analytics tool to display the data.
Simple Reference Architecture with AWS Lambda
AWS Lambda can be a powerful Big data processing option managing the processing triggered by the events in S3, DynamoDB or Amazon Kinesis or invoked on demand by an application. The following diagram depicts a simple use case of log file analysis in which the data text files are available in an S3 bucket as input and processed by Lambda functions with the processed data being written to another S3 bucket. Subsequently, the results can be stored in Redshift for supporting analytics dashboards and operational reports.
Functions and Tools:
This table documents the list of technologies by functions that play a role in the Big data solutions. They are listed using the solution pattern collect, store, process and analyze. Standard AWS services in compute, storage and networking form the IaaS (Infrastructure as a Service) layer.
|Collect||Dedicated Low latency bandwidth||AWS Direct Connect|
|Collect||Import/Export||Physical Media shipping|
|Collect||Scalable stream processing – Events||Amazon Kinesis|
|Collect||Sync local storage to the Cloud||Amazon storage Gateway|
|Collect||Client/Sensor||High-Level programming like Java, C++, PHP|
|Collect||Large amounts of Import/export||AWS Snowball|
|Process||Event processing||AWS Lambda, KCL Apps, SNS|
|Process||Data processing||Amazon EMR|
|Process||Continuous Processor||Storm, Hive, Spark|
|Process||Processing and Query||Hive, Impala, Presto|
|Process||Extract||Scoop, MapReduce API(MR)|
|Process||Transform & Load||Pig, MR, Cascading, Spark|
|Analyze||Ad hoc, Interactive query||Athena|
|Analyze||Predictive analysis||Machine Learning|
|Analyze||Report Generation||Hive, Spark, Cascading, Pig|
|Analyze||Adhoc analysis||Hive, Spark-SQL, Presto, Lingual|
|Analyze||Analytics and Reporting||Phoenix, Mahout|
|Analyze||Business insights, Visualization||Amazon QuickSight|
The AWS big data platform can support many types of use cases. With managed services like EMR, DynamoDB, RDS Aurora and Redshift, the platform provides uniform services that are fast, easy to implement, and can be combined with other AWS services to offer optimal application experiences. Any enterprise planning for a new big data solution, or looking to migrate to a highly available and scalable platform, should consider AWS and its supporting services as a possible option.
EMR Quick Reference Links: http://s3.amazonaws.com/awsdocs/ElasticMapReduce/latest/emr-qrc.pdf
Use cases: http://aws.amazon.com/big-data/use-cases/
Hadoop projects and technologies: https://hadoop.apache.org/