by: Abay Radhakrishnan
In an earlier blog titled “Big data Viability and technologies on Amazon Cloud AWS” we discussed the different services in the AWS platform and their viability for implementing Big Data solutions. The AWS Big Data platform continues to get updated with new services and powerful features, and it is gaining popularity in enterprises across many types of use cases.
One of the top trends in IT is “real time streaming data analysis.” This blog gets into the details of the services and standard architecture patterns revolving around the collect, store, process and analyze functions in real-time streaming data to get timely insights to address a variety of use cases.
Popular real-time data processing use cases that are applicable to most enterprises include the following:
- Website click streams
- Application logs and event data from servers and desktops
- Mobile device data captures
- Social media platform feeds
- Web and mobile click stream data processing
- Sensor data coming in from devices, video and audio feeds
- Security compliance needs in real-time
- Logs consolidation and analysis in near-real time
- Feeds from IoT sensors monitoring devices
- IoT Telemetry data
- Temperature and weather changes
- Driving sensors such as GPS coordinates from vehicles and in general
- Any event-driven and event-source solutions
Some examples of industry-specific solutions include financial data, stock transactions, online fraud detection, medical records and patient management.
The Amazon Kinesis suite of services is the central piece of the architecture. It addresses real-time data ingestion needs and integrates with managed services like Amazon EMR, Amazon DynamoDB, Amazon RDS Aurora, Amazon Redshift, AWS IoT and other AWS services (Amazon EC2, Amazon SNS, Amazon SQS) to offer optimal application experiences for analytics and machine learning. The suite includes Amazon Kinesis Data Streams (Kinesis streams/KDS), Amazon Kinesis Data Firehose (Firehose), Amazon Kinesis Data Analytics (Kinesis data analytics) and Amazon Kinesis Video Streams.
Now let’s review the technologies and patterns in the order of collection, processing, storage and analytics.
Amazon Kinesis Data Streams (KDS), often referred to as Amazon Kinesis or Amazon Kinesis streams, can continuously capture gigabytes of data per second from sources such as mobile clients, website click streams, social media feeds, logs (application, network and other types) and events of all kinds. Many real-time solutions aspire to millisecond response times, and KDS supports this by making collected data available to consumers just as quickly. A Kinesis stream is an ordered ledger: an ordered sequence of data records meant to be written to and read from in real time, providing guaranteed ordering within each shard, durability, high availability, high throughput and robust integrations with the key services.
Typically, a Kinesis stream is treated as a buffer that is processed by a consumer such as Lambda, a custom application running on Amazon EC2 or a Kinesis Data Firehose delivery stream. The Lambda function or consumer application may be down at times because of programming issues or a deployment in progress; with Kinesis, messages can be stored and delivered for up to seven days (default: 24 hours).
One common pattern is raw data entering a Kinesis stream and being cleaned, filtered or processed by an AWS Lambda function or an Amazon Elastic Container Service (Amazon ECS) task, which then writes the result into a second, more meaningful Kinesis data stream.
Another best practice, for retention, auditing and compliance purposes, is to archive stream data in Amazon S3 (S3), either through Firehose or with a Lambda function that stores the data in S3 as needed.
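To make the buffering behavior concrete, here is a minimal in-memory sketch of a consumer that goes down and later resumes from its last checkpoint. It is a toy model and not the Kinesis API: the `RetainedStream` class, its methods and the timestamps are all hypothetical, and only the 24-hour default retention comes from the text above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

RETENTION_SECONDS = 24 * 60 * 60  # default retention mentioned above (24 hours)


@dataclass
class RetainedStream:
    """Toy in-memory stand-in for a single shard; not the Kinesis API."""
    retention_seconds: int = RETENTION_SECONDS
    _records: List[Tuple[int, float, str]] = field(default_factory=list)
    _next_seq: int = 0

    def put(self, data: str, now: float) -> int:
        """Append a record and return its sequence number."""
        seq = self._next_seq
        self._records.append((seq, now, data))
        self._next_seq += 1
        return seq

    def read_after(self, checkpoint: int, now: float) -> List[Tuple[int, str]]:
        """Return unexpired records with a sequence number above the checkpoint."""
        oldest_allowed = now - self.retention_seconds
        return [(seq, data) for seq, ts, data in self._records
                if seq > checkpoint and ts >= oldest_allowed]


stream = RetainedStream()
stream.put("event-a", now=0)       # processed, consumer checkpoints at seq 0
stream.put("event-b", now=3_600)   # arrives while the consumer is down
stream.put("event-c", now=7_200)   # arrives while the consumer is down

# The consumer recovers two hours in and resumes from its checkpoint.
replayed = stream.read_after(checkpoint=0, now=7_200)
# Once the retention window has passed, the old records are no longer readable.
expired = stream.read_after(checkpoint=0, now=7_200 + RETENTION_SECONDS + 1)
print(replayed, expired)
```

The point of the sketch is that a short outage costs nothing as long as the consumer comes back within the retention window; past that window, unprocessed records are lost.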
A Kinesis stream is composed of one or more “shards,” and adding shards to the stream increases throughput, because the data capacity of a stream is a function of the number of shards specified. Each shard supports up to 5 read transactions per second with a total read rate of 2 MB per second, and up to 1,000 records per second for writes with a total write rate of 1 MB per second. Resharding enables users to adjust the number of shards in a stream to adapt to changes in the rate of data flow through the stream. When you reshard, data records that were flowing to the parent shards are rerouted to the child shards based on the hash key values that the data record partition keys map to. After a resharding operation is called, there is a waiting period before the stream becomes active again. Depending on the types of data, you can also use separate streams within the application.
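As a rough illustration of shard sizing and of how partition keys map to shards, consider the sketch below. It assumes the per-shard limits of 1 MB per second and 1,000 records per second for writes and 2 MB per second for reads, and it assumes partition keys are hashed with MD5 into a 128-bit hash key space that is divided evenly among shards. The function names are hypothetical, and real streams may have uneven hash ranges after resharding.

```python
import hashlib
import math

# Per-shard limits (assumed from the published per-shard figures).
WRITE_MB_PER_SEC = 1
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2
MAX_HASH_KEY = 2**128 - 1  # partition keys hash into a 128-bit space


def shards_needed(write_mb_s: float, write_records_s: float, read_mb_s: float) -> int:
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(write_mb_s / WRITE_MB_PER_SEC),
        math.ceil(write_records_s / WRITE_RECORDS_PER_SEC),
        math.ceil(read_mb_s / READ_MB_PER_SEC),
    )


def hash_key(partition_key: str) -> int:
    """128-bit integer MD5 hash of a partition key."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)


def even_ranges(shard_count: int):
    """Split the hash key space into contiguous, equally sized ranges."""
    size = (MAX_HASH_KEY + 1) // shard_count
    return [(i * size, MAX_HASH_KEY if i == shard_count - 1 else (i + 1) * size - 1)
            for i in range(shard_count)]


def shard_for(partition_key: str, ranges) -> int:
    """Index of the shard whose hash key range contains the key's hash."""
    h = hash_key(partition_key)
    return next(i for i, (lo, hi) in enumerate(ranges) if lo <= h <= hi)


# 4.5 MB/s in, 3,000 records/s in, 9 MB/s out: the write and read rates both need 5 shards.
print(shards_needed(write_mb_s=4.5, write_records_s=3000, read_mb_s=9))

# Doubling from 2 to 4 shards: each key moves to one of its parent's two children.
parents, children = even_ranges(2), even_ranges(4)
print(shard_for("user-42", parents), shard_for("user-42", children))
```

Because routing depends only on the hash of the partition key, records with the same key always land in the same shard, which is what preserves their relative order.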
An Amazon Kinesis Data Streams producer is any application that puts user data records into a Kinesis data stream (also called data ingestion). The following are the different ways through which the data gets recorded on the streams.
The Amazon Kinesis Producer Library (KPL) simplifies producer application development, allowing developers to achieve high write throughput to a Kinesis data stream. The KPL helps build high-performance producers by handling multi-threading, batching and retry logic, with de-aggregation of batched records on the consumer side. The KPL core is a C++ module and can be compiled to work on any platform with a recent C++ compiler.
The Kinesis Data Streams API in the AWS SDKs lets users add multiple records with PutRecords (up to 500 records per call) and a single record with PutRecord. The KPL is asynchronous by design and can introduce additional buffering delay, so applications that cannot tolerate this delay may need to use the AWS SDK directly.
The Kinesis agent can be installed on application, web and database servers, and is configured by specifying the files to monitor and the streams that should receive the data. AWS IoT also integrates directly with Amazon Kinesis.
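The 500-record limit on PutRecords means a producer using the SDK directly has to chunk its records. Below is a minimal sketch of that chunking. It assumes a client object that behaves like a boto3 Kinesis client (`put_records` with `StreamName` and `Records`, returning `FailedRecordCount`). `FakeKinesisClient`, the stream name and the sample events are hypothetical stand-ins so the sketch runs without AWS credentials.

```python
from typing import Dict, Iterable, Iterator, List

MAX_RECORDS_PER_CALL = 500  # PutRecords limit mentioned above


def batches(records: Iterable[Dict], size: int = MAX_RECORDS_PER_CALL) -> Iterator[List[Dict]]:
    """Yield consecutive lists of at most `size` records."""
    batch: List[Dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def put_all(client, stream_name: str, records: Iterable[Dict]) -> int:
    """Send records in PutRecords-sized batches; return how many were rejected.

    Rejected records are only counted here; a real producer would retry them.
    """
    failed = 0
    for batch in batches(records):
        response = client.put_records(StreamName=stream_name, Records=batch)
        failed += response.get("FailedRecordCount", 0)
    return failed


class FakeKinesisClient:
    """Hypothetical stand-in that records the size of each call."""

    def __init__(self):
        self.calls = []

    def put_records(self, StreamName, Records):
        self.calls.append(len(Records))
        return {"FailedRecordCount": 0, "Records": []}


events = [{"Data": f"click-{i}".encode(), "PartitionKey": f"user-{i % 10}"}
          for i in range(1200)]
fake = FakeKinesisClient()
rejected = put_all(fake, "example-stream", events)
print(fake.calls, rejected)
```

With a real client, the same `put_all` function could be passed `boto3.client("kinesis")`. Note that a PutRecords call can partially succeed, which is why the response's failed-record count matters.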
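For the agent, "specifying the files to monitor and the streams for the data" comes down to a small JSON configuration. The sketch below builds one in Python, assuming the agent's `flows` format with `filePattern`, `kinesisStream` and `deliveryStream` keys. The file paths and stream names are placeholders, and the agent documentation should be checked for the options your version supports.

```python
import json

agent_config = {
    "flows": [
        # Tail application logs into a Kinesis data stream (placeholder names).
        {"filePattern": "/var/log/example-app/*.log",
         "kinesisStream": "example-stream"},
        # Send web server access logs to a Firehose delivery stream.
        {"filePattern": "/var/log/nginx/access.log*",
         "deliveryStream": "example-delivery-stream"},
    ]
}

agent_json = json.dumps(agent_config, indent=2)
print(agent_json)  # contents for the agent's configuration file
```

Each flow pairs one file pattern with one destination, so separate log types can be routed to separate streams.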
The source of data collection can be files (logs, media), streams (device, social media) or transactional data (DB I/O). The recommended storage choice is Kinesis for streams, S3 for files, and RDS Aurora and DynamoDB for transactional data. DynamoDB is also leveraged for storing real-time data.
Amazon S3 is used for storing raw stream data, processed streaming data and other content (audio, video, files). Glacier is leveraged through S3 lifecycle policies for retention, archival, audit and compliance purposes.
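A lifecycle policy of this kind can be expressed as a small rule document. The sketch below builds one in the shape that boto3's `put_bucket_lifecycle_configuration` accepts. The prefix, day counts and bucket name are hypothetical examples, not recommendations.

```python
def glacier_lifecycle(prefix: str, archive_after_days: int, expire_after_days: int) -> dict:
    """Build a lifecycle configuration that archives to Glacier, then expires."""
    if expire_after_days <= archive_after_days:
        raise ValueError("objects must expire after they are archived")
    return {
        "Rules": [{
            "ID": f"archive-{prefix.strip('/')}",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": expire_after_days},
        }]
    }


policy = glacier_lifecycle("raw-streams/", archive_after_days=90, expire_after_days=2555)
print(policy)

# With boto3 and credentials in place, this could be applied roughly as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=policy)
```

Keeping the rule scoped to a prefix lets raw stream archives age into Glacier while processed data under other prefixes stays in standard storage.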
Long-term storage and historic data analysis are facilitated with the help of Kinesis Firehose.
AWS Lambda plays a key role in processing Kinesis streams. Lambda can be triggered directly by Kinesis streams, among other services, allowing developers to build a variety of real-time data processing systems. AWS Lambda is a popular option for handling the raw messages in the stream, transforming them into more useful messages, persisting the messages to S3, and updating the transactions in DynamoDB or RDS Aurora. The same function can also update the analytics application for visualization purposes.
Amazon EMR clusters can read and process Amazon Kinesis streams directly using familiar tools in the Hadoop ecosystem such as Hive, Pig, MapReduce, the Hadoop Streaming API and Cascading. Though Hadoop frameworks on EMR traditionally address big data processing as batch processing, there are Hadoop ecosystem frameworks like Spark Streaming and Apache Storm that enable developers to build applications for continuous stream processing. For example, a solution that analyzes click stream data to identify effective categories of advertisements can join real-time data from Kinesis with existing data in DynamoDB and HDFS in a running cluster. For post-processing, data can be loaded into S3.
The Kinesis Client Library (KCL) acts as an intermediary between your record processing logic and streams. When a KCL application is started, it calls the KCL to instantiate a worker with configuration information such as the stream name and AWS credentials. The KCL is compiled into the application to enable fault-tolerant consumption of data from the stream.
Applications using the KCL or SDK APIs can leverage Spot Instances to save money, because Kinesis Data Streams retains the data and an interrupted consumer can resume processing once it is running again. Event-driven processing can be accomplished with Kinesis applications (KCL-enabled apps), Spark Streaming, Storm and AWS Lambda. Kinesis can get stream inputs from a Kinesis producer on Amazon EC2 or from other sources using APIs.
A consumer is an application that processes data from a Kinesis data stream. When a consumer uses enhanced fan-out, each consumer registered to use enhanced fan-out receives its own 2 MiB/sec of read throughput per shard, independent of other consumers.
Kinesis Firehose prepares and loads the data by capturing, transforming and loading streaming data into S3, Redshift and Amazon Elasticsearch Service, enabling near-real-time analytics with existing business intelligence tools and dashboards. Firehose automatically batches and compresses data before loading it into Amazon Redshift and makes the data available for processing and analytics.
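A Lambda function triggered by a Kinesis stream receives a batch of records whose payloads are base64-encoded under `Records[].kinesis.data`. The handler below is a minimal sketch of the clean-and-transform step described above. The message fields (`type`, `user`, `amount`) and the filter rule are hypothetical, and the write to S3 or DynamoDB is left as a comment.

```python
import base64
import json


def handler(event, context):
    """Decode Kinesis records, keep purchases, and return cleaned messages."""
    cleaned = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)
        if message.get("type") != "purchase":  # hypothetical filter rule
            continue
        cleaned.append({
            "user": message["user"],
            "amount_cents": round(float(message["amount"]) * 100),
            "sequence": record["kinesis"]["sequenceNumber"],
        })
    # A real function would write `cleaned` to S3, DynamoDB or another stream here.
    return cleaned


def make_record(seq: str, body: dict) -> dict:
    """Build a record shaped like the ones Lambda receives from Kinesis."""
    data = base64.b64encode(json.dumps(body).encode("utf-8")).decode("ascii")
    return {"kinesis": {"data": data, "sequenceNumber": seq, "partitionKey": body["user"]}}


sample_event = {"Records": [
    make_record("1", {"type": "purchase", "user": "u1", "amount": "12.50"}),
    make_record("2", {"type": "page_view", "user": "u2"}),
    make_record("3", {"type": "purchase", "user": "u3", "amount": 3}),
]}
result = handler(sample_event, None)
print(result)
```

Building a sample event locally like this is a convenient way to exercise the transformation logic before wiring the function to a real stream.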
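The advertisement example boils down to enriching each click with reference data and aggregating by category. The sketch below shows that join in plain Python as an illustration of the idea only. On EMR the same logic would typically be expressed in Hive, Spark Streaming or a similar framework, and the ad IDs and categories here are made up.

```python
from collections import Counter

# Stand-in for reference data that would live in DynamoDB or HDFS.
ad_categories = {"ad-1": "sports", "ad-2": "travel", "ad-3": "sports"}


def clicks_by_category(clicks, categories):
    """Join each click with its ad category and count clicks per category."""
    counts = Counter()
    for click in clicks:
        category = categories.get(click["ad_id"], "unknown")
        counts[category] += 1
    return counts


# Stand-in for click records read from the stream.
clicks = [{"ad_id": "ad-1"}, {"ad_id": "ad-3"}, {"ad_id": "ad-2"},
          {"ad_id": "ad-9"}, {"ad_id": "ad-1"}]
ranking = clicks_by_category(clicks, ad_categories).most_common()
print(ranking)
```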
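The effect of enhanced fan-out is easiest to see with a little arithmetic. The sketch below assumes that standard consumers share the 2 MB per second read limit of each shard, while each enhanced fan-out consumer gets its own 2 MiB per second per shard. It treats MB and MiB as the same unit for a rough estimate, and the function name is hypothetical.

```python
SHARED_READ_PER_SHARD = 2.0    # shared by all standard consumers of a shard
FAN_OUT_READ_PER_SHARD = 2.0   # dedicated to each enhanced fan-out consumer


def per_consumer_read(shards: int, consumers: int, enhanced_fan_out: bool) -> float:
    """Approximate read throughput available to each consumer, in MB/s."""
    if shards < 1 or consumers < 1:
        raise ValueError("shards and consumers must be positive")
    if enhanced_fan_out:
        per_shard = FAN_OUT_READ_PER_SHARD
    else:
        per_shard = SHARED_READ_PER_SHARD / consumers
    return per_shard * shards


# Four consumers reading a four-shard stream.
standard = per_consumer_read(shards=4, consumers=4, enhanced_fan_out=False)
fan_out = per_consumer_read(shards=4, consumers=4, enhanced_fan_out=True)
print(standard, fan_out)
```

In this example each standard consumer is limited to roughly a quarter of what a fan-out consumer can read, which is why fan-out matters once several applications read the same stream.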
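The batching and compression Firehose performs can be pictured as a buffer that flushes when it reaches a size or age threshold, whichever comes first. The class below is a toy model of that idea and not the Firehose API. The thresholds are deliberately tiny illustrative values, and for simplicity the age check only happens when a new record arrives.

```python
import gzip
from typing import List, Optional


class BufferingDelivery:
    """Toy model of size-or-time buffering with gzip compression."""

    def __init__(self, max_bytes: int, max_seconds: float):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self._buffer: List[bytes] = []
        self._size = 0
        self._opened_at: Optional[float] = None
        self.delivered: List[bytes] = []  # stands in for objects written to S3

    def add(self, record: bytes, now: float) -> None:
        """Buffer a record and flush if either threshold has been reached."""
        if self._opened_at is None:
            self._opened_at = now
        self._buffer.append(record)
        self._size += len(record)
        if self._size >= self.max_bytes or now - self._opened_at >= self.max_seconds:
            self.flush()

    def flush(self) -> None:
        """Compress the buffered records into one object and reset the buffer."""
        if not self._buffer:
            return
        self.delivered.append(gzip.compress(b"\n".join(self._buffer)))
        self._buffer, self._size, self._opened_at = [], 0, None


delivery = BufferingDelivery(max_bytes=30, max_seconds=60)
for second in range(3):
    delivery.add(b"0123456789", now=second)  # third record reaches 30 bytes: flush
delivery.add(b"late", now=10)
delivery.add(b"later", now=75)               # buffer is 65 seconds old: flush
print(len(delivery.delivered))
```

Larger thresholds produce fewer, bigger objects, which generally suits downstream loads into Redshift, at the cost of data arriving a little later.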
Redshift can be used for deeper analytical queries and joins across data for business analytics, reporting, and deriving long term insights.
Amazon Kinesis Data Analytics integrates natively with both Kinesis Streams and Firehose to run continuous SQL queries against streaming data while filtering, transforming and summarizing the data in real time. The set of records processed by a given query can be controlled with its windowed query feature. A popular pattern is historical and exploratory analysis using Amazon Kinesis Data Analytics, Amazon EMR and Amazon Redshift to analyze a data lake on Amazon S3 that is fed by a Kinesis Data Firehose delivery stream. After-the-fact analysis, time-series analysis and machine learning analysis can be done as well.
Amazon Elasticsearch Service (Amazon ES) is ideal for querying and searching large amounts of data, and can analyze data stream updates from other AWS services such as Kinesis streams and DynamoDB. This provides customers with a rich search and navigation experience.
Amazon QuickSight is a fast, cloud-powered business analytics service that integrates with RDS Aurora, DynamoDB, Redshift and a few other data sources to make it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights.
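Windows are what turn an unbounded stream into finite sets of records a query can summarize. Kinesis Data Analytics expresses this in SQL; the Python sketch below only illustrates the simplest case, a tumbling (fixed, non-overlapping) window that counts events per key. The event tuples and the 60-second window are hypothetical.

```python
from collections import defaultdict


def tumbling_counts(events, window_seconds):
    """Count events per (window start, key) for fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# (timestamp in seconds, page) pairs standing in for click stream records.
events = [(1, "home"), (15, "cart"), (59, "home"),
          (61, "home"), (119, "cart"), (120, "home")]
result = tumbling_counts(events, window_seconds=60)
for (window_start, page), count in sorted(result.items()):
    print(window_start, page, count)
```

Each event belongs to exactly one window here; sliding windows differ in that a record can contribute to several overlapping windows.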
This diagram provides a brief overview of the integration between the services:
Real-time streaming solutions can be completely serverless with the combination of tools available, such as Kinesis Streams, Lambda, S3, AWS IoT, Kinesis Firehose, Redshift, and Kinesis Data Analytics. Several enterprises have implemented serverless solutions, and EC2 or managed services like Amazon EMR are also popular for implementing real-time solutions with the Kinesis services.
As you can see, AWS Cloud is positioned well to meet the challenges of real-time streaming analytics with economical tooling as well as the ability to manage the volume, velocity, and variety of data/formats, providing a highly available, durable and scalable platform. Serverless should be a consideration while architecting your solutions as data processing needs continue to amplify. Kinesis makes everything much more manageable in comparison to handling compute instances, and can be quite cost-effective.
Kinesis services: https://aws.amazon.com/kinesis/
Use cases: http://aws.amazon.com/big-data/use-cases/