Getting started with AWS serverless architecture: Tutorial on Kinesis and DynamoDB

In this post, we will discuss serverless architecture and give simple examples of getting started with serverless tools, namely using Kinesis and DynamoDB to process Twitter data.

Many companies are now shifting towards a serverless model. There are currently three main contenders in this space: Amazon Web Services (AWS), Google Cloud Platform (GCP) and Microsoft Azure. In this introductory post, we focus on setting up serverless architecture on AWS.

Introduction to serverless architecture

There have been many paradigms for collecting, storing and processing large amounts of data effectively.

Before cloud computing, organizations had to build out their own computing infrastructure and maintain servers in house. The rise of cloud infrastructure brought a paradigm shift: companies could suddenly spin up hundreds of EC2 instances on AWS and instantly have a place to set up their infrastructure.

Open source tools like Hadoop, Cassandra, HBase, Hive and Storm took off. These tools were well suited to the model of spinning up N servers and setting up a back-end platform on them.

To minimize the cost of having a team manage a collection of servers to store and process data, large tech companies like Amazon and Google have developed managed tools that take the place of tools requiring one to set up and maintain individual servers. While these tools still generally run on multiple servers, this fact is abstracted away from the user.

Kinesis and DynamoDB

Intro to Kinesis Streams

Amazon Kinesis is a tool for working with data in streams. It has a few components, namely Kinesis Firehose, Kinesis Analytics and Kinesis Streams; we will focus on creating and using a Kinesis Stream. Kinesis Streams shares the standard concepts of other queueing and pub/sub systems.

The bullet points below cover the main concepts of Kinesis:

- A stream: A queue for incoming data to reside in. Streams are labeled by a string. For example, Amazon might have an “Orders” stream, a “Customer-Review” stream, and so on.
- A shard: A stream can be composed of one or more shards. One shard can read data at a rate of up to 2 MB/sec and can write up to 1,000 records/sec, up to a max of 1 MB/sec. A user should specify a number of shards that coincides with the amount of data expected to be present in their system. Pricing of Kinesis Streams is done on a per-shard basis.
- Producer: A producer is a source of data, typically generated externally to your system in real-world applications (e.g. user click data).
- Consumer: Once the data is placed in a stream, it can be processed and stored somewhere (e.g. on HDFS or a database). Anything that reads in data from a stream is said to be a consumer.
- Partition Key: The Partition Key is a string that is hashed to determine which shard the record goes to. For instance, given the record r = {name: ‘Jane’, city: ‘New York’}, one can specify the Partition Key as r[“city”], which will send all records with the same city to the same shard (see the sketch below).
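As a minimal sketch of these concepts in boto3 (the “Orders” stream name is illustrative and assumed to already exist):

import json
import boto3

kinesis = boto3.client('kinesis')

r = {"name": "Jane", "city": "New York"}

# records sharing a PartitionKey (here, the city) are routed to the same shard
kinesis.put_record(StreamName="Orders",
                   Data=json.dumps(r),
                   PartitionKey=r["city"])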

Intro to DynamoDB

DynamoDB is Amazon’s distributed key-value store, optimized for fast reads and writes of data. Like many other distributed key-value stores, its query language does not support joins; instead it is optimized for fast reading and writing of data, allowing for a more flexible table structure than traditional relational models.

Some key concepts include:

- Partition key: As in all key-value stores, the partition key is a unique identifier for an entry. This allows O(1) access to a row by its partition key in DynamoDB.
- Sort key: Each row can be broken up and given some additional structure. Each row can be thought of as a hashmap ordered by the keys. In the language of Java, I personally like to think of DynamoDB as a “HashMap<T, TreeMap<U, V>>”, where T, U and V are generic types allowed by DynamoDB. Note that a TreeMap is essentially a HashMap where the keys are sorted.
- Provisioned Throughput: This is the amount of reads and writes you expect your table to incur. Pricing of DynamoDB is based on how many reads/writes you provision (see the sketch below).
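A minimal sketch of creating such a table with boto3, with illustrative table and attribute names:

import boto3

dynamodb = boto3.client('dynamodb')

dynamodb.create_table(
    TableName='TwitterFeed',
    # the partition key identifies a row; the sort key orders entries within it
    KeySchema=[
        {'AttributeName': 'userid', 'KeyType': 'HASH'},      # partition key
        {'AttributeName': 'timestamp', 'KeyType': 'RANGE'},  # sort key
    ],
    AttributeDefinitions=[
        {'AttributeName': 'userid', 'AttributeType': 'S'},
        {'AttributeName': 'timestamp', 'AttributeType': 'N'},
    ],
    # expected reads/writes per second; this is what DynamoDB pricing is based on
    ProvisionedThroughput={'ReadCapacityUnits': 1, 'WriteCapacityUnits': 1},
)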

The figure below shows a mock-up of a Twitter feed table. Each row contains all the tweets that should show up in a given user’s Twitter feed. The PrimaryKey is the userid or Twitter handle and the SortKey is the timestamp at which the tweet was originally created.

[Figure: mock-up of a Twitter feed table in DynamoDB]
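A hedged sketch of reading and writing such a feed table, assuming the TwitterFeed table from the sketch above:

import time
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('TwitterFeed')

# add a tweet to a user's feed; the sort key is the tweet's creation time
table.put_item(Item={'userid': '@jane',
                     'timestamp': int(time.time()),
                     'tweet': 'hello world'})

# fetch the feed newest-first by walking the sort key in descending order
resp = table.query(KeyConditionExpression=Key('userid').eq('@jane'),
                   ScanIndexForward=False)
print(resp['Items'])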
Example using Twitter data

Cost of running this tutorial

The following sections show how one can ingest Twitter data into Kinesis and store hashtags in DynamoDB. We will use the minimal amount of resources necessary in this tutorial, namely 1 stream and 1 shard in Kinesis, which costs less than $0.02 per hour (pricing depends on the region you choose). DynamoDB will also cost less than $0.02 per hour.

Prerequisites: AWS and boto

If you do not already have one, you can set up an AWS account for free. We will connect to the AWS ecosystem using the boto library in Python. Using pip, one can easily install the latest version of boto, namely

pip install boto3

You can specify the physical region in which your data pipeline resides via a config file located at ~/.aws/config. Open this file with the command

$ nano ~/.aws/config

and add the following (feel free to modify the region):

[default]
region = us-east-1
Next, export your AWS keys in the terminal: open your profile with

$ nano ~/.bash_profile

and copy

export AWS_ACCESS_KEY_ID=<your AWS access key>
export AWS_SECRET_ACCESS_KEY=<your AWS secret key>
into ~/.bash_profile, and be sure to run

$ source ~/.bash_profile

to make sure the environment variables have been set. This will allow you to make use of the boto library without having to specify the region or AWS credentials in your scripts.
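A quick sanity check that the setup works: boto3 should now connect without any explicit arguments.

import boto3

# no region or credentials passed explicitly; boto3 picks them up from
# ~/.aws/config and the environment variables exported above
kinesis = boto3.client('kinesis')
print(kinesis.list_streams()['StreamNames'])  # [] if you have no streams yet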

Prerequisites: Twitter credentials

To get started, you will need credentials from Twitter to make calls to its public API. To this end, go to the Twitter application management page, create a new app and fill out the form to get your unique credentials for requesting Twitter data.

Once you have your Twitter credentials you can put them in a config-like file called twitterCreds, which would look like

## template of what the twitterCreds file should look like
## the actual file should contain your actual Twitter
## credentials, obtained/created on the Twitter app page
consumer_key = "XXXxxXXX"
consumer_secret = "xxXXxXX"
access_token_key = "XXX-XXxxXXX"
access_token_secret = "XXxxXXX"

If you are using GitHub to work on this tutorial, please be sure to add twitterCreds to your .gitignore file to avoid putting your credentials on GitHub.
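Since the file uses Python assignment syntax, one simple way to load the credentials (assuming the file is saved as twitterCreds.py next to your scripts) is to import it directly:

# assumes the credentials file is saved as twitterCreds.py
from twitterCreds import (consumer_key, consumer_secret,
                          access_token_key, access_token_secret)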

Feeding data from Twitter to a Kinesis stream

To put data from Twitter into a Kinesis stream, we use the boto library in Python to create a stream and put tweets into it as records.
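A minimal producer sketch, under a few assumptions: the python-twitter library (pip install python-twitter) is used to read the Twitter streaming API, the stream is named 'twitter', and the track term and choice of partition key are illustrative.

import json
import boto3
import twitter
from twitterCreds import (consumer_key, consumer_secret,
                          access_token_key, access_token_secret)

kinesis = boto3.client('kinesis')

# create a stream with a single shard; skip creation if it already exists
try:
    kinesis.create_stream(StreamName='twitter', ShardCount=1)
    kinesis.get_waiter('stream_exists').wait(StreamName='twitter')
except kinesis.exceptions.ResourceInUseException:
    pass

api = twitter.Api(consumer_key=consumer_key,
                  consumer_secret=consumer_secret,
                  access_token_key=access_token_key,
                  access_token_secret=access_token_secret)

# stream tweets matching a sample term and push each one into Kinesis;
# partitioning by tweet id spreads records evenly across shards
for tweet in api.GetStreamFilter(track=['aws']):
    kinesis.put_record(StreamName='twitter',
                       Data=json.dumps(tweet),
                       PartitionKey=str(tweet.get('id', 0)))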

