Write a sample S3-Select Lambda Function in AWS

What is S3-Select?

S3 Select is an Amazon S3 feature that lets you retrieve a subset of an object's data using simple SQL expressions. S3 buckets often hold terabytes of data. Without S3 Select, your application has to download whole objects and filter them itself. With S3 Select, you send a SQL expression and S3 returns only the data you need, filtered and structured, instead of the entire object.


According to AWS, S3 Select can deliver drastic performance increases, in many cases as much as a 400% improvement. It is available in all commercial AWS Regions. You can use it from the AWS SDKs for Java and Python or from the AWS CLI.

When to use S3-Select?

S3 Select is extremely useful when:

  • You have a large amount of structured data that can be queried.
  • Only a small portion of each object is relevant to your current needs.
  • You want partial data retrieval so that users receive only filtered data.
  • You want to pre-filter S3 objects before performing additional analysis with tools like Spark or Presto.
  • You want to reduce the volume of data that your applications have to load and process, which improves performance and reduces cost.

Cost of S3-Select

The cost of S3 Select depends on two things: the data scanned by the query and the data returned. For example, if an S3 Select query scans 5 GB of data and returns 2 GB, the cost in the us-east-1 region would be:

Data scanned by S3 Select: $0.002/GB * 5 GB = $0.01

Data returned by S3 Select: $0.0007/GB * 2 GB = $0.0014

Total = $0.0114 per S3 Select request.
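The arithmetic above can be reproduced with a small helper. This is only a sketch: the function name s3_select_cost is made up for illustration, and the default rates are the us-east-1 figures quoted in this post, which may not match current AWS pricing.

```python
def s3_select_cost(scanned_gb, returned_gb,
                   scan_rate=0.002, return_rate=0.0007):
    """Estimate the S3 Select charge in USD for one request.

    The default rates are the us-east-1 prices quoted in this post;
    they are assumptions and may not match current pricing.
    """
    scan_cost = scanned_gb * scan_rate        # charge for data scanned
    return_cost = returned_gb * return_rate   # charge for data returned
    return scan_cost + return_cost


if __name__ == "__main__":
    total = s3_select_cost(scanned_gb=5, returned_gb=2)
    print(f"Total: ${total:.4f}")
```

Passing different rates as keyword arguments lets you redo the estimate for another region.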

Terminology for writing an S3-Select query

To use S3 Select, your data must be in CSV, JSON, or Apache Parquet format, and CSV and JSON objects must be UTF-8 encoded. You can also compress CSV and JSON files with GZIP or BZIP2 before uploading them to S3 to reduce object size. To run an S3 Select query, call the select_object_content() method on the S3 client. The call requires the following pieces of information:

  1. Bucket: the S3 bucket where your file is stored
  2. Key: the path to your file in the S3 bucket
  3. ExpressionType: the type of expression. Valid value: SQL
  4. Expression: the SQL expression used to retrieve data
  5. InputSerialization: details of the object being queried. It is a container element that can include the following:
       • CompressionType: the compression type of the Amazon S3 object being queried. Not mandatory. Valid values: NONE | GZIP | BZIP2
       • CSV | JSON | Parquet: the format and properties of the Amazon S3 object being queried. Mandatory.
  6. OutputSerialization: details of how the results are to be returned. It is a container element that can include the following:
       • CSV | JSON: the format and properties of the data returned in the response. Mandatory.

CSV and JSON are also container elements, inside which you can specify properties such as RecordDelimiter, FieldDelimiter, and QuoteFields. To learn about the other fields, see the documentation.
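To make the shape of these parameters concrete, here is a minimal sketch of a helper that assembles them into a dictionary. The helper name build_select_params and its arguments are hypothetical and not part of Boto3; only the dictionary keys come from the parameter list above. The sketch covers CSV and JSON input only.

```python
def build_select_params(bucket, key, expression,
                        input_format="CSV", compression="NONE",
                        has_header=True):
    """Assemble keyword arguments for select_object_content().

    Hypothetical helper for illustration; handles CSV and JSON input only.
    """
    if input_format == "CSV":
        # "USE" lets the query refer to columns by their header names.
        format_options = {"FileHeaderInfo": "USE" if has_header else "NONE"}
    elif input_format == "JSON":
        # "LINES" means one JSON object per line; "DOCUMENT" is the alternative.
        format_options = {"Type": "LINES"}
    else:
        raise ValueError(f"Unsupported input format: {input_format}")

    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {
            "CompressionType": compression,
            input_format: format_options,
        },
        # Return one JSON object per line.
        "OutputSerialization": {"JSON": {"RecordDelimiter": "\n"}},
    }


if __name__ == "__main__":
    params = build_select_params(
        "my-bucket", "2019/sampledata.csv.gz",
        "SELECT * FROM s3object s WHERE s.Address = 'Tokyo'",
        compression="GZIP",
    )
    print(params["InputSerialization"])
```

The resulting dictionary could then be unpacked into the call, for example s3.select_object_content(**params), assuming s3 is a Boto3 S3 client.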

Sample of S3 Select in Lambda

Step 1: Log in to your AWS account console. Select the region, then select the Lambda service.

Step 2: Create a function. Choose "Author from scratch" and create a new IAM role that allows operations on S3. The role must have the s3:GetObject permission for the object you are querying. Choose Python as your runtime (this post used Python 3.6; pick a currently supported Python 3 version if 3.6 is no longer offered). You can also go to the IAM console and change the role's policy there.
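As a rough illustration of the permission mentioned in this step, the sketch below builds a minimal IAM policy document that grants s3:GetObject on every object in one bucket. The function name select_read_policy is made up for this example, and the bucket name is the sample one used later in this post. It covers only the S3 read; the logging permissions that a Lambda execution role usually carries are separate.

```python
import json


def select_read_policy(bucket_name):
    """Return a minimal IAM policy document allowing s3:GetObject on a bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # "/*" scopes the permission to the objects inside the bucket.
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON so it can be pasted into the IAM policy editor.
    print(json.dumps(select_read_policy("kanika-s3-select"), indent=2))
```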

Step 3: On your Lambda function dashboard, go to the editor and paste in the following code. Replace the value of BUCKET_NAME with your bucket name and set KEY to the path of your file in that bucket. We will use Boto3, the AWS SDK for Python. Don’t forget to save.

import boto3

def lambda_handler(event, context):
    # Replace these with your own bucket name and object key.
    BUCKET_NAME = 'kanika-s3-select'
    KEY = '2019/sampledata.csv'
    s3 = boto3.client('s3', region_name='ap-south-1')
    response = s3.select_object_content(
        Bucket=BUCKET_NAME,
        Key=KEY,
        ExpressionType='SQL',
        Expression="SELECT * FROM s3object s WHERE s.Address = 'Tokyo'",
        # Treat the first CSV line as a header so columns can be used by name.
        InputSerialization={'CSV': {'FileHeaderInfo': 'USE'}},
        OutputSerialization={'JSON': {}},
    )

    # The payload is an event stream; loop over it to read each message.
    # The loop variable is not called "event" so it does not shadow the
    # handler's own event argument.
    for stream_event in response['Payload']:
        if 'Records' in stream_event:
            records = stream_event['Records']['Payload'].decode('utf-8')
            print(records)
        elif 'Stats' in stream_event:
            stats_details = stream_event['Stats']['Details']
            print("Bytes scanned: ")
            print(stats_details['BytesScanned'])
            print("Bytes processed: ")
            print(stats_details['BytesProcessed'])
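If you want the rows as Python objects rather than printed text, one option is to join all the Records chunks first and parse them afterwards, since a single record may be split across two chunks. The sketch below assumes JSON output with the default newline record delimiter. The helper name collect_records is hypothetical, and the fake_payload list is a hand-made stand-in for response['Payload'] so the example runs without AWS access.

```python
import json


def collect_records(payload):
    """Join all Records chunks from the event stream and parse them
    as newline-delimited JSON. Returns (rows, stats_details)."""
    chunks = []
    stats = None
    for stream_event in payload:
        if 'Records' in stream_event:
            chunks.append(stream_event['Records']['Payload'])
        elif 'Stats' in stream_event:
            stats = stream_event['Stats']['Details']
    # Decode only after joining, so a record split across chunks stays intact.
    text = b''.join(chunks).decode('utf-8')
    rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    return rows, stats


if __name__ == "__main__":
    # Hand-made sample data; the second record is deliberately split
    # across two chunks.
    fake_payload = [
        {'Records': {'Payload': b'{"Name":"A","Address":"Tokyo"}\n{"Name":"B",'}},
        {'Records': {'Payload': b'"Address":"Tokyo"}\n'}},
        {'Stats': {'Details': {'BytesScanned': 120, 'BytesProcessed': 120}}},
        {'End': {}},
    ]
    rows, stats = collect_records(fake_payload)
    print(rows)
    print(stats['BytesScanned'])
```

Inside the Lambda handler, the same helper could be called as collect_records(response['Payload']).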

Step 4: Configure a HelloWorld test. Since the function does not read any values from the test event, it doesn’t matter which test you configure. Click Test. You’ll see that S3 Select has fetched the results.

You can experiment further with different file formats and properties in S3 Select.

Follow me for more such blogs on AWS services!

Written by

Software Development Engineer @amazon | GSoC’18 Mentor@systers | Open source enthusiast | Love to participate in Hackathons | Optimizations is secret of my code
