
Shashi Shankar
Mar 20, 2023
Introduction to Redshift
Amazon Redshift is a fully managed, petabyte-scale data warehouse service provided by Amazon Web Services (AWS). It is designed to handle large datasets and complex analytics workloads, making it suitable for data warehousing, business intelligence, and data analytics applications.
Key features of Amazon Redshift include:
Columnar Storage: Redshift stores data in a columnar format, which improves query performance by reducing I/O operations and optimizing data compression.
Massively Parallel Processing (MPP): It leverages a distributed architecture across multiple nodes to parallelize query execution, enabling high performance and scalability for data processing.
Advanced Compression: Redshift automatically compresses data and applies encoding techniques to minimize storage requirements and improve query performance.
Integration with AWS Ecosystem: Redshift seamlessly integrates with other AWS services such as Amazon S3, AWS Glue, AWS Lambda, and Amazon Kinesis for data ingestion, data transformation, and data processing.
SQL Compatibility: Redshift supports standard SQL queries, making it easy for users familiar with SQL to interact with the data warehouse and perform analytics tasks.
Concurrency and Workload Management: It provides features for managing concurrent user queries and workload prioritization, ensuring efficient resource utilization and performance optimization.
Security and Compliance: Redshift offers robust security features such as encryption at rest and in transit, access controls, audit logging, and compliance certifications to meet data security and regulatory requirements.
Overall, Amazon Redshift enables organizations to analyze large volumes of data quickly and cost-effectively, driving insights and decision-making processes with its scalable and high-performance data warehouse solution.
The Amazon Redshift data warehouse comprises a Leader Node, one or more Compute Nodes, and storage disks.
Redshift warehouse is built on a distributed processing architecture.
Role of Leader Node:
Receives queries from clients
Develops query execution plan
Coordinates with Compute Nodes for parallel execution of the execution plan
Aggregates the results returned by Compute Nodes and returns the final result to the client
Role of Compute Node:
Responds to Leader Node’s instructions
Processes its portion of execution plan
Interacts with Redshift managed storage disks
Returns the results to Leader Node
When launching a cluster, we need to provide a node type. The node type determines the CPU, RAM, storage capacity, and storage drive type for each node
Amazon recommends using the RA3 or DC2 node types
RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute and managed storage independently.
Features of Amazon Redshift
•Fast, fully managed, cost-effective way to analyze terabytes to petabytes of data using complex SQL and BI tools
•Also includes Redshift Spectrum, which allows us to execute SQL queries on large volumes of structured and unstructured data in S3 and join them with data stored in Redshift
•Easy to scale up and down
•Automated patching and backup
•Automatic recovery from component and node failure
•Security – column level access, encryption, IAM roles, federated access for SSO
•No upfront investment needed to set up a Redshift cluster, with the ability to expand the cluster as data size grows
What is AQUA (Advanced Query Accelerator) – it is a distributed, hardware-accelerated cache that enables Redshift to process data up to 10x faster than other data warehouses. It brings compute to the storage layer, as opposed to moving data to compute in a traditional data warehouse
Redshift Spectrum – uses the Redshift SQL endpoint, which generates a query plan that takes into consideration both local data in Redshift and external data on S3.
Redshift Spectrum works out of a shared resource pool
Kinesis Data Analytics can load data into Redshift using a Lambda function or Kinesis Data Firehose
RA3 Node Type
Managed Storage – the RA3 node type provides the managed storage feature
Allows customers to scale and pay independently for Compute and Storage
Uses SSD-based local tier-1 storage as a cache
A cluster using DS or DC nodes can be upgraded to RA3
Capacity
RA3.16xlarge – 2 to 128 nodes, each storing 64 TB, for a maximum capacity of about 8 Petabytes; RA3.4xlarge – about 4 Petabytes; RA3.xlplus – about 1 PB capacity
DC – 160 GB of SSD per node, 1 to 128 nodes, up to about 326 TB. You can start at $0.25/hour and scale up to roughly $1,000/TB/year
DS Large – 2 TB of magnetic storage per node; DS Extra Large – 16 TB of magnetic storage per node
How to create multi node Redshift Cluster
Redshift Console -> specify the AZ, number of nodes, node type, a master user name and password, security groups, backup and retention options, and other system settings -> Create
When to use Redshift, Redshift Spectrum or EMR
EMR – when custom code is needed to process large volumes of complex structured and unstructured data
Redshift – processing structured data pulled from different business applications for reporting and business intelligence; performing complex queries on large volumes of structured and semi-structured data
Redshift Spectrum – you can use Redshift Spectrum with EMR
Redshift Spectrum uses the same approach as EMR to store metadata
Can support Hive Metastore that is used in EMR
If you already have EMR, you can query the same data using Redshift Spectrum without interfering with EMR processing
Redshift Data API
The Redshift Data API doesn't require a persistent connection to the cluster. Instead, it provides a secure HTTP endpoint and integration with AWS SDKs.
You can use the endpoint to run SQL statements without managing connections.
Calls to the Data API are asynchronous.
The Data API uses either credentials stored in AWS Secrets Manager or temporary database credentials. You don't need to pass passwords in the API calls with either authorization method.
You can authorize a user to access the Data API by adding a managed policy, which is a predefined AWS Identity and Access Management (IAM) policy.
Manifest File
A manifest file is needed when we load data from a remote host.
Best Practice
Keep your Cluster Clean (Vacuum and Analyze)
Run the VACUUM command and then ANALYZE every day
VACUUM is heavy on CPU and memory, hence it should be executed during off hours
VACUUM will clean deleted data and ANALYZE will update the table statistics used by the query planner (see the sketch after this list)
Workload Management Queue Configuration
Keep enough space to run queries
Avoid UPDATES if possible
Use COPY (batches) instead of Inserts
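As a minimal sketch of this maintenance routine, assuming a placeholder table named sales, a daily off-hours job could run:

-- Reclaim space from deleted rows and restore sort order (run during off hours)
VACUUM FULL sales;
-- Refresh the statistics the query planner relies on
ANALYZE sales;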
Redshift Spectrum
Enables execution of SQL queries against very high volume of unstructured data in S3 with no additional ETL loading of data
The query is processed by the Redshift SQL endpoint, which generates and optimizes the query plan.
Redshift Spectrum Workers from shared resource pool read and process data from S3
Redshift Spectrum scales out to thousands of instances depending on workload requirement
Redshift Spectrum separates storage and compute. Each can be scaled separately
You can set up as many Redshift Clusters as needed to process S3 data
Provides High Availability and Concurrency of Redshift Cluster instances to process S3 data
Athena vs Redshift Spectrum
Athena is used for executing ad-hoc queries on data in S3. It is serverless.
Redshift Spectrum provides the ability to extend queries on data stored in Redshift to data in an S3 data lake
How is Redshift Different from Traditional Data Warehouses
Columnar data storage
Advanced compression
MPP
Redshift Spectrum
Materialized views
Scalability
Billing
Compute Node hours – for billing purposes, Compute Node hours are the total number of hours you run across all compute nodes. Node usage is billed at an hourly rate. You must pause or terminate the cluster to avoid compute charges
Managed storage – fixed GB-month rate for the region. Use of managed storage is calculated hourly based on total data present in managed storage
Backup storage – there is no backup storage charge up to the amount of storage provisioned for Redshift. But if backups are retained after the cluster is terminated, there will be a storage cost for them.
Data Transfer - there is no data transfer charge for data transferred between Redshift and S3 in the same region. You will be billed for all other data transfer charges
Data Scanned – charged for S3 data scanned by Redshift Spectrum to perform query. There are no charges for Redshift Spectrum if it is not performing any query
Data Loading
You can load data from S3, EMR, DynamoDB, RDS databases, Redshift Spectrum, Glue, any SSH-enabled host (EC2 or on-premises), Data Pipeline, and SQL INSERT statements over a JDBC/ODBC connection
Data load using the COPY command is executed in parallel by the compute nodes and is much faster.
The COPY command can be used to load data from S3, DynamoDB, EMR and SSH-enabled hosts (see the COPY sketch below)
Loading data using INSERT is single threaded and is much slower
Redshift Spectrum can be used to load Parquet or other external data by converting it to a structured format for Redshift. But there are charges for using Redshift Spectrum
Amazon Data Pipeline provides a high-performance and reliable process to load data from a variety of sources, including RDS databases, into Redshift
Glue jobs can also be used to load data into Redshift.
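A hedged sketch of a parallel COPY load from S3; the table name, bucket path, and IAM role ARN are placeholders, not values from this article:

-- Load CSV files from S3 in parallel across the compute nodes
COPY sales
FROM 's3://my-bucket/sales/2023/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1
REGION 'us-east-1';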
Loading Data From EMR – no need to create EMRFS
Create IAM Role
Add the Redshift public key to the EC2 instances' authorized_keys files
Configure hosts to accept all Redshift IP addresses
Use Redshift COPY command
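Once the hosts are configured, the load itself is a COPY with an EMR source; the EMR cluster ID, HDFS path, and role ARN below are hypothetical:

-- Load tab-delimited output files directly from the EMR cluster's HDFS
COPY sales
FROM 'emr://j-1A2BZZZ3XYZ/myoutput/part-*'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
DELIMITER '\t';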
Best Practices
Always run VACUUM during off hours
Run ANALYZE after VACUUM
https://source.coveo.com/2017/09/15/redshift-best-practices2/
WLM - allows you to assign more memory to critical queries, control how many queries can run concurrently, and set a higher timeout for longer queries
Keep enough space on Disk for query to run smoothly
Try to avoid UPDATES
Use COPY to load data
Security
Amazon Redshift supports the following features
IAM integration
Identity Federation for SSO
MFA
Column level access control
VPC
KMS integration to protect data in transit and at rest
SSL enabled connections between clients and Redshift
Redshift encrypts each data block using hardware-accelerated AES-256
Encrypts everything written to disk including intermediate query results
You can choose to manage your own key
Redshift Spectrum supports S3 SSE
SSO support – corporate AD, Federation Service, Okta, Ping Federate, SAML compliant identity providers
Direct access to Redshift compute node is not permitted
Redshift cluster is available in a single Availability Zone. It is not available in multi AZ.
The cluster is unavailable when the AZ is unavailable.
You can restore from any existing snapshot to a new AZ in the same region; Amazon restores the most frequently accessed data first.
Backup and Restore
Replicates all data within the data warehouse and also continuously backs up data to S3
Always maintains 3 copies of data
You can configure Redshift to automatically copy both manual and automated snapshots to another region.
By default, automated backups of the whole cluster are enabled with 1 day of retention, which can be increased to 35 days.
Free backup storage is limited to total size of storage on the nodes and only applies to active clusters.
Additional charges are incurred when retention period is increased beyond 1 day
You can choose a backup version for restoration when needed
All previously created manual snapshots are retained and storage is charged even after the cluster is deleted
Scalability
You can increase the number of nodes using Elastic Resize
Changes take effect within minutes
Metrics are available in Console and CloudWatch API. Detailed metrics can be enabled in CloudWatch
Concurrency Scaling – Amazon automatically adds transient capacity to handle load spikes
With Concurrency Scaling, you can have virtually unlimited concurrent users and concurrent queries
Amazon automatically directs queries to transient clusters
Each cluster earns up to 1 hour of credit per day to use transient capacity
Elastic Resize – it is different from Concurrency Scaling.
It adds nodes to a single cluster within minutes to manage its query throughput
Concurrency Scaling adds additional clusters to handle a large number of concurrent queries and directs queries to those additional clusters.
With Redshift Spectrum, you can have multiple clusters for processing S3 data.
You can use different clusters for different use cases – one cluster for regular queries and one for data science
Concurrency Level
Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for various types of queues, including internal system queues and user-accessible queues.
From a user perspective, a user-accessible service class and a queue are functionally equivalent.
With manual WLM, Amazon Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue with a concurrency level of one.
You can define up to eight queues.
Each queue can be configured with a maximum concurrency level of 50.
The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.
The easiest way to modify the WLM configuration is by using the Amazon Redshift Management Console. You can also use the Amazon Redshift command line interface (CLI) or the Amazon Redshift API
Currently, the default for clusters using the default parameter group is to use automatic WLM. Automatic WLM manages query concurrency and memory allocation
External Schema and Table
Redshift Spectrum uses external tables to access data located outside of Redshift clusters
External table metadata can be created in Redshift, the Athena Data Catalog, AWS Glue, or a Hive metastore
For external table metadata in Glue, Athena or a Hive metastore, you must first create an external schema that references the external database
To define an external table in an external schema, you must be the owner of the external schema or a superuser
To run a Redshift Spectrum query, the following permissions are needed:
Usage permission on the schema (GRANT USAGE ON SCHEMA <schema name> TO GROUP <group name>;)
Permission to create temporary tables in the database (GRANT TEMP ON DATABASE <database name> TO GROUP <group name>;)
You must qualify the external table name with the schema name for tables located in an external schema
To define an external table in the data catalog, use the CREATE EXTERNAL TABLE command
You must give the S3 path and table properties, for example TABLE PROPERTIES ('numRows' = '<n>');
Redshift keeps path and size information in the pseudocolumns $path and $size.
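A sketch of the external schema and table definitions described above, assuming a Glue/Athena data catalog and placeholder database, bucket, and role names:

-- External schema backed by the data catalog
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table with a path (LOCATION) and table properties
CREATE EXTERNAL TABLE spectrum_schema.sales (
  sale_id   INT,
  sale_date DATE,
  amount    DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/sales/'
TABLE PROPERTIES ('numRows' = '100000');

-- The $path and $size pseudocolumns expose the source file and its size
SELECT "$path", "$size" FROM spectrum_schema.sales LIMIT 10;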
Partitioning Redshift Spectrum External table
Keep data in S3 in folders that correspond to the partition key values in the bucket
Define the table with a PARTITIONED BY clause and use ADD PARTITION to add new partitions
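A sketch of a partitioned Spectrum table, assuming the data is laid out in S3 folders such as .../saledate=2023-03-01/ (all names are placeholders):

CREATE EXTERNAL TABLE spectrum_schema.sales_part (
  sale_id INT,
  amount  DECIMAL(10,2)
)
PARTITIONED BY (saledate DATE)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales_part/';

-- Register a new partition as its folder lands in S3
ALTER TABLE spectrum_schema.sales_part
ADD PARTITION (saledate = '2023-03-01')
LOCATION 's3://my-bucket/sales_part/saledate=2023-03-01/';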
Redshift Data Distribution
A Redshift table is distributed across multiple nodes - the goal is to minimize data movement during query execution
There are 4 distribution styles for placing rows on the nodes and on the slices within the nodes
AUTO - Redshift assigns a distribution style based on the size of the data when no distribution style is specified at table creation. It can be EVEN, KEY or ALL
EVEN - rows are distributed evenly across the slices in round-robin fashion
KEY - rows are distributed based on the values of a column; rows with matching column values are physically stored together
ALL - the whole table is replicated to every node. It is only appropriate for relatively small, slow-moving tables (not updated frequently or extensively)
Historically, the distribution style could not be changed after the table was created; newer Redshift releases support ALTER TABLE ... ALTER DISTSTYLE
The PG_CLASS_INFO system view provides the distribution style of a table;
its RELEFFECTIVEDISTSTYLE column shows the current (effective) distribution style
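A sketch of declaring distribution styles and checking the effective style; the table and column names are illustrative only:

-- KEY distribution: rows with the same customer_id land on the same slice
CREATE TABLE orders (
  order_id    BIGINT,
  customer_id BIGINT,
  order_total DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- ALL distribution: small, slow-moving lookup table replicated to every node
CREATE TABLE region_lookup (
  region_id   INT,
  region_name VARCHAR(50)
)
DISTSTYLE ALL;

-- Check the current (effective) distribution style
SELECT relname, releffectivediststyle
FROM pg_class_info
WHERE relname IN ('orders', 'region_lookup');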
Sort Keys
Similar to index in relational databases
Max and Min value of sort key for a block is stored as metadata. It helps Redshift in quickly accessing required block when executing a query
Rows are stored in sorted order of value of the column selected for sort key
You can define one or more columns as Sort Key
Faster execution of Range queries
How to choose Sort Key
Recency based (recently added records, e.g. timestamp) - if queries target recently added data, use a timestamp-based column
Filtering on a column - specify the column
Joins - columns used in joins
Types of Sort Keys - Single, compound and interleaved
Single Sort Key - a sort key defined on a single column - for example, when queries mostly filter on a date key or customer ID
Compound Sort Key - multiple columns that are used in joins and filters
Default type
Good for compression
Not good when the query depends on secondary sort column
•Interleaved -
Gives equal weight to each column in the Sort Key
Good when different columns are used in joins
Good when the query depends on a secondary sort column
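A sketch of the two multi-column sort key types; the tables and columns are illustrative:

-- Compound sort key: best when filters hit the leading column(s)
CREATE TABLE events_compound (
  event_time  TIMESTAMP,
  customer_id BIGINT,
  event_type  VARCHAR(30)
)
COMPOUND SORTKEY (event_time, customer_id);

-- Interleaved sort key: equal weight to each column, useful when queries filter on either column alone
CREATE TABLE events_interleaved (
  event_time  TIMESTAMP,
  customer_id BIGINT,
  event_type  VARCHAR(30)
)
INTERLEAVED SORTKEY (customer_id, event_type);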
Integration with Other AWS Services
S3 - using COPY command you can load data from S3
DynamoDB - using the COPY command you can load a single DynamoDB table into Redshift
EMR / EC2 - Import data into Redshift via COPY command
Data Pipeline -
Database Migration Service
Every COPY command requires at least 3 parameters
Source
Target Redshift table and
Authorization (an IAM role, or an access key ID and secret access key)
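Two hedged sketches of the authorization parameter, one with key-based credentials and one with an IAM role (all values are placeholders):

-- Key-based access control
COPY sales FROM 's3://my-bucket/sales/'
ACCESS_KEY_ID '<access-key-id>'
SECRET_ACCESS_KEY '<secret-access-key>'
CSV;

-- Role-based access control (generally the preferred option)
COPY sales FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
CSV;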
Enhanced VPC Routing - forces all data movement between Redshift and S3 (and other services) through your VPC. It is much safer, as otherwise data will be copied over the internet
Set up a NAT gateway or a VPC endpoint to copy data
Redshift Workload Management - short, fast-running queries are NOT stuck behind long-running queries
Helps in prioritizing Work Loads
Query prioritization based on their nature
Query queues are based on Service Classes; up to eight user-defined queues are available
Separate queues for long-running queries and separate queues for short-running queries
Can be set via Console, CLI or API
VACUUM Command - recovers space from deleted rows and restores sort order
Vacuum Full - default, re-sorts all rows and reclaims space
Vacuum Delete Only - no Sorting, reclaims space
Vacuum Sort Only - only Sort no reclaim
Vacuum Reindex - re-analyzes interleaved sort keys and then performs a full VACUUM
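The variants above map directly to VACUUM options; sales is a placeholder table name:

VACUUM FULL sales;         -- default: re-sorts rows and reclaims space
VACUUM DELETE ONLY sales;  -- reclaims space, no sorting
VACUUM SORT ONLY sales;    -- sorts rows, no space reclaim
VACUUM REINDEX sales;      -- re-analyzes interleaved sort keys, then runs a full vacuum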
Cost
When you select your instance type, consider that Amazon Redshift compresses data up to four times
Amazon Redshift RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute and managed storage independently. With RA3, you choose the number of nodes based on your performance requirements and pay only for the managed storage that you use.
When you start using Amazon Redshift for the first time, you will receive a recommendation for the best node type based on your needs.
For workloads that require ever-growing storage, managed storage lets you automatically scale your data warehouse storage capacity without adding and paying for additional nodes.
You should size your RA3 cluster based on the amount of data you process daily.
Amazon Redshift managed storage uses large, high-performance SSDs in each RA3 node for fast local storage and Amazon Simple Storage Service (Amazon S3) for longer-term durable storage.
If the data in a node grows beyond the size of the large local SSDs, Amazon Redshift managed storage automatically offloads that data to Amazon S3.
You pay the same low rate for Amazon Redshift managed storage regardless of whether the data sits in high-performance SSDs or Amazon S3
Previous-generation node types include DC2 (compute intensive) and DS2 (storage intensive). Reserved instances (RI) (also called reserved nodes in the Amazon Redshift console) can provide up to 75% savings vs on-demand pricing
Trusted Advisor
Redshift Reserved Node Optimization - recommends a different reserved node configuration for cost optimization
Underutilized Node - checks for CPU utilization below 5% for 99% of the last 7 days
AWS Cost Explorer - helps to visualize, understand and manage AWS costs and usage over time across linked accounts, regions, user groups and tags for last 12 months. You can create - a) Budget, b) Cost and usage report and c) recommendation for purchasing reserved instances
Amazon Redshift Advisor
Improving operation cost
Improving performance
Table Compression - for cost optimization, the first thing you should focus on is table compression
Amazon Redshift Advisor tracks uncompressed tables and offers recommendation
Advisor identifies uncompressed data being loaded using COPY command and recommends compression prior to COPY to Redshift. This reduces data transfer cost
The COPY command with COMPUPDATE set to ON analyzes the data and applies compression when it is found uncompressed
You can use the ANALYZE COMPRESSION command on an existing table to determine recommended compression encodings without reloading the data
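A short sketch of both approaches, with placeholder table, bucket, and role names:

-- Report recommended column encodings for an existing table
ANALYZE COMPRESSION sales;

-- Or let COPY analyze and apply compression automatically during the load
COPY sales FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
CSV
COMPUPDATE ON;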
Pause and Resume feature of Redshift - when the cluster is turned off, no compute cost is incurred but storage cost is still incurred. Development environment can be paused when not in use to save cost
Cluster Resize - a) Elastic Resize b) Classic Resize
Elastic Resize - the optimal and fastest way to quickly add or remove nodes or change node types. When the node type changes, it automates the steps of taking a snapshot, creating a new cluster, deleting the old cluster, and renaming the new cluster to the old name.
This process can be used to change DC2 or DS2 type cluster to RA3 type cluster
Data is redistributed across the nodes
It takes only 10 to 15 minutes to complete the resize
Classic Resize - is a manual approach to resize a cluster
Elastic Resize is recommended approach
Concurrency Scaling - used for spikes
Cluster Resize is used for known workload changes, whereas Concurrency Scaling is used for clusters with load spikes
Concurrency Scaling is a cost-effective feature that uses concurrency nodes only when there is a demand spike
Each cluster earns up to 1 hour of concurrency scaling credit per day
To implement Concurrency Scaling, route queries to concurrency scaling clusters by designating a Workload Management queue as a Concurrency Scaling queue.
Redshift Workload Management
Enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries.
When you run a query, WLM assigns the query to a queue according to the user's user group or by matching a query group that is listed in the queue configuration with a query group label that the user sets at runtime.
With manual WLM, Amazon Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently,
plus one predefined Superuser queue, with a concurrency level of one.
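A minimal sketch of routing a session's queries to a specific WLM queue by query group label; this assumes a queue has been configured with the query group 'reports':

-- Send subsequent queries in this session to the queue matching the 'reports' query group
SET query_group TO 'reports';

SELECT COUNT(*) FROM orders;  -- example query handled by that queue

-- Return to default routing
RESET query_group;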