
Amazon Redshift Warehouse

Shashi Shankar

Mar 20, 2023

Introduction to Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service provided by Amazon Web Services (AWS). It is designed to handle large datasets and complex analytics workloads, making it suitable for data warehousing, business intelligence, and data analytics applications.


Key features of Amazon Redshift include:

  • Columnar Storage: Redshift stores data in a columnar format, which improves query performance by reducing I/O operations and optimizing data compression.

  • Massively Parallel Processing (MPP): It leverages a distributed architecture across multiple nodes to parallelize query execution, enabling high performance and scalability for data processing.

  • Advanced Compression: Redshift automatically compresses data and applies encoding techniques to minimize storage requirements and improve query performance.

  • Integration with AWS Ecosystem: Redshift seamlessly integrates with other AWS services such as Amazon S3, AWS Glue, AWS Lambda, and Amazon Kinesis for data ingestion, data transformation, and data processing.

  • SQL Compatibility: Redshift supports standard SQL queries, making it easy for users familiar with SQL to interact with the data warehouse and perform analytics tasks.

  • Concurrency and Workload Management: It provides features for managing concurrent user queries and workload prioritization, ensuring efficient resource utilization and performance optimization.

  • Security and Compliance: Redshift offers robust security features such as encryption at rest and in transit, access controls, audit logging, and compliance certifications to meet data security and regulatory requirements.


Overall, Amazon Redshift enables organizations to analyze large volumes of data quickly and cost-effectively, driving insights and decision-making processes with its scalable and high-performance data warehouse solution.


The Amazon Redshift data warehouse comprises a Leader Node, one or more Compute Nodes, and managed storage.


  • Redshift warehouse is built on a distributed processing architecture.

  • Role of the Leader Node:

    • Receives queries from clients

    • Develops the query execution plan

    • Coordinates with Compute Nodes for parallel execution of the plan

    • Aggregates the results returned by the Compute Nodes and returns the final result to the client

  • Role of a Compute Node:

    • Responds to the Leader Node's instructions

    • Processes its portion of the execution plan

    • Interacts with Redshift managed storage

    • Returns its results to the Leader Node

  • When launching a cluster, we need to provide a node type. The node type determines the CPU, RAM, storage capacity, and storage drive type of each node

  • Amazon recommends using the RA3 or DC2 node types

  • RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute and managed storage independently.


Features of Amazon Redshift

Fast, fully managed, and cost-effective; lets you analyze terabytes to petabytes of data using complex SQL and BI tools.

  • Also includes Redshift Spectrum, allowing us to execute SQL queries on large volumes of structured and unstructured data in S3 and join them with data stored in Redshift

  • Easy to scale up and down

  • Automated patching and backup

  • Automatic recovery from component and node failures

  • Security – column-level access, encryption, IAM roles, federated access for SSO

  • No upfront investment needed to set up a Redshift cluster, with the ability to expand the cluster as data size grows


What is AQUA (Advanced Query Accelerator)? It is a distributed, hardware-accelerated cache that enables Redshift to process data up to 10x faster than other cloud data warehouses. It brings the compute to the storage, as opposed to moving data to the compute the way a traditional data warehouse does.


Redshift Spectrum – uses the Redshift SQL endpoint, which generates a query plan that takes into consideration both local data in Redshift and external data in S3.

Redshift Spectrum works out of a shared resource pool.

Kinesis Data Analytics can load data into Redshift using a Lambda function or Kinesis Data Firehose.


RA3 Node Type

  • Managed Storage – the RA3 node type provides the managed storage feature

  • Allows customers to scale and pay for compute and storage independently

  • Uses SSD-based local storage as a tier-1 cache

  • A cluster using DS or DC nodes can be upgraded to RA3

  • Capacity

  • RA3.16xlarge – 2 to 128 nodes, each with 64 TB of managed storage, for a maximum capacity of 8 petabytes; RA3.4xlarge – up to 4 petabytes; RA3.xlplus – up to 1 petabyte

  • DC2 – 160 GB of SSD per dc2.large node, 1 to 128 nodes, up to 326 TB with dc2.8xlarge. You can start at $0.25/hour and scale at roughly $1,000/TB/year

  • DS2 – ds2.xlarge provides 2 TB of magnetic storage per node; ds2.8xlarge provides 16 TB


How to create a multi-node Redshift Cluster


Redshift Console -> specify the AZ, number of nodes, node type, a master user name and password, security groups, backup and retention options, and other system settings -> Create


When to use Redshift, Redshift Spectrum or EMR


  • EMR – when custom code is used to process large volumes of complex structured and unstructured data

  • Redshift – process structured data pulled from different business applications for reporting and business intelligence; perform complex queries on large volumes of structured and semi-structured data

  • Redshift Spectrum – you can use Redshift Spectrum with EMR

  • Redshift Spectrum uses the same approach as EMR to store metadata

  • Can use the Hive metastore that is used in EMR

  • If you already have EMR, then, you can query the same data using Redshift Spectrum without interfering with EMR processing


Redshift API

  • The Redshift Data API doesn't require a persistent connection to the cluster. Instead, it provides a secure HTTP endpoint and integration with AWS SDKs.

  • You can use the endpoint to run SQL statements without managing connections.

  • Calls to the Data API are asynchronous.

  • The Data API uses either credentials stored in AWS Secrets Manager or temporary database credentials. You don't need to pass passwords in the API calls with either authorization method

  • You can authorize a user to access the Data API by adding a managed policy, which is a predefined AWS Identity and Access Management (IAM) policy.


Manifest File

  • A manifest file is needed when we load data from a remote host over SSH.

  • Best Practices (see the sketch after this list):

    • Keep your cluster clean (VACUUM and ANALYZE)

    • Run the VACUUM command and then ANALYZE every day

    • VACUUM is heavy on CPU and memory, so it should be executed during off hours

    • VACUUM cleans out deleted data, and ANALYZE updates table statistics

  • Workload Management queue configuration:

    • Keep enough free space to run queries

    • Avoid UPDATEs if possible

    • Use COPY (batches) instead of INSERTs
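
As a minimal maintenance sketch (the table name sales is hypothetical), the nightly routine amounts to:

    VACUUM FULL sales;   -- reclaim space from deleted rows and restore the sort order
    ANALYZE sales;       -- refresh the table statistics used by the query planner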


Redshift Spectrum

  • Enables execution of SQL queries against very high volumes of structured and unstructured data in S3 with no additional ETL loading of data (see the example after this list)

  • The query is processed by the Redshift SQL endpoint, which generates and optimizes the query plan.

  • Redshift Spectrum workers from a shared resource pool read and process data from S3

  • Redshift Spectrum scales out to thousands of instances depending on workload requirements

  • Redshift Spectrum separates storage and compute; each can be scaled separately

  • You can set up as many Redshift clusters as needed to process S3 data

  • Provides high availability and concurrency of Redshift cluster instances processing S3 data
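
A sketch of joining S3 data with local data; it assumes a hypothetical external table spectrum_schema.sales (its DDL appears later in this post) and a local customers table:

    SELECT c.customer_name, SUM(s.amount) AS total_amount
    FROM spectrum_schema.sales AS s   -- external table, data stays in S3
    JOIN customers AS c               -- ordinary Redshift table
      ON c.customer_id = s.customer_id
    GROUP BY c.customer_name;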


Athena vs Redshift Spectrum

  • Athena is used for executing ad-hoc queries on data in S3. It is serverless.

  • Redshift Spectrum provides the ability to extend Redshift queries from data stored in Redshift to data in an S3 data lake



How is Redshift Different from Traditional Data Warehouses

  • Columnar data storage

  • Advanced compression

  • MPP

  • Redshift Spectrum

  • Materialized views

  • Scalability



Billing

  • Compute node hours – for billing purposes, compute node hours are the total number of hours you run across all compute nodes. Node usage is billed at an hourly rate. You must terminate (or pause) the cluster to avoid being charged

  • Managed storage – a fixed GB-month rate for the region. Use of managed storage is calculated hourly based on the total data present in managed storage

  • Backup storage – there is no backup storage charge for storage provisioned for Redshift, but if backup retention is needed after the cluster is terminated, there will be a storage cost for it

  • Data transfer – there is no data transfer charge for data transferred between Redshift and S3 in the same region. You will be billed for all other data transfer

  • Data scanned – you are charged for the S3 data scanned by Redshift Spectrum to perform queries. There are no Redshift Spectrum charges when it is not performing queries


Data Loading

  • You can load data from S3, EMR, DynamoDB, RDS databases, Redshift Spectrum, Glue, any SSH-enabled connection (EC2 or on-premises), Data Pipeline, and SQL INSERT statements over JDBC/ODBC connections

  • A data load using the COPY command is executed in parallel by the compute nodes and is much faster (see the sketch after this list)

  • The COPY command can be used to load data from S3, DynamoDB, EMR and SSH-enabled hosts

  • Loading data using INSERT is single-threaded and much slower

  • Redshift Spectrum can be used to load Parquet and other semi-structured data by converting it to a structured format for Redshift, but there are charges for using Redshift Spectrum

  • AWS Data Pipeline provides a high-performance and reliable process to load data from a variety of sources, including RDS databases, into Redshift

  • Glue jobs can also be used to load data into Redshift.

  • Loading data from EMR – no need to create EMRFS:

    • Create an IAM role

    • Add the Redshift public key to the EC2 instances' authorized keys files

    • Configure the hosts to accept all Redshift IP addresses

    • Use the Redshift COPY command
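
A minimal COPY sketch covering the three required parameters – source, target table, and authorization (the bucket, role ARN, and table name are hypothetical):

    COPY sales
    FROM 's3://my-bucket/sales/'                              -- source
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'  -- authorization
    CSV
    IGNOREHEADER 1;                                           -- skip the header row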


Best Practices

  • Always run VACUUM during off hours

  • Run ANALYZE after VACUUM

  • https://source.coveo.com/2017/09/15/redshift-best-practices2/

  • WLM – allows you to assign more memory to critical queries, control how many queries can run concurrently, and set a higher timeout for longer queries

  • Keep enough space on Disk for query to run smoothly

  • Try to avoid UPDATES

  • Use COPY to load data


Security

Amazon Redshift supports the following security features:

  • IAM integration

  • Identity Federation for SSO

  • MFA

  • Column level access control

  • VPC

  • KMS integration to protect data at rest

  • SSL enabled connections between clients and Redshift

  • Redshift encrypts each data block using hardware-accelerated AES-256

  • Encrypts everything written to disk including intermediate query results

  • You can choose to manage your own key

  • Redshift Spectrum supports S3 SSE

  • SSO support – corporate Active Directory, AD FS, Okta, PingFederate, and other SAML-compliant identity providers

  • Direct access to Redshift compute node is not permitted

  • A Redshift cluster runs in a single Availability Zone; multi-AZ deployment is not available

  • The cluster is unavailable when its AZ is unavailable

  • You can restore from any existing snapshot to a new AZ in the same region; Amazon restores the most frequently accessed data first


Backup and Restore

  • Redshift replicates all data within the data warehouse cluster and also continuously backs up data to S3

  • Always maintains 3 copies of the data

  • You can configure Redshift to automatically copy both manual and automated snapshots to another region

  • By default, automated backups of the whole cluster are enabled with 1 day of retention, which can be increased to up to 35 days

  • Free backup storage is limited to the total size of storage on the nodes and applies only to active clusters

  • Additional charges are incurred when the retention period is increased beyond 1 day

  • You can choose a backup version for restoration when needed

  • All previously created manual snapshots are retained, and their storage is charged, even after the cluster is deleted


Scalability

  • You can increase the number of nodes using Elastic Resize

  • Changes take effect within minutes

  • Metrics are available in the Console and the CloudWatch API. Detailed metrics can be enabled in CloudWatch

  • Concurrency Scaling – Amazon automatically adds transient capacity to handle load spikes

  • With Concurrency Scaling, you can have virtually unlimited concurrent users and concurrent queries

  • Amazon automatically directs queries to the transient clusters

  • Each account gets up to 1 hour of credit per day to use transient capacity

  • Elastic Resize – it is different from Concurrency Scaling:

    • It adds nodes to a single cluster within minutes to manage its query throughput

    • Concurrency Scaling adds additional clusters to handle a large number of concurrent queries and directs queries to those clusters

  • With Redshift Spectrum, you can have multiple clusters processing S3 data

  • You can use different clusters for different use cases – e.g. one cluster for regular queries and one for data science


Concurrency Level

  • Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for various types of queues, including internal system queues and user-accessible queues.

  • From a user perspective, a user-accessible service class and a queue are functionally equivalent.

  • With manual WLM, Amazon Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently.


  • You can define up to eight queues.

  • Each queue can be configured with a maximum concurrency level of 50.

  • The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.

  • The easiest way to modify the WLM configuration is by using the Amazon Redshift Management Console. You can also use the Amazon Redshift command line interface (CLI) or the Amazon Redshift API. At the session level, queue routing and memory can also be influenced in SQL, as sketched after this list

  • Currently, the default for clusters using the default parameter group is to use automatic WLM. Automatic WLM manages query concurrency and memory allocation
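
A small sketch of those session-level controls; the query group name priority_high is hypothetical and must match a query group listed in your WLM queue configuration:

    SET query_group TO 'priority_high';  -- route this session's queries to the matching queue
    SET wlm_query_slot_count TO 3;       -- claim 3 concurrency slots (more memory) for a heavy query
    -- ... run the heavy query here ...
    RESET query_group;
    RESET wlm_query_slot_count;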


External Schema and Table

  • Redshift Spectrum uses external tables to access data located outside of Redshift clusters

  • External table metadata can be created in Redshift, the Athena Data Catalog, AWS Glue, or a Hive metastore

  • For external table metadata in Glue, Athena, or a Hive metastore, you must first create an external schema that references the external database

  • To define an external table in an external schema, you must be the owner of the external schema or a superuser

  • To run a Redshift Spectrum query, the following permissions are needed:

    • Usage permission on the schema: GRANT USAGE ON SCHEMA <schema name> TO GROUP <group name>

    • Permission to create temporary tables: GRANT TEMP ON DATABASE <database name> TO GROUP <group name>

  • You must qualify the external table name with the schema name for tables located in an external schema

  • To define an external table in the data catalog, use the CREATE EXTERNAL TABLE command

  • You can supply the path and table properties, for example TABLE PROPERTIES ('numRows'='<n>');

  • Redshift keeps path and size information in the pseudocolumns $path and $size

  • Partitioning a Redshift Spectrum external table (see the sketch after this list):

    • Keep the data in S3 in folders matching the partition key within the bucket

    • Define the table with a PARTITIONED BY clause; use ALTER TABLE ... ADD PARTITION to register new partitions
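
Putting the pieces together as a hedged sketch; the schema, database, role, bucket, group, and column names are all hypothetical:

    -- External schema backed by the AWS Glue / Athena data catalog
    CREATE EXTERNAL SCHEMA spectrum_schema
    FROM DATA CATALOG
    DATABASE 'spectrum_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/mySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Partitioned external table over Parquet files in S3
    CREATE EXTERNAL TABLE spectrum_schema.sales (
        sale_id     integer,
        customer_id integer,
        amount      decimal(10,2)
    )
    PARTITIONED BY (sale_date date)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/sales/'
    TABLE PROPERTIES ('numRows'='100000');

    -- Register a partition that already exists as a folder in S3
    ALTER TABLE spectrum_schema.sales
    ADD PARTITION (sale_date='2023-01-01')
    LOCATION 's3://my-bucket/sales/sale_date=2023-01-01/';

    -- Permissions needed to query, plus the $path/$size pseudocolumns
    GRANT USAGE ON SCHEMA spectrum_schema TO GROUP analysts;
    GRANT TEMP ON DATABASE mydb TO GROUP analysts;
    SELECT "$path", "$size", sale_id FROM spectrum_schema.sales LIMIT 10;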


Redshift Data Distribution

A Redshift table is distributed across multiple nodes; the goal is to minimize data movement during query execution.

  • There are 4 ways of distributing data across the nodes and the slices within the nodes:

  • AUTO – Redshift assigns a distribution style based on the size of the data when no style is specified at table creation. The resulting style can be EVEN, KEY or ALL.

  • EVEN – rows are distributed evenly across the slices in round-robin fashion

  • KEY – rows are distributed based on the values of a column; rows with matching column values are physically stored together

  • ALL – the whole table is replicated to every node. It is appropriate only for relatively slow-moving tables, i.e. tables that are not updated frequently or extensively.


  • Historically it was not possible to change the distribution style after the table was created; newer Redshift releases support ALTER TABLE ... ALTER DISTSTYLE

  • The PG_CLASS_INFO system view shows a table's distribution; its RELEFFECTIVEDISTSTYLE column reports the current (effective) distribution style (see the sketch below)
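
A short DDL sketch of the KEY and ALL styles, with hypothetical table and column names; the final query checks the effective style:

    -- Co-locate rows that join on customer_id
    CREATE TABLE orders (
        order_id    integer,
        customer_id integer,
        order_date  date
    )
    DISTSTYLE KEY
    DISTKEY (customer_id);

    -- Small, rarely updated dimension table: replicate to every node
    CREATE TABLE region (
        region_id integer,
        name      varchar(50)
    )
    DISTSTYLE ALL;

    -- Check the effective distribution style
    SELECT relname, releffectivediststyle
    FROM pg_class_info
    WHERE relname IN ('orders', 'region');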


Sort Keys


  • Similar to indexes in relational databases

  • The max and min values of the sort key for each block are stored as metadata. This helps Redshift quickly access the required blocks when executing a query

  • Rows are stored in sorted order of the values of the column(s) selected as the sort key

  • You can define one or more columns as the Sort Key

  • Enables faster execution of range queries

  • How to choose a Sort Key:

    • Recency-based (recently added records, e.g. a timestamp) – if queries mostly target recent data, use a timestamp column

    • Filtering on a column – specify that column

    • Joins – use the columns that appear in joins

  • Types of Sort Keys – single, compound and interleaved

  • Single Sort Key – a sort key defined on a single column, e.g. when queries mostly filter on a date or customer ID

  • Compound Sort Key – multiple columns that are used in joins and filters:

    • The default type

    • Good for compression

    • Not good when the query depends on a secondary sort column

  • Interleaved (see the sketch after this list):

    • Equal weight to each column in the Sort Key

    • Good when different columns are used in joins

    • Good when the query depends on a secondary sort column
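
A DDL sketch of both types, with hypothetical names; a compound key weights the leading column most, while an interleaved key weights each listed column equally:

    -- Compound sort key: best when filters usually include event_time
    CREATE TABLE events (
        event_time timestamp,
        user_id    integer,
        event_type varchar(20)
    )
    COMPOUND SORTKEY (event_time, user_id);

    -- Interleaved sort key: queries filter on either column alone
    CREATE TABLE events_interleaved (
        event_time timestamp,
        user_id    integer,
        event_type varchar(20)
    )
    INTERLEAVED SORTKEY (user_id, event_type);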


Integration with Other AWS Services


  • S3 – using the COPY command, you can load data from S3

  • DynamoDB – using the COPY command, you can load a single DynamoDB table into Redshift

  • EMR / EC2 – import data into Redshift via the COPY command

  • Data Pipeline – schedule and automate recurring data loads into Redshift

  • Database Migration Service – replicate data into Redshift from operational databases

  • Every COPY command requires at least 3 parameters:

    • Source

    • Target Redshift table

    • Authorization (an IAM role, or an access key ID and secret access key)

  • Enhanced VPC routing – forces all data movement between Redshift and other services, such as S3, through your VPC. It is much safer, as otherwise data is copied over the internet

  • Set up a NAT gateway or a VPC endpoint to copy data

  • Redshift Workload Management – short, fast-running queries are not stuck behind long-running queries:

    • Helps in prioritizing workloads

    • Queries are prioritized based on their nature

    • Query queues are defined through service classes

  • Separate queues for long-running queries and separate queues for short-running queries

  • Can be set via the Console, CLI or API

  • VACUUM command – recovers space from deleted rows and restores the sort order (see the sketch after this list):

    • VACUUM FULL – the default; re-sorts all rows and reclaims space

    • VACUUM DELETE ONLY – no sorting; reclaims space

    • VACUUM SORT ONLY – only sorts; does not reclaim space

    • VACUUM REINDEX – re-analyzes interleaved sort keys and then performs a full VACUUM
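
The variants as runnable statements against a hypothetical sales table:

    VACUUM FULL sales;         -- default: re-sorts all rows and reclaims space
    VACUUM DELETE ONLY sales;  -- reclaims space only, no sorting
    VACUUM SORT ONLY sales;    -- sorts only, no space reclaim
    VACUUM REINDEX sales;      -- re-analyzes interleaved sort keys, then vacuums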


Cost

  • When you select your instance type consider that Amazon Redshift compresses data up to four times

  • Amazon Redshift RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute and managed storage independently. With RA3, you choose the number of nodes based on your performance requirements and pay only for the managed storage that you use.

  • When you start using Amazon Redshift for the first time, you will receive a recommendation for the best node type based on your needs.

  • For workloads that require ever-growing storage, managed storage lets you automatically scale your data warehouse storage capacity without adding and paying for additional nodes.

    • You should size your RA3 cluster based on the amount of data you process daily.

    • Amazon Redshift managed storage uses large, high-performance SSDs in each RA3 node for fast local storage and Amazon Simple Storage Service (Amazon S3) for longer-term durable storage.

    • If the data in a node grows beyond the size of the large local SSDs, Amazon Redshift managed storage automatically offloads that data to Amazon S3.

    • You pay the same low rate for Amazon Redshift managed storage regardless of whether the data sits in high-performance SSDs or Amazon S3

  • Previous-generation node types include DC2 (compute intensive) and DS2 (storage intensive). Reserved instances (also called reserved nodes in the Amazon Redshift console) can provide up to 75% savings vs on-demand pricing

  • Trusted Advisor:

    • Redshift Reserved Node Optimization – recommends a different reserved node configuration for cost optimization

    • Underutilized clusters – checks for CPU utilization below 5% for 99% of the last 7 days

  • AWS Cost Explorer – helps to visualize, understand and manage AWS costs and usage over time across linked accounts, regions, user groups and tags for the last 12 months. You can create: a) budgets, b) cost and usage reports, and c) recommendations for purchasing reserved instances

  • Amazon Redshift Advisor – offers recommendations for:

    • Improving operational cost

    • Improving performance

  • Table compression – for cost optimization, the first thing you should focus on is table compression

  • Amazon Redshift Advisor tracks uncompressed tables and offers recommendations

  • Advisor identifies uncompressed data being loaded with the COPY command and recommends compressing it before copying to Redshift; this reduces data transfer cost

  • The COPY command with COMPUPDATE set to ON analyzes the data and applies compression when it is found uncompressed

  • You can use the ANALYZE COMPRESSION command to determine the compression needs of data without reloading it (see the sketch at the end of this section)

  • Pause and Resume feature of Redshift – when the cluster is paused, no compute cost is incurred, but storage cost still accrues. Development environments can be paused when not in use to save cost

  • Cluster Resize - a) Elastic Resize b) Classic Resize

  • Elastic Resize – the optimal and fastest way to quickly add or remove nodes or change node types. It automates the steps of taking a snapshot, creating a new cluster, deleting the old cluster, and renaming the new cluster to the old name.

  • This process can be used to change a DC2 or DS2 cluster to an RA3 cluster

  • Data is redistributed across the nodes

  • It takes only 10 to 15 minutes to complete the resize

  • Classic Resize – a manual approach to resizing a cluster

  • Elastic Resize is the recommended approach

  • Concurrency Scaling – used for spikes

  • Cluster Resize is used for known workloads, whereas Concurrency Scaling is used for clusters with load spikes

  • Concurrency Scaling is a cost-effective feature that uses concurrency nodes only when there is a demand spike

  • Each cluster earns up to 1 hour of Concurrency Scaling credit per day

  • To implement Concurrency Scaling, route queries to the concurrency scaling clusters by enabling a Workload Manager queue as a Concurrency Scaling queue.
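
Tying the compression bullets above to concrete commands (the bucket, role, and table names are hypothetical):

    -- Let COPY analyze and apply compression encodings on an empty table
    COPY sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    CSV
    COMPUPDATE ON;

    -- Report recommended column encodings without changing the table
    ANALYZE COMPRESSION sales;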


Redshift Workload Management

  • Enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries.

  • When you run a query, WLM assigns the query to a queue according to the user's user group or by matching a query group that is listed in the queue configuration with a query group label that the user sets at runtime.

  • With manual WLM, Amazon Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue with a concurrency level of one.

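
To inspect the queues WLM has actually configured, you can query a system table; service classes below 5 are internal to Redshift:

    SELECT service_class, num_query_tasks, query_working_mem
    FROM stv_wlm_service_class_config
    WHERE service_class > 4;   -- superuser and user-defined queues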



