Distribution key redshift

6/30/2023

Now that I’ve convinced you, what compression to pick for your columns? ANALYZE COMPRESSION TABLE_NAME_HERE The exception to this is the columns you leverage as sort keys: if those are highly compressed, it’s more work to find the data on disk which means more IO.Ĭonfused? Amazon has a helpful workflow for deciding if you should or shouldn’t compress a column.

The reason is that more data can be pulled into memory, which means less IO needs to be done fetching more data as the query runs, thus better performance: narrow columns (ie tightly compressed columns) thus help work zip by. Redshift stores data by column, not by row, and by minimizing the size on disk of columns, you end up getting better query performance. I’m going to assume that you know what column types and sizes you want, and skip constraints as more advanced than this post is meant for, though consider those if you want. More on why you need to consider disk IO here. Instead it must be considered and finessed when the table is designed, so that data on disk is as optimized as it can be for the query that comes in.įor this reason, when designing your table it is advantageous to know what your most important query will be so that you can ensure the design of the table assists the query. Reading to them, writing to them - while Redshift tries to optimize queries as much as it can ( more on query performance tuning here), this required work cannot be optimized around at query execution time. On IO and performance hitsĭisks are slow. While having a query that requires no passing of data ever is highly unlikely, as there is a cost with keeping data pristine at all times, distributing the data in such a manner that we minimize the passing of data will allow the cluster to run efficiently and make best use of Redshift.

If a query being performed requires multiple nodes to share data across each other constantly, then it will take a lot more effort for the query to be executed and optimization may be needed. If a query can be performed in parallel by multiple nodes, then congratulations: your data has been distributed well! ( More on parallel processing here.) By allowing each compute node to work independently, better performance is achieved. If instead a query is operating on data from the full available timeline, all compute nodes will be needed and they may need to share data across themselves. For example, if a query is operating only on data from May of 2017, and all of that data is stored on a single compute node, the leader only needs that node to perform the work. The leader node helps orchestrate the work the compute nodes do. Within a Redshift cluster, there is a leader node and many compute nodes. More on Redshift database development here. The reasons for this and how to best avoid these inefficiencies are detailed below. And since the scale is so much larger, the impact of IO can go up considerably, especially if the cluster needs to move or share data to perform a query. The main difference between Redshift and most other databases you’ll have encountered is due to scale, with the cluster being important to keep in mind in table design along with standard table design considerations. The Redshift database will behave like other databases you’ve encountered, but under the hood it has some extra considerations to take into account. The vast majority of this post actually comes from our internal documentation, so you can trust that we do use this to help educate those less familiar with Redshift, and get them ramped up and feeling comfortable. This is by no means a comprehensive review of Redshift, as then it’d no longer be a crash course, nor does this dive into data warehousing specifics, which I can cover in another post if people want.Īt a high level what I’ll be covering is: This post is meant to give you a crash course into working with Redshift, to get you off and running until you have the time and resources to come back and internalize what it all means. It can also store insane amounts of surprises, considerations, new ideas to learn, skewed tables to fix, distributions to get in line, what’s a WLM, what am I doing‽

0 Comments

Distribution key redshift

Leave a Reply.

Author

Archives

Categories