Josh Peng

Running out of space on your Redshift cluster sucks. Compression helps
Aug 23, 2018
2 minute read

It’s not clear to me why Redshift doesn’t just create tables with compression automatically, but if you need more space, you can likely compress one of your big tables to free up what you need.

  1. Figure out which compression algorithm to use on each column

    analyze compression my_schema.my_table
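
    The output is one row per column with a recommended encoding and an estimated space saving; the numbers below are purely illustrative:

    table    | column | encoding | est_reduction_pct
    ---------+--------+----------+------------------
    my_table | col_1  | zstd     | 60.50
    my_table | col_2  | zstd     | 42.10
    my_table | col_3  | lzo      | 25.00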
  2. Create a new table using the recommended compression encodings

    create table my_schema.my_table_new
    (
        col_1 int encode zstd,
        col_2 int encode zstd,
        col_3 int encode lzo
    )
    distkey(some_column)
    sortkey(some_column);
  3. Copy all contents from the original table to the temp table

    insert into my_schema.my_table_new (select * from my_schema.my_table)

    You will be missing any new rows that landed in my_schema.my_table during this copy; we’ll backfill those in step 6.
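
    As a quick sanity check, you can compare row counts; the new table should only be short by however many rows landed mid-copy:

    select count(*) from my_schema.my_table;
    select count(*) from my_schema.my_table_new;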

  4. Grant any permissions on the new table that your ingestors or loaders will need

    grant all on all tables in schema my_schema to some_ingestor_account;
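
    If you’d rather not grant on every table in the schema, granting on just the new table should work too (same hypothetical ingestor account as above):

    grant all on my_schema.my_table_new to some_ingestor_account;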
  5. Swap the table names. This lets the ingestor start loading new data into our compressed table right away.

    alter table my_schema.my_table rename to my_table_old;
    alter table my_schema.my_table_new rename to my_table;
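
    Plain renames appear to run fine inside a transaction in Redshift (unlike some other alter table operations), so you can wrap both statements to avoid any moment where my_schema.my_table doesn’t exist:

    begin;
    alter table my_schema.my_table rename to my_table_old;
    alter table my_schema.my_table_new rename to my_table;
    commit;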
  6. Copy any missing rows left over from the old table. Depending on how your tables are set up, you may need to adjust the logic that determines which rows are missing.

    insert into my_schema.my_table
    ( -- Grab only missing rows from the old table
    select o.*
    from my_schema.my_table_old o
    left join my_schema.my_table n on o.event_uuid = n.event_uuid
    where n.event_uuid is null)
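
    If your table has no single unique column like event_uuid, join on whatever combination of columns identifies a row. A sketch using hypothetical user_id and event_time columns:

    insert into my_schema.my_table
    ( -- Same idea, but with a composite key
    select o.*
    from my_schema.my_table_old o
    left join my_schema.my_table n
        on o.user_id = n.user_id
        and o.event_time = n.event_time
    where n.user_id is null)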
  7. When everything looks good, you can drop the old table

    drop table my_schema.my_table_old
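
    If you want to see how much space you got back, svv_table_info reports per-table size in 1 MB blocks; run this before the drop to compare both tables:

    select "table", encoded, size as size_mb, pct_used
    from svv_table_info
    where "schema" = 'my_schema'
        and "table" in ('my_table', 'my_table_old');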

Tags: #redshift
