Cogs and Levers: A blog full of technical stuff

Range generation in Redshift

A little while ago, I wrote an article on Range generation in PostgreSQL. Whilst Redshift and PostgreSQL are similar database beasts, their query capabilities differ quite a bit because of their different strengths.

In today’s article, we’ll walk through a couple of different solutions to range generation.

generate_series

generate_series is a pretty serious tool, and it achieves exactly what this article is talking about. Unfortunately, there is no analog for this function in Redshift. Who knows; maybe one day.

Until then though, we need to create our own.

The query

with digit as (
    select 0 as d union all
    select 1 union all select 2 union all select 3 union all
    select 4 union all select 5 union all select 6 union all
    select 7 union all select 8 union all select 9
),
seq as (
    select a.d + (10 * b.d) + (100 * c.d) + (1000 * d.d) as num
    from digit a
        cross join
        digit b
        cross join
        digit c
        cross join
        digit d
    order by 1
),
drange as (
    SELECT (getdate()::date - seq.num)::date       as start_date
    FROM seq
    LIMIT 20
)
SELECT start_date
FROM drange;

There’s a bit going on in here.

The process starts with the digit CTE. It does nothing more than count from 0 up to 9.

select 0 as d union all
select 1 union all select 2 union all select 3 union all
select 4 union all select 5 union all select 6 union all
select 7 union all select 8 union all select 9

The next CTE, seq, uses the digit CTE 4 times. Each time it's treated as an increasing power of 10: a handles 10^0, b handles 10^1, etc. This cartesian effect quickly multiplies our 0-9 range up to 0-9999. If more numbers are required, we simply add another CROSS JOIN, and treat the next power of 10.

select a.d + (10 * b.d) + (100 * c.d) + (1000 * d.d) as num
from digit a
    cross join
    digit b
    cross join
    digit c
    cross join
    digit d
order by 1
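
As a quick sketch of that extension: taking the range up to 0-99999 just means one more cross join against another power of 10.

select a.d + (10 * b.d) + (100 * c.d) + (1000 * d.d) + (10000 * e.d) as num
from digit a
    cross join digit b
    cross join digit c
    cross join digit d
    cross join digit e
order by 1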

That’s it as far as number generation goes. It’s pretty easy.

I’ve seen other developers materialise this information into a physical table, and simply join to that table instead.
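
A sketch of that approach (the table name numbers here is just an example) would be to materialise the sequence once with CREATE TABLE ... AS, and select from it from then on:

-- materialise the number sequence once
create table numbers as
with digit as (
    select 0 as d union all
    select 1 union all select 2 union all select 3 union all
    select 4 union all select 5 union all select 6 union all
    select 7 union all select 8 union all select 9
)
select a.d + (10 * b.d) + (100 * c.d) + (1000 * d.d) as num
from digit a
    cross join digit b
    cross join digit c
    cross join digit d;

-- later queries simply select from (or join to) the table
select num from numbers order by num limit 20;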

Extending into time

Applying this to some date ranges now, we use the seq CTE and a little date arithmetic. My example here just uses the numbers within seq to enumerate dates. Adding a LIMIT to this query also prevents us from utilising all 10,000 values produced by seq.

SELECT (getdate()::date - seq.num)::date       as start_date
FROM seq
LIMIT 20

And that’s it.

Daemon development

In today’s article, I’ll take you through some basic daemon development for POSIX-based systems. This article should give you enough information to get started on your own project, ready for you to add in the bits that your program actually does.

Getting started

The largest amount of effort in writing any daemon goes into validation. Validation of incoming data is one thing; but we’re not even there yet. We’re just talking about validating the current system state, environment, and any configuration values.

fork

First job is to use the fork system call, to duplicate the current process.

fork() creates a new process by duplicating the calling process. The new process is referred to as the child process. The calling process is referred to as the parent process.

Directly after forking from the parent process, the parent is terminated, leaving the child to own the running of the daemon. This makes the daemon independent of the process that launched it; nothing can kill your daemon simply by killing its parent process.

A second fork is then invoked, usually after a call to setsid; it’s the setsid call that starts a new session with no controlling terminal.

setsid() creates a new session if the calling process is not a process group leader. The calling process is the leader of the new session (i.e., its session ID is made the same as its process ID). The calling process also becomes the process group leader of a new process group in the session (i.e., its process group ID is made the same as its process ID).

The side-effect that we’re looking for here is this:

Initially, the new session has no controlling terminal.

The child (by now the session leader) is then exited, ensuring the surviving process can never re-acquire a controlling terminal.

The grandchild is now the daemon.

pid_t pid;

/* the initial fork */
pid = fork();

/* terminate the parent */
if (pid == -1) {
  die("Failed to fork (errno=%d)", errno);
} else if (pid != 0) {
  exit(EXIT_SUCCESS);
}

/* make the child the session leader */
setsid();

/* SIGHUP now gets ignored, and we re-fork */
signal(SIGHUP, SIG_IGN);
pid = fork();

/* terminate the child */
if (pid == -1) {
  die("Failed to fork (errno=%d)", errno);
} else if (pid != 0) {
  exit(EXIT_SUCCESS);
}

/* we are now the grandchild;
   this is the daemon */

Working directory

Now that we have a safe running process, it’s time to change to a working directory that we know exists. The root directory always exists, so we’ll use it for our example.

if (chdir("/") == -1) {
  die("Unable to change working directory (errno=%d)", errno);
}

This puts us in a sane place, where we know where we are. It doesn’t need to be the root directory; any directory that’s guaranteed to exist will do.

File mask

Next up, we’ll set the umask for the process to 0. It is standard practice for a daemon to operate with a umask of 0. This means no permission bits are masked off at creation time; since most programs request a mode of 0666 for new files and 0777 for new directories, those objects really will be created world-writable.

umask() sets the calling process’s file mode creation mask (umask) to mask & 0777 (i.e., only the file permission bits of mask are used), and returns the previous value of the mask.

In short, a umask of 0 revokes nothing from the permissions requested at creation time; it’s then up to the daemon to request sensible permissions explicitly.

umask(0);

Standard File Descriptors

Daemons now close and re-open the standard file handles. STDIN, STDOUT, and STDERR all get closed. It’s assumed that /dev/null is available on any POSIX system, so it’s used to re-open these handles. Because open always returns the lowest available file descriptor, opening /dev/null three times in order re-attaches descriptors 0, 1, and 2.

close(STDIN_FILENO);
close(STDOUT_FILENO);
close(STDERR_FILENO);

if (open("/dev/null", O_RDONLY) == -1) {
  die("Unable to re-open STDIN (errno=%d)", errno);
}

if (open("/dev/null", O_WRONLY) == -1) {
  die("Unalb to re-open STDOUT (errno=%d)", errno);
}

if (open("/dev/null", O_RDWR) == -1) {
  die("Unable to re-open STDERR (errno=%d)", errno);
}

All together

Your daemonize process might now look something like this:

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/stat.h>   /* umask */
#include <unistd.h>

/* die() is assumed to be defined elsewhere: it logs the message and exits */
void die(const char *format, ...);

void daemonise() {
  /* 1. Initial fork */
  pid_t pid = fork();
  if (pid == -1) {
    die("Unable to fork (errno=%d)", errno);
  } else if (pid != 0) {
    _exit(0);
  }

  /* 2. Start a new session */
  if (setsid() == -1) {
    die("Unable to become a session leader (errno=%d)", errno);
  }

  /* 3. Fork again */
  signal(SIGHUP, SIG_IGN);
  pid = fork();

  if (pid == -1) {
    die("Unable to fork (errno=%d)",errno);
  } else if (pid != 0) {
    _exit(0);
  }

  /* 4. Set the working directory */
  if (chdir("/") == -1) {
    die("Unable to change the working directory (errno=%d)",errno);
  }

  /* 5. Set the umask */
  umask(0);

  /* 6. Deal with the file descriptors */
  close(STDIN_FILENO);
  close(STDOUT_FILENO);
  close(STDERR_FILENO);

  if (open("/dev/null", O_RDONLY) == -1) {
    die("Unable to re-open stdin (errno=%d)", errno);
  }
  if (open("/dev/null", O_WRONLY) == -1) {
    die("Unable to re-open stdout (errno=%d)",errno);
  }
  if (open("/dev/null", O_RDWR) == -1) {
    die("Unable to re-open stderr (errno=%d)",errno);
  }
}

Closing up

You’re now daemonised, and ready to start actually writing your daemon process.

One final point to note: you’ll probably want to install some signal handlers:

signal(SIGCHLD, SIG_IGN);
signal(SIGTSTP, SIG_IGN);
signal(SIGTTOU, SIG_IGN);
signal(SIGTTIN, SIG_IGN);
signal(SIGHUP, daemon_signalhandler);
signal(SIGTERM, daemon_signalhandler);
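
The daemon_signalhandler function referenced here isn’t defined in this article; it’s whatever your daemon needs it to be. As a minimal sketch (the flag names are just an assumption), the handler would typically only set flags that your main loop then checks, which keeps the handler itself async-signal-safe:

#include <signal.h>

/* sketch only: flags checked by the daemon's main loop */
static volatile sig_atomic_t reload_requested = 0;
static volatile sig_atomic_t shutdown_requested = 0;

void daemon_signalhandler(int sig) {
  switch (sig) {
  case SIGHUP:
    /* conventionally: re-read configuration */
    reload_requested = 1;
    break;
  case SIGTERM:
    /* conventionally: finish outstanding work and exit cleanly */
    shutdown_requested = 1;
    break;
  default:
    break;
  }
}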

Redshift variable hack

Being a data warehouse, Redshift from AWS doesn’t provide the same level of programming platform that you may have had the luxury of using elsewhere.

The concept of variables in Redshift is non-existent; however, with this clever hack you can fake your way into some variable-based behaviour.

WITH vars AS (
  SELECT 'John Smith' AS username,
         '2019-01-01T00:00:00.000Z'::timestamp AS process_date
)
SELECT *
FROM user_processing
WHERE "user" = (SELECT username FROM vars) AND
      date_of_process = (SELECT process_date FROM vars);

The same trick can be achieved using CREATE TEMP TABLE. The temporary-table version allows you to use your variables outside the context of that one query, whereas in the example above the variables are bound to the CTE feeding the query; a sketch of the temporary-table version follows.
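
A sketch of that temporary-table version (the user_processing table and its columns are the same made-up examples as above) might look like this:

CREATE TEMP TABLE vars AS
  SELECT 'John Smith' AS username,
         '2019-01-01T00:00:00.000Z'::timestamp AS process_date;

-- any query in the same session can now reference the "variables"
SELECT *
FROM user_processing
WHERE "user" = (SELECT username FROM vars) AND
      date_of_process = (SELECT process_date FROM vars);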

PostgreSQL Disk Usage

Introduction

The following article is a bookmark of interesting queries to use when discovering the disk usage properties of any PostgreSQL database. All of the queries listed here can be found in the PostgreSQL Wiki.

General Table Size

This query will list out each of the tables in a given database, and show you both raw and pretty-printed usage values.

SELECT *, pg_size_pretty(total_bytes) AS total
    , pg_size_pretty(index_bytes) AS INDEX
    , pg_size_pretty(toast_bytes) AS toast
    , pg_size_pretty(table_bytes) AS TABLE
  FROM (
  SELECT *, total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes FROM (
      SELECT c.oid,nspname AS table_schema, relname AS TABLE_NAME
              , c.reltuples AS row_estimate
              , pg_total_relation_size(c.oid) AS total_bytes
              , pg_indexes_size(c.oid) AS index_bytes
              , pg_total_relation_size(reltoastrelid) AS toast_bytes
          FROM pg_class c
          LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
          WHERE relkind = 'r'
  ) a
) a;

Largest Database

Calculates the size of each database; databases that the user cannot access are shown as 'No Access' and sorted first, as if infinite in size.

SELECT d.datname AS Name,  pg_catalog.pg_get_userbyid(d.datdba) AS Owner,
    CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')
        THEN pg_catalog.pg_size_pretty(pg_catalog.pg_database_size(d.datname))
        ELSE 'No Access'
    END AS SIZE
FROM pg_catalog.pg_database d
    ORDER BY
    CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')
        THEN pg_catalog.pg_database_size(d.datname)
        ELSE NULL
    END DESC -- nulls first
    LIMIT 20

Relations

The size of the tables in your database, broken down into their specific parts:

SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_relation_size(C.oid)) AS "size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
  ORDER BY pg_relation_size(C.oid) DESC
  LIMIT 20;

Data manipulation at the shell

Introduction

From time to time, you’ll find that some tasks could be easily achieved at the command line if you just had that one tool that you could slot in. In today’s article, I’ll take you through a few common data manipulation/mangling tools that should get you pretty productive.

head

output the first part of files

The head command will allow you to peek into a file. This is really handy when you are dealing with huge files, and you only want to sample the first n lines (or bytes).

head myfile.txt

# view the first 10 bytes of a file
head -c 10 myfile.txt

# view the first 10 lines of a file
head -n 10 myfile.txt

tail

output the last part of files

The tail command will allow you to sample the end of a file; tail works as head’s complement. The --follow/-f switch is very handy with the tail command. When a file is still being written to, --follow will continually stream the latest bytes to your terminal as they arrive.

tail myfile.txt

# view the last 10 bytes of a file
tail -c 10 myfile.txt

# view the last 10 lines of a file
tail -n 10 myfile.txt

# follow the output of a file 
tail -f myfile.txt

iconv

convert text from one character encoding to another

Being able to change the character encoding of the files that you’re working on can simplify your processing greatly. By only needing to deal with a single encoding, you can remove this class of issue from your pipeline. A more comprehensive writeup on iconv can be found here.

# convert a file from ascii to utf-8
iconv -f ascii -t utf-8 a-test-file > a-test-file.utf8

tr

translate or delete characters

tr will allow you to translate your input in such a way that you can cleanse information. Translate, squeeze, and/or delete characters as the documentation says.

# replace the spaces with tab characters
echo "All spaced out" | tr [:space:] '\t'

The [:space:] identifier used here is a special class identifier. There is support for others, too.

Identifier Description
[:alnum:] all letters and digits
[:alpha:] all letters
[:blank:] all horizontal whitespace
[:cntrl:] all control characters
[:digit:] all digits
[:graph:] all printable characters, not including space
[:lower:] all lower case letters
[:print:] all printable characters, including space
[:punct:] all punctuation characters
[:space:] all horizontal or vertical whitespace
[:upper:] all upper case letters
[:xdigit:] all hexadecimal digits
[=CHAR=] all characters which are equivalent to CHAR

wc

print newline, word, and byte counts for each file

Takes the input and counts things.

# count the number of bytes in a file
wc -c myfile.txt

# count the number of lines
wc -l myfile.txt

split

split a file into pieces

split takes a file, and cuts it into smaller pieces. This is really handy when your input file is massive; cutting the job down into smaller pieces gives you the chance to parallelize this work appropriately.

split -l 100 contacts.csv contact-
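
As a rough sketch of that parallelisation (process.sh here is a hypothetical script that handles a single chunk), you could hand the pieces out with xargs:

# cut the input into 100-line chunks, then process 4 chunks at a time
# (./process.sh is a hypothetical per-chunk script)
split -l 100 contacts.csv contact-
ls contact-* | xargs -n 1 -P 4 ./process.sh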

sort

sort lines of text files

The sort command will allow you to sort a text file by any column, in a couple of different ways.

# sort a csv by the 5th column, alpha
sort -t"," -k5,5 contacts.csv

# sort a csv by the 3rd column, numerically
sort -t"," -k3n,3 contacts.csv

# sort a csv by the 8th column, numerically reversed
sort -t"," -k8nr,8 contacts.csv

uniq

report or omit repeated lines

# show a unique list of names (uniq only removes adjacent
# duplicates, so the input is sorted first)
sort names | uniq

cut

remove sections from each line of files

Cutting columns from your file can be useful if you need to trim information from your data source prior to moving to the next phase of your pipeline.

# select the fifth column
cut -d, -f 5 contacts.csv

# select columns 2 through 4
cut -d, -f 2-4 contacts.csv

paste

merge lines of files

The paste command takes multiple files, and links each line of data together.

# colours.txt
blue
red
orange

# sports.txt
swimming
cricket
golf

These values can be pasted together:

paste -d ',' colours.txt sports.txt

The output of which would look like this:

blue,swimming
red,cricket
orange,golf

join

join lines of two files on a common field

The join command will run a fairly basic INNER JOIN between two files. One column from each file is chosen as the key, and the join leaves you with the coinciding set. Note that join expects both files to be sorted on their join fields; a sketch of that preparation follows the example below.

# join contacts (col 5) on accounts (col 4)
join -t"," -1 5 -2 4 contacts.csv accounts.csv

grep, sed, and awk

Each of these commands really needs its own article. They are full programming tools in their own right.

All of these are excellent tools to allow you to build complex processing pipelines from your console.