
PostgreSQL Disk Usage

Introduction

The following article is a bookmark of interesting queries to use when discovering the disk usage properties of a PostgreSQL database. All of the queries listed here can be found in the PostgreSQL Wiki.

General Table Size

This query will list each of the tables in a given database, showing you both raw and pretty-printed usage values.

SELECT *, pg_size_pretty(total_bytes) AS total
    , pg_size_pretty(index_bytes) AS INDEX
    , pg_size_pretty(toast_bytes) AS toast
    , pg_size_pretty(table_bytes) AS TABLE
  FROM (
  SELECT *, total_bytes-index_bytes-COALESCE(toast_bytes,0) AS table_bytes FROM (
      SELECT c.oid,nspname AS table_schema, relname AS TABLE_NAME
              , c.reltuples AS row_estimate
              , pg_total_relation_size(c.oid) AS total_bytes
              , pg_indexes_size(c.oid) AS index_bytes
              , pg_total_relation_size(reltoastrelid) AS toast_bytes
          FROM pg_class c
          LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
          WHERE relkind = 'r'
  ) a
) a;
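
If you save this query to a file, you can run it from the shell with psql; the table-sizes.sql file name and the mydb database here are assumptions:

# run the saved query against a database named "mydb"
psql -d mydb -f table-sizes.sql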

Largest Database

Calculates the size of each database, sorting databases that the user cannot access as if they were infinite in size.

SELECT d.datname AS Name,  pg_catalog.pg_get_userbyid(d.datdba) AS Owner,
    CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')
        THEN pg_catalog.pg_size_pretty(pg_catalog.pg_database_size(d.datname))
        ELSE 'No Access'
    END AS SIZE
FROM pg_catalog.pg_database d
    ORDER BY
    CASE WHEN pg_catalog.has_database_privilege(d.datname, 'CONNECT')
        THEN pg_catalog.pg_database_size(d.datname)
        ELSE NULL
    END DESC -- nulls first
    LIMIT 20;
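
If you just need a quick look, psql's \l+ meta-command also reports database sizes, assuming your default connection settings:

# list all databases along with their sizes
psql -c '\l+'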

Relations

The size of the tables in your database, broken down into their specific parts:

SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_relation_size(C.oid)) AS "size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
  ORDER BY pg_relation_size(C.oid) DESC
  LIMIT 20;
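
Note that pg_relation_size counts only the relation's main data; to include indexes and TOAST data as well, swap in pg_total_relation_size. A quick one-liner, again assuming a database named mydb:

# top 5 relations by total size, including indexes and TOAST
psql -d mydb -c "SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
                   FROM pg_class WHERE relkind = 'r'
                  ORDER BY pg_total_relation_size(oid) DESC LIMIT 5;"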

Data manipulation at the shell

Introduction

From time to time, you’ll find that some tasks could be easily achieved at the command line if you just had that one tool that you could slot in. In today’s article, I’ll take you through a few common data manipulation/mangling tools that should get you pretty productive.

head

output the first part of files

The head command will allow you to peek into a file. This is really handy when you are dealing with huge files, and you only want to sample the first n lines (or bytes).

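# view the first 10 lines (the default)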
head myfile.txt

# view the first 10 bytes of a file
head -c 10 myfile.txt

# view the first 10 lines of a file
head -n 10 myfile.txt

tail

output the last part of files

The tail command will allow you to sample the end of a file; tail works as head’s complement. The --follow/-f switch is very handy with the tail command. When a file is still being written to, --follow will allow you to continually stream the latest bytes as they arrive.

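# view the last 10 lines (the default)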
tail myfile.txt

# view the last 10 bytes of a file
tail -c 10 myfile.txt

# view the last 10 lines of a file
tail -n 10 myfile.txt

# follow the output of a file 
tail -f myfile.txt

iconv

convert text from one character encoding to another

Being able to change the character encoding of the files that you’re working on can simplify your processing greatly. By only needing to deal with a single encoding, you can remove this class of issue from your pipeline. A more comprehensive writeup on iconv can be found here.

# convert a file from ascii to utf-8
iconv -f ascii -t utf-8 a-test-file > a-test-file.utf8

tr

translate or delete characters

tr will allow you to translate your input in such a way that you can cleanse information; it will translate, squeeze, and/or delete characters, as the documentation says.

# replace the spaces with tab characters
echo "All spaced out" | tr [:space:] '\t'

The [:space:] identifier used here is a special character class. Others are supported, too.

Identifier Description
[:alnum:] all letters and digits
[:alpha:] all letters
[:blank:] all horizontal whitespace
[:cntrl:] all control characters
[:digit:] all digits
[:graph:] all printable characters, not including space
[:lower:] all lower case letters
[:print:] all printable characters, including space
[:punct:] all punctuation characters
[:space:] all horizontal or vertical whitespace
[:upper:] all upper case letters
[:xdigit:] all hexadecimal digits
[=CHAR=] all characters which are equivalent to CHAR
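
The -s (squeeze) switch is also worth a mention; it collapses runs of a repeated character down to a single occurrence:

# squeeze runs of spaces down to one space
echo "too    many    spaces" | tr -s ' '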

wc

print newline, word, and byte counts for each file

wc takes its input and counts lines, words, characters, or bytes.

# count the number of bytes in a file
wc -c myfile.txt

# count the number of lines
wc -l myfile.txt
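
# count the number of words
wc -w myfile.txt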

split

split a file into pieces

split takes a file, and cuts it into smaller pieces. This is really handy when your input file is massive; cutting the job down into smaller pieces gives you the chance to parallelize this work appropriately.

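# split contacts.csv into 100-line pieces, named contact-aa, contact-ab, ...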
split -l 100 contacts.csv contact-

sort

sort lines of text files

The sort command will allow you to sort a text file by any column, in a couple of different ways.

# sort a csv by the 5th column, alpha
sort -t"," -k5,5 contacts.csv

# sort a csv by the 3rd column, numerically
sort -t"," -k3n,3 contacts.csv

# sort a csv by the 8th column, numerically reversed
sort -t"," -k8nr,8 contacts.csv

uniq

report or omit repeated lines

# show a unique list of names (uniq only removes adjacent
# duplicates, so sort the input first)
sort names | uniq
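
The -c switch is handy for frequency counts:

# count how many times each name appears
sort names | uniq -c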

cut

remove sections from each line of files

Cutting columns from your file can be useful if you need to trim information from your data source prior to moving to the next phase of your pipeline.

# select only the fifth column
cut -d, -f 5 contacts.csv

# select columns 2 through 4
cut -d, -f 2-4 contacts.csv
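
To actually remove a column rather than select it, GNU cut offers a --complement switch:

# remove the fifth column, keeping everything else
cut -d, --complement -f 5 contacts.csv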

paste

merge lines of files

The paste command takes multiple files and stitches the corresponding lines of each together.

# colours.txt
blue
red
orange

# sports.txt
swimming
cricket
golf

These values can be pasted together:

paste -d ',' colours.txt sports.txt

The output of which would look like this:

blue,swimming
red,cricket
orange,golf

join

join lines of two files on a common field

The join command will run a fairly basic INNER JOIN between two files. One column from each file is chosen, and the join is performed, leaving you with the coinciding set (see the sketch after the example).

# join contacts (col 5) on accounts (col 4)
join -t"," -1 5 -2 4 contacts.csv accounts.csv

grep, sed, and awk

Each of these commands really deserves an article of its own; they are full programming tools in their own right.

All of these are excellent tools to allow you to build complex processing pipelines from your console.
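
As a small taste, here is a sketch of a pipeline that uses all three. The app.log file and its "YYYY-MM-DD LEVEL message" line format are assumptions:

# count the ERROR lines per day in a hypothetical date-prefixed log
grep ERROR app.log \
  | sed 's/ ERROR.*$//' \
  | awk '{ counts[$1]++ } END { for (d in counts) print d, counts[d] }'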

RSocket for streaming applications

Introduction

In a world of changing software requirements, and more demand for results and analytics in realtime, framework and system creators have needed to become smarter with the way that they reason about information ingestion. Streaming frameworks have started to make good ground in establishing themselves as accessible software platforms for many developers.

In today’s article, we’ll explore RSocket very briefly. RSocket provides an application protocol designed directly for reactive streaming applications.

Core Features

The core features of implementing this specification are as follows:

  • Metadata and Payload frames
  • All 4 interaction models : Fire-and-forget, request/response, requestStream, requestChannel
  • Request-N frame : application level flow control
  • Fragmentation/Reassembly : Drivers are assumed to fully encode/decode the expected user data type
  • Keep-Alive frame : A responder receiving a keep-alive frame must reply a keep-alive frame
  • Error Frame : in order to fully support connection lifecycle
  • Handling the unexpected : If Resumption, Leasing or an extension is not supported, rejected error frames must be used

Example

The following are two code snippets taken from the RSocket website.

The sample server:

RSocketFactory.receive()
    .frameDecoder(Frame::retain)
    .acceptor(new PingHandler())
    .transport(TcpServerTransport.create(7878))
    .start()
    .block()
    .onClose();

The sample client:

Mono<RSocket> client =
    RSocketFactory.connect()
        .frameDecoder(Frame::retain)
        .transport(TcpClientTransport.create(7878))
        .start();

PingClient pingClient = new PingClient(client);

Recorder recorder = pingClient.startTracker(Duration.ofSeconds(1));

int count = 1_000;

pingClient
    .startPingPong(count, recorder)
    .doOnTerminate(() -> System.out.println("Sent " + count + " messages."))
    .blockLast();

Perform your shell work in parallel

Introduction

In some cases, breaking your larger programming problems into smaller, parallelizable units makes sense from a time perspective. If the work you are trying to perform exhibits these parallelizable characteristics, you should only need to wait for the longest of your jobs to finish.

In today’s post, we’ll be talking about GNU Parallel.

A summary from their website:

GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.

Input

The input system is quite flexible. Delimiting the inputs with the ::: operator, parallel will make a Cartesian product out of the input values.

parallel echo ::: John Paul Sally
John
Paul
Sally

parallel echo ::: John Paul Sally ::: Smith
John Smith
Paul Smith
Sally Smith

parallel echo ::: John Paul Sally ::: Smith Jones Brown
John Smith
John Jones
John Brown
Paul Smith
Paul Jones
Paul Brown
Sally Smith
Sally Jones
Sally Brown

Linkage is possible using :::+, should this flexibility be required.

parallel echo ::: John Paul Sally :::+ Smith Jones Brown
John Smith
Paul Jones
Sally Brown

See more about input sources in the tutorial.
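
Inputs can also arrive on standard input, with each line becoming one job; the hosts.txt file here is an assumption:

# read inputs from a pipe rather than ::: arguments
cat hosts.txt | parallel ping -c 1 {}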

curl

For some examples, I’ll use curl.

Let’s get three web pages downloaded:

  • google.com
  • yahoo.com
  • zombo.com

Getting these down one at a time clocks in at nearly 1.2 seconds.

time ( curl www.google.com && curl www.yahoo.com && curl www.zombo.com; )
( curl www.google.com && curl www.yahoo.com && curl www.zombo.com; )  0.02s user 0.05s system 5% cpu 1.195 total

Running these downloads in parallel, we take nearly half a second off the time:

time parallel curl {1} ::: www.google.com www.yahoo.com www.zombo.com
parallel curl {1} ::: www.google.com www.yahoo.com www.zombo.com  0.21s user 0.04s system 33% cpu 0.774 total
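
By default, parallel runs one job per CPU core; the -j switch controls how many jobs run at once:

# limit to 2 simultaneous downloads
parallel -j 2 curl {1} ::: www.google.com www.yahoo.com www.zombo.com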

Summing up

GNU Parallel is a great utility to get multiple things done at once at the shell. Take a look at the tutorial and immediately become more productive.

Mounting Windows filesystems in Linux

A quick note on mounting Windows filesystems on Linux.

Your platform will require cifs-utils.

sudo apt install cifs-utils

From here, you can connect and mount a remote file system.

sudo mount -t cifs //ip.address.of.windows/location/to/mount /mnt -o user=john,password=mypassword,domain=mydomain

Done. You can now access the remote filesystem.
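
To keep the password out of your shell history, you can move the account details into a credentials file; the path used here is an assumption:

# contents of /home/john/.smbcredentials
username=john
password=mypassword
domain=mydomain

# mount using the credentials file
sudo mount -t cifs //ip.address.of.windows/location/to/mount /mnt -o credentials=/home/john/.smbcredentials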