Parquet

06 Jul 2023

The Apache Parquet file format is used widely in the data space. It’s a column-oriented format that focuses on storing data as efficiently as possible, with emphasis on data retrieval.

Why?

The most common storage format that you’d use to hold data is CSV. It’s a good, simple format but has quite a number of shortcomings when it comes to data analysis.

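Parquet is a column-oriented format, which is much more sympathetic to data analysis: a query that only needs a few columns can read just those columns. CSV is row-oriented, which is a better fit for an OLTP scenario where whole records are read and written at a time.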
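
To make the row versus column distinction concrete, here is a minimal sketch in plain Python. The lists only stand in for on-disk layouts (this is not how Parquet physically stores bytes), and the variable names are made up for illustration.

```python
# Illustrative only: plain Python lists stand in for on-disk layouts.
rows = [
    ("John", "Smith", 23),
    ("Amber", "Brown", 31),
    ("Mark", "Green", 22),
    ("Jane", "Thomas", 26),
]

# Row-oriented (CSV-like): each record is stored together, so reading one
# field means walking every record.
ages_from_rows = [record[2] for record in rows]

# Column-oriented (Parquet-like): each column is stored together, so an
# aggregate over one column never touches the others.
columns = {
    "first_name": [r[0] for r in rows],
    "last_name": [r[1] for r in rows],
    "age": [r[2] for r in rows],
}
ages_from_columns = columns["age"]

average_age = sum(ages_from_columns) / len(ages_from_columns)
print(average_age)  # 25.5
```

Both layouts hold the same information; the difference is how much unrelated data has to be read to answer a question about a single column.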

Parquet offers compression and partitioning that is simply not available to the CSV format.
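
Part of the reason a columnar layout compresses well is that values within a single column tend to repeat. The following is a rough sketch of run-length encoding, one of the techniques that columnar formats such as Parquet draw on. The country column is invented for the example, and the helper functions are hypothetical; this is not PyArrow's implementation.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]

# A hypothetical "country" column, sorted so equal values sit together.
country = ["AU", "AU", "AU", "AU", "NZ", "NZ", "US", "US", "US"]

encoded = run_length_encode(country)
print(encoded)  # [('AU', 4), ('NZ', 2), ('US', 3)]
```

Nine values collapse to three pairs here. A row-oriented file interleaves every column, so runs like this rarely appear.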

So, both formats store information, but Parquet is better suited to data warehouse workloads.

Python

A really easy way to get started using Parquet is with Python. The PyArrow library provides Python bindings for Apache Arrow, a set of tools for in-memory analytics. PyArrow plays nicely with Pandas and NumPy so it’s a good fit.

Make sure you have pyarrow installed as a dependency.

import pandas as pd
import numpy as np
import pyarrow as pa

import pyarrow.parquet as pq

Create a table

First off, we’ll create a DataFrame from some raw data.

data = [
    ["John", "Smith", 23],
    ["Amber", "Brown", 31],
    ["Mark", "Green", 22],
    ["Jane", "Thomas", 26]
]

df = pd.DataFrame(
    data, 
    columns=["first_name", "last_name", "age"]
)

We can then use the from_pandas function to create a pyarrow.Table from this DataFrame.

table = pa.Table.from_pandas(df)

Basic I/O

Now that we have loaded a Table, we can save this data using write_table and pick it back up off disk using read_table.

pq.write_table(table, 'people.parquet')

people = pq.read_table('people.parquet')
people.to_pandas()

Finishing up

There are many more benefits to picking a more optimal data storage solution when working in the data analytics space.

Libuv

04 Jul 2023


libuv is a multi-platform library that provides your programs with asynchronous capabilities through the use of an event loop. Node.js is the most mainstream user of this library.

Today’s post will talk about this library and show some working examples.

Features

libuv provides quite a set of features:

Event loop
Async file and network I/O
File system events
IPC
Thread pool
Signal handling
High resolution clock

Event loop

When you’re programming in an event-driven environment, you need a medium that can transfer control over to your program when an event occurs. The event loop’s job is to do exactly this, running for as long as there are events to process.

If you were to think about it in C-style pseudo code, it might look something like this.

while (events_to_process) {
    event = get_next_event();
    
    if (event.callback) {
        event.callback();
    }
}
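
The same idea can be sketched as a small runnable program. This is a toy model in Python rather than anything libuv provides: the EventLoop class, its post method, and the event names are all hypothetical.

```python
from collections import deque

class EventLoop:
    """A toy event loop: a queue of (name, callback) pairs."""

    def __init__(self):
        self.events = deque()

    def post(self, name, callback=None):
        self.events.append((name, callback))

    def run(self):
        # Mirrors the pseudo code: keep going while there are events.
        while self.events:
            name, callback = self.events.popleft()
            if callback is not None:
                callback(name)

log = []
loop = EventLoop()
loop.post("connect", lambda name: log.append(f"handled {name}"))
loop.post("ignored")  # no callback registered, so it is skipped
loop.post("data", lambda name: loop.post("close", log.append))
loop.run()
print(log)  # ['handled connect', 'close']
```

Note that a callback is free to post further events, which is how the loop keeps itself busy.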

Watchers

The handles and requests that can send us events and signal our application are listed here:

/* Handle types. */
typedef struct uv_loop_s uv_loop_t;
typedef struct uv_handle_s uv_handle_t;
typedef struct uv_stream_s uv_stream_t;
typedef struct uv_tcp_s uv_tcp_t;
typedef struct uv_udp_s uv_udp_t;
typedef struct uv_pipe_s uv_pipe_t;
typedef struct uv_tty_s uv_tty_t;
typedef struct uv_poll_s uv_poll_t;
typedef struct uv_timer_s uv_timer_t;
typedef struct uv_prepare_s uv_prepare_t;
typedef struct uv_check_s uv_check_t;
typedef struct uv_idle_s uv_idle_t;
typedef struct uv_async_s uv_async_t;
typedef struct uv_process_s uv_process_t;
typedef struct uv_fs_event_s uv_fs_event_t;
typedef struct uv_fs_poll_s uv_fs_poll_t;
typedef struct uv_signal_s uv_signal_t;

/* Request types. */
typedef struct uv_req_s uv_req_t;
typedef struct uv_getaddrinfo_s uv_getaddrinfo_t;
typedef struct uv_getnameinfo_s uv_getnameinfo_t;
typedef struct uv_shutdown_s uv_shutdown_t;
typedef struct uv_write_s uv_write_t;
typedef struct uv_connect_s uv_connect_t;
typedef struct uv_udp_send_s uv_udp_send_t;
typedef struct uv_fs_s uv_fs_t;
typedef struct uv_work_s uv_work_t;

/* None of the above. */
typedef struct uv_cpu_info_s uv_cpu_info_t;
typedef struct uv_interface_address_s uv_interface_address_t;
typedef struct uv_dirent_s uv_dirent_t;

These are the handles that we can register interest in, so that the system will raise the corresponding events to us.

Get started

Before we get started, the libuv library needs to be installed along with the development files. In order to do this on my Debian machine, I’ll install the development and the runtime files.

libuv1 - asynchronous event notification library - runtime library
libuv1-dev - asynchronous event notification library - development files

Now, when we build an executable we need to link to the uv library using -luv. For the CMake test application that I’m writing with this article, I used:

target_link_libraries(uvtest uv)

Where uvtest is the name of my application.

First program

The “hello, world” of event loops. We’ll allocate the event loop, run the loop, and then clean up.

#include <stdlib.h>
#include <uv.h>

int main() {
    /* allocate and init the loop */
    uv_loop_t *loop = malloc(sizeof(uv_loop_t));
    uv_loop_init(loop);

    /* run the loop */
    uv_run(loop, UV_RUN_DEFAULT);

    /* clean up */
    uv_loop_close(loop);
    free(loop);
    
    return 0;
}

Idle

While our program is doing “nothing”, waiting for the next event, we can register a function to execute. You’ll notice in this code that we’re using the default loop (uv_default_loop) rather than allocating our own uv_loop_t as above, and that we attach a uv_idle_t handle to it. Using uv_idle_t allows us to register an “idler” function.

#include <inttypes.h>
#include <stdio.h>
#include <uv.h>

int64_t counter = 0;

void count_to_10(uv_idle_t* handle) {
    /* PRId64 is the portable printf format for int64_t */
    printf("Counter at: %" PRId64 "\n", counter++);

    if (counter > 10) {
        uv_idle_stop(handle);
    }
}

int main() {
    uv_idle_t idler;

    uv_idle_init(uv_default_loop(), &idler);
    uv_idle_start(&idler, count_to_10);

    uv_run(uv_default_loop(), UV_RUN_DEFAULT);

    uv_loop_close(uv_default_loop());
    return 0;
}

The idle function, count_to_10, counts up until we exceed 10 and then calls uv_idle_stop, which is our exit.
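
Why does stopping the idle handle end the program? A loop run in the default mode keeps going only while something is still active. The following toy model in Python sketches that behaviour; ToyLoop and IdleHandle are hypothetical stand-ins, not libuv bindings, and real libuv tracks many more handle types.

```python
class IdleHandle:
    """Hypothetical stand-in for uv_idle_t: a callback plus an active flag."""

    def __init__(self, callback):
        self.callback = callback
        self.active = False

class ToyLoop:
    def __init__(self):
        self.idle_handles = []

    def idle_start(self, handle):
        handle.active = True
        self.idle_handles.append(handle)

    def idle_stop(self, handle):
        handle.active = False

    def run(self):
        # Keep iterating while at least one handle is still active.
        while any(h.active for h in self.idle_handles):
            for handle in self.idle_handles:
                if handle.active:
                    handle.callback(self, handle)

state = {"counter": 0}
seen = []

def count_to_10(loop, handle):
    seen.append(state["counter"])
    state["counter"] += 1
    if state["counter"] > 10:
        loop.idle_stop(handle)

loop = ToyLoop()
loop.idle_start(IdleHandle(count_to_10))
loop.run()
print(seen)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Once the only handle is stopped, run has nothing left to wait on and returns, just as the C program falls through to its cleanup code.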

Finishing up

This has just been an introduction to the absolute basics of libuv.

The Rrrola Constant

02 Jul 2023

When you’re writing demos in 320x200 mode, you can quickly estimate X and Y coordinates from a screen offset with one simple multiply. The value that you use is 0xcccd, and it’s called the Rrrola Constant.

To start this, we need to adjust our video address. Remember, this is an estimate (it’s good enough); but it does need a bit more “help” in the setup.

  push  0xa000 - 10			
  pop   es				


top:
  xor   di, di
  mov   cx, 64000

Right now, es:[di] is pointing to the start of video memory (adjusted).

Now we perform the multiply

pat:
  mov  ax, 0xcccd
  mul  di

At this point the (y, x) pair is now available to us in (dh, dl): dh holds y and dl holds the scaled x. This is really handy for use in your rendering functions.

In this example, we just make a pixel that’s x xor y.

  xor  dh, dl
  mov  al, dh
  stosb
  loop pat

This works because the offset into the video buffer is worked out as (y * 320) + x. Multiplying this formula out by 0xcccd, we end up with (y * 0x1000040) + (x * 0xcccd).

The y * 0x1000000 part of the first term lands y in the top byte. The next byte along is (x * 0xcccd / 0x10000), which approximates to (x * 256/320), which is useful to us. The lower two bytes of the product are garbage.
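
The arithmetic can be checked with a short Python sketch that emulates the 16-bit multiply for every on-screen pixel. It uses the plain offset (y * 320) + x without the segment adjustment, and the function name estimate_xy is made up for this example.

```python
RRROLA = 0xCCCD

def estimate_xy(offset):
    """Emulate `mov ax, 0xcccd` / `mul di` and return (dh, dl)."""
    product = (offset * RRROLA) & 0xFFFFFFFF  # 32-bit result in dx:ax
    dx = product >> 16                        # high word
    return dx >> 8, dx & 0xFF                 # (dh, dl)

mismatches = 0
for y in range(200):
    for x in range(320):
        dh, dl = estimate_xy(y * 320 + x)
        # dh should be the row, dl the column scaled by 256/320
        if dh != y or dl != (x * 256) // 320:
            mismatches += 1

print(hex(320 * RRROLA))  # 0x1000040
print(mismatches)         # 0
```

For an unadjusted offset, dh matches y and dl matches x scaled into the range 0 to 255 across the whole screen.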

Full example

The following is a .com demo-style example which uses the above technique:

  org  100h

start:
  ; setup 320x200x256
  mov  ax, 0x0013
  int  0x10

  ; adjust screen segment to work
  ; with the Rrrola trick
  push 0xa000 - 10
  pop  es

top:
  ; start at the beginning of
  ; the video buffer
  xor  di, di
  mov  cx, 64000

pat:
  ; (dh, dl) -> (y, x)
  mov  ax, 0xcccd			 
  mul  di					

  ; col = x ^ y
  xor  dh, dl
  mov  al, dh
  
  ; paint
  stosb
  loop pat

  ; check for esc
  in  al, 60h				
  cmp al, 1
  
  jne top

  ; return to text
  mov  ax, 0x0003			
  int  0x10

  ; return to dos
  mov  ax, 0x4c00			
  int  0x21

Pretty!

[Image: XOR Demo]

Network Traffic Analysis with tcpdump

30 Aug 2022

Introduction

Sometimes it can be of value to be able to isolate and analyse specific network traffic that is flowing through your network interface. tcpdump offers you this capability in a command line application.

There are many tutorials already that take you through tcpdump comprehensively, so this article will just be constrained to usages that have benefited me.

Reading the output

In order for this tool to be of any use, it pays to know how to read the output. In this example I’m capturing all of the port 80 traffic flowing through my network interface.

tcpdump -nnSX port 80

The stream of output that you see after this (once you have some port 80 traffic going) is the output that you’ll use for analysis. Here’s an excerpt after hitting the first page on the internet.

21:49:54.056071 IP 192.168.20.35.60548 > 188.184.21.108.80: Flags [P.], seq 3043086668:3043087150, ack 3119373143, win 502, options [nop,nop,TS val 3532078292 ecr 2432081556], length 482: HTTP: GET /hypertext/WWW/TheProject.html HTTP/1.1
        0x0000:  4500 0216 a4d4 4000 4006 ed1d c0a8 1423  E.....@.@......#
        0x0010:  bcb8 156c ec84 0050 b561 d14c b9ed db57  ...l...P.a.L...W
        0x0020:  8018 01f6 62a6 0000 0101 080a d287 3cd4  ....b.........<.
        0x0030:  90f6 9e94 4745 5420 2f68 7970 6572 7465  ....GET./hyperte
        0x0040:  7874 2f57 5757 2f54 6865 5072 6f6a 6563  xt/WWW/TheProjec
        0x0050:  742e 6874 6d6c 2048 5454 502f 312e 310d  t.html.HTTP/1.1.

There’s lots here.

We’re given the time of the packet being observed 21:49:54.056071.

We’re given the network layer protocol IP, source address (my machine) 192.168.20.35 and port 60548; along with the destination 188.184.21.108 (on port 80).

The next field Flags [P.] is an encoded representation of the TCP flags. The following table gives a breakdown of these flag values.

Value	Flag	Description
S	SYN	Connection start
F	FIN	Connection finish
P	PUSH	Data push
R	RST	Connection reset
.	ACK	Acknowledgement

The combination of values tells you the flags that are up. In this case P. tells us this is a PUSH-ACK packet.
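
Reading the flags field can be sketched as a simple lookup. The mapping below only covers the characters in the table above, and decode_flags is a hypothetical helper rather than part of any tcpdump tooling.

```python
# Flag characters from the table above; tcpdump prints others too,
# which this sketch simply reports as unknown.
TCP_FLAGS = {"S": "SYN", "F": "FIN", "P": "PUSH", "R": "RST", ".": "ACK"}

def decode_flags(field):
    """Turn a tcpdump flags field such as '[P.]' into flag names."""
    return [TCP_FLAGS.get(ch, f"unknown({ch})") for ch in field.strip("[]")]

print(decode_flags("[P.]"))  # ['PUSH', 'ACK']
print(decode_flags("[S]"))   # ['SYN']
```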

The sequence number seq 3043086668:3043087150 tells us the run of bytes contained within this packet; the difference between the two values is the payload length, 482. The ack value ack 3119373143 is the next byte expected from the other side. The win value tells us the number of bytes available in the receive buffer, and is followed by the TCP options.

The packet length is given at the end of the line.

The data frame is now split into a hexadecimal representation in the middle (given by -X); and the ASCII representation to the right.
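
The hexadecimal column is the raw packet starting at the IP header, so the values on the summary line can be recovered from it. Here is a minimal Python sketch that decodes the first three rows of the excerpt; it assumes an IPv4 packet carrying TCP, which is what this capture contains.

```python
import socket
import struct

# First 0x30 bytes of the hex dump from the excerpt above.
raw = bytes.fromhex(
    "45000216a4d440004006ed1dc0a81423"
    "bcb8156cec840050b561d14cb9eddb57"
    "801801f662a600000101080ad2873cd4"
)

# IPv4 header: version/IHL, total length, then addresses at bytes 12..20.
ihl = (raw[0] & 0x0F) * 4
total_length = struct.unpack("!H", raw[2:4])[0]
src = socket.inet_ntoa(raw[12:16])
dst = socket.inet_ntoa(raw[16:20])

# The TCP header starts right after the IP header.
src_port, dst_port, seq, ack = struct.unpack("!HHII", raw[ihl:ihl + 12])
tcp_header_len = (raw[ihl + 12] >> 4) * 4
payload_len = total_length - ihl - tcp_header_len

print(src, src_port, "->", dst, dst_port)
print(seq, ack, payload_len)
```

The addresses, ports, sequence number, and the 482 byte length all line up with what tcpdump printed on the first line of the excerpt.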

With the basic output view out of the way, we can move on to some useful invocations.

Invocations

Filter by Port

As per the above example, we can filter traffic by any port that we give to the port keyword. Here we can see any SMTP traffic.

# just the traffic on port 25
tcpdump port 25

# traffic on ports ranging from 25 to 30
tcpdump portrange 25-30

Everything

Sometimes it can be useful to just receive everything flowing through a network interface.

tcpdump -i wlp4s0

Filter by Host

You can use the host keyword to see traffic going to or coming from an IP address. You can constrain this even further using src (coming from) or dst (going to).

tcpdump host 192.168.20.1

# packets going to 20.1
tcpdump dst 192.168.20.1

# packets coming from 20.1
tcpdump src 192.168.20.1

Filter by Network

Using broader strokes, you can use net to specify a full network to filter packets on. This will allow you to filter a whole network or subnet.

tcpdump net 192.168.20.0/24

Filter by Protocol

To see just ping (ICMP) traffic, filter by protocol like so:

tcpdump icmp

Conclusion

tcpdump is a very useful network analysis tool to discover what’s actually happening on your network. There’s a lot more power that can be unlocked by combining some of these basic filters together using the logical operators and, or, and not.

MYO Language with Antlr

30 Aug 2022

Introduction

ANTLR is a code generation tool for making language parsers. Using a grammar file, you can get ANTLR to generate code to read, interpret, and execute your very own code.

In today’s article I’ll walk through the basic setup to create a Calculator language that can execute simple equations in a golang project of our own.

Before you begin

You’ll need a JRE.

Before we start, there are some software pre-requisites. You will need to install ANTLR. This is a simple JAR File that we can invoke locally.

$ wget http://www.antlr.org/download/antlr-4.7-complete.jar
$ alias antlr='java -jar $PWD/antlr-4.7-complete.jar'

Code generation

Now that we’ve got ANTLR installed, it’s time to generate some code. We do this using a grammar file. A very comprehensive calculator can be found in the examples of the ANTLR grammars repository.

For today’s example, we’ll just focus on addition, subtraction, multiplication, and division with the following grammar file:

// Calc.g4
grammar Calc;

// Tokens
MUL: '*';
DIV: '/';
ADD: '+';
SUB: '-';
NUMBER: [0-9]+;
WHITESPACE: [ \r\n\t]+ -> skip;

// Rules
start : expression EOF;

expression
   : expression op=('*'|'/') expression # MulDiv
   | expression op=('+'|'-') expression # AddSub
   | NUMBER                             # Number
   ;

Even without fully understanding the grammar language, you can see that there are some basic token definitions, rules, and expression definitions.

MUL, DIV, ADD, SUB, NUMBER, and WHITESPACE are all significant to the language that we’re defining.

The expression definition not only defines operations for us, but is also key in defining operator precedence: the MulDiv alternative is listed before the AddSub alternative, and Number comes last.

We can turn this grammar file into some Go code with the following invocation:

$ antlr -Dlanguage=Go -o parser Calc.g4

This creates a parser folder for us now with a few different pieces of go code.

Parsers, Lexers, and Listener

If you look in the parser folder at the code that was created, you should see something similar to this:

└── parser
    ├── calc_base_listener.go
    ├── calc_lexer.go
    ├── CalcLexer.tokens
    ├── calc_listener.go
    ├── calc_parser.go
    └── Calc.tokens

The Lexer’s job is to perform lexical analysis on arbitrary pieces of text, tokenizing that text into a set of symbols. For example, the input of 1 + 2 might get tokenized to NUMBER 1, ADD, NUMBER 2. These tokens are now fed into the parser.
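
To get a feel for what a lexer does, here is a small hand-written tokenizer in Python that mirrors the token definitions in Calc.g4. It is only a sketch of the idea: the tokenize function and TOKEN_SPEC table are made up for illustration, and ANTLR's generated lexer works quite differently internally.

```python
import re

# Token patterns mirroring the Calc.g4 token definitions.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("MUL", r"\*"),
    ("DIV", r"/"),
    ("ADD", r"\+"),
    ("SUB", r"-"),
    ("WHITESPACE", r"[ \r\n\t]+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(text):
    """Return (token_type, text) pairs, skipping whitespace."""
    tokens = []
    position = 0
    while position < len(text):
        match = TOKEN_RE.match(text, position)
        if match is None:
            raise ValueError(f"unexpected character {text[position]!r}")
        if match.lastgroup != "WHITESPACE":
            tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens

print(tokenize("1 + 2"))
# [('NUMBER', '1'), ('ADD', '+'), ('NUMBER', '2')]
```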

The Parser’s job is to take these tokens, and make sure they conform to the rules of the language. You can imagine that a LISP style language would expect ADD, NUMBER 1, NUMBER 2 rather than a c-style language that would expect the operator in between the number tokens.

After the string has passed through the lexer and the parser, it now runs through the listener where we can write some code to respond to these symbols in order.

Implementation

The internal implementation of this calculator is a stack-based calculator. This gets represented as a struct:

type calculatorListener struct {
	*parser.BaseCalcListener
	stack []int
}

The internal state of the calculator is the set of int values on that stack. As operations execute, the program will take the top of the stack (TOS) as well as the second-to-the-top (STOS) and perform arithmetic, leaving the result on the top of the stack.
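
The stack discipline is easier to see in isolation. This Python sketch runs the same kind of stack machine over a list of numbers and operators given in the order a tree walk would finish each node. The evaluate function and the event list are assumptions for illustration and are not part of the generated code.

```python
def truncating_div(lhs, rhs):
    """Integer division that truncates toward zero, as Go's / does for ints."""
    quotient = abs(lhs) // abs(rhs)
    return quotient if (lhs < 0) == (rhs < 0) else -quotient

OPERATIONS = {
    "+": lambda lhs, rhs: lhs + rhs,
    "-": lambda lhs, rhs: lhs - rhs,
    "*": lambda lhs, rhs: lhs * rhs,
    "/": truncating_div,
}

def evaluate(events):
    """Run a stack machine over numbers and operators in exit order."""
    stack = []
    for event in events:
        if event in OPERATIONS:
            rhs, lhs = stack.pop(), stack.pop()  # TOS first, then STOS
            stack.append(OPERATIONS[event](lhs, rhs))
        else:
            stack.append(int(event))
    return stack.pop()

# "1 + 5 - 2 * 20" in the order a tree walk would leave each node.
print(evaluate(["1", "5", "+", "2", "20", "*", "-"]))  # -34
```

Popping the right-hand side first matters for subtraction and division, where operand order changes the answer.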

func (l *calculatorListener) push(i int) {
	l.stack = append(l.stack, i)
}

func (l *calculatorListener) pop() int {
	if len(l.stack) < 1 {
		panic("TOS invalid")
	}

	result := l.stack[len(l.stack)-1]
	l.stack = l.stack[:len(l.stack)-1]

	return result
}

The BaseCalcListener type that was generated for us has all of the hooks we need to latch onto to complete the implementation. The Number, AddSub, and MulDiv alternatives all get their own listener methods for us to respond to.

func (l *calculatorListener) ExitMulDiv(c *parser.MulDivContext) {
	// get TOS and STOS
	rhs, lhs := l.pop(), l.pop()

	// perform the required operation, pushing the result back
	// up as the new TOS
	switch c.GetOp().GetTokenType() {
	case parser.CalcParserMUL:
		l.push(lhs * rhs)
	case parser.CalcParserDIV:
		l.push(lhs / rhs)
	default:
		panic(fmt.Sprintf("not yet implemented: %s", c.GetOp().GetText()))
	}
}

func (l *calculatorListener) ExitAddSub(c *parser.AddSubContext) {
	// get TOS and STOS
	rhs, lhs := l.pop(), l.pop()

	// perform the required operation, pushing the result back
	// up as the new TOS
	switch c.GetOp().GetTokenType() {
	case parser.CalcParserADD:
		l.push(lhs + rhs)
	case parser.CalcParserSUB:
		l.push(lhs - rhs)
	default:
		panic(fmt.Sprintf("not yet implemented: %s", c.GetOp().GetText()))
	}
}

func (l *calculatorListener) ExitNumber(c *parser.NumberContext) {
	// coerce the string into an integer
	i, err := strconv.Atoi(c.GetText())
	if err != nil {
		panic(err.Error())
	}

	// push onto the stack
	l.push(i)
}

Execution

Now we go from text input to execution. In the following snippet, the input stream feeds the text into the lexer. The lexer is then set up as a token stream, ready to tokenize our input.

Finally, all of those tokens get parsed to make sure they represent valid expressions for our language.

equation := "1 + 5 - 2 * 20"
is := antlr.NewInputStream(equation)
lexer := parser.NewCalcLexer(is)
stream := antlr.NewCommonTokenStream(lexer, antlr.TokenDefaultChannel)

p := parser.NewCalcParser(stream)

We can now walk the parse tree with a listener attached. The listener will fire off the hooks that we defined earlier, and our stack-based calculator should leave us with the result at the TOS.

var listener calculatorListener
antlr.ParseTreeWalkerDefault.Walk(&listener, p.Start())
answer := listener.pop()

fmt.Printf("%s = %d", equation, answer)

We should be left with something like this on screen:

1 + 5 - 2 * 20 = -34

Conclusion

As you can see, ANTLR is a very powerful tool for writing all of the pieces of a compiler (or in this case, an interpreter), and it gets you started very quickly.

You’d almost be insane to ever do this stuff yourself!


Cogs and Levers A blog full of technical stuff
