Data manipulation at the shell
03 Jun 2019
Introduction
From time to time, you’ll find that some tasks could be easily achieved at the command line if you just had that one tool that you could slot in. In today’s article, I’ll take you through a few common data manipulation/mangling tools that should get you pretty productive.
head
output the first part of files
The head command will allow you to peek into a file. This is really handy when you are dealing with huge files, and you only want to sample the first n lines (or bytes).
# with no options, head prints the first 10 lines
head myfile.txt
# view the first 10 bytes of a file
head -c 10 myfile.txt
# view the first 10 lines of a file
head -n 10 myfile.txt
tail
output the last part of files
The tail command will allow you to sample the end of a file. tail works as head’s complement. The --follow/-f switch is very handy with the tail command. When a file is still being written to, --follow will allow you to continually stream the latest bytes being written to the file as they arrive.
# with no options, tail prints the last 10 lines
tail myfile.txt
# view the last 10 bytes of a file
tail -c 10 myfile.txt
# view the last 10 lines of a file
tail -n 10 myfile.txt
# follow the output of a file
tail -f myfile.txt
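Because head and tail both read standard input, they can be chained to pull a block of lines out of the middle of a file. The following is a minimal sketch, assuming a hypothetical sample file generated with seq rather than a real data file:

```shell
# build a hypothetical 100-line sample file: one number per line
sample=$(mktemp)
seq 1 100 > "$sample"

# lines 16-20: keep the first 20 lines, then the last 5 of those
middle=$(head -n 20 "$sample" | tail -n 5)
echo "$middle"

# everything from line 98 onwards (a + prefix counts from the start of the file)
rest=$(tail -n +98 "$sample")
echo "$rest"

rm -f "$sample"
```

The same pattern works on a stream, so you can sample a slice of another command's output without writing it to disk first.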
iconv
convert text from one character encoding to another
Being able to change the character encoding of files that you’re working on can simplify your processing greatly. By only needing to deal with a single encoding, you can remove this class of issue from your pipeline. A more comprehensive writeup on iconv can be found here.
# convert a file from ascii to unicode
iconv -f ascii -t unicode a-test-file > a-test-file.unicode
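A common real-world case is normalising legacy Latin-1 data to UTF-8. The sketch below assumes a hypothetical one-line input that it creates itself, and assumes your iconv build knows the names ISO-8859-1 and UTF-8 (iconv -l lists what is supported):

```shell
# write a hypothetical Latin-1 file: "caf" followed by byte 0xE9 (e with an acute accent)
latin1=$(mktemp)
printf 'caf\351\n' > "$latin1"

# convert to UTF-8; the single byte 0xE9 becomes the two bytes 0xC3 0xA9
utf8=$(mktemp)
iconv -f ISO-8859-1 -t UTF-8 "$latin1" > "$utf8"

# compare sizes: the UTF-8 copy is one byte longer
before=$(wc -c < "$latin1")
after=$(wc -c < "$utf8")
echo "latin1: $before bytes, utf8: $after bytes"

rm -f "$latin1" "$utf8"
```

Writing to a second file, as here, avoids truncating the input before iconv has finished reading it.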
tr
translate or delete characters
tr will allow you to translate your input so that you can cleanse information. It can "translate, squeeze, and/or delete characters", as the documentation says.
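# replace whitespace with tab characters
# (quote the class so the shell cannot expand it as a glob; note that
# [:space:] also matches the trailing newline that echo adds)
echo "All spaced out" | tr '[:space:]' '\t'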
The [:space:] identifier used here is a special character class identifier. Others are supported, too.
Identifier | Description |
---|---|
[:alnum:] | all letters and digits |
[:alpha:] | all letters |
[:blank:] | all horizontal whitespace |
[:cntrl:] | all control characters |
[:digit:] | all digits |
[:graph:] | all printable characters, not including space |
[:lower:] | all lower case letters |
[:print:] | all printable characters, including space |
[:punct:] | all punctuation characters |
[:space:] | all horizontal or vertical whitespace |
[:upper:] | all upper case letters |
[:xdigit:] | all hexadecimal digits |
[=CHAR=] | all characters which are equivalent to CHAR |
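The three modes that the documentation mentions (translate, delete, squeeze) can each be shown in one line. This is a small sketch using made-up input strings:

```shell
# translate: upper-case every lower-case letter
upper=$(printf 'hello world\n' | tr '[:lower:]' '[:upper:]')
echo "$upper"      # HELLO WORLD

# delete: strip every digit
no_digits=$(printf 'abc123def\n' | tr -d '[:digit:]')
echo "$no_digits"  # abcdef

# squeeze: collapse runs of spaces into a single space
squeezed=$(printf 'too    many   spaces\n' | tr -s ' ')
echo "$squeezed"   # too many spaces
```

Note that tr only reads standard input, so files have to be redirected or piped into it.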
wc
print newline, word, and byte counts for each file
wc reads its input and counts the lines, words, and bytes in it.
# count the number of bytes in a file
wc -c myfile.txt
# count the number of lines
wc -l myfile.txt
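When wc is given a file name it prints that name next to the count, which gets in the way if you want to use the number in a script. Reading from standard input avoids this. A minimal sketch, assuming a hypothetical three-line sample file:

```shell
# hypothetical three-line sample file
sample=$(mktemp)
printf 'one two\nthree four five\nsix\n' > "$sample"

# reading from standard input makes wc print only the number, without the file name
lines=$(wc -l < "$sample")
words=$(wc -w < "$sample")
bytes=$(wc -c < "$sample")
echo "lines=$lines words=$words bytes=$bytes"

# wc also works at the end of a pipeline: count the words in the first two lines
first_two=$(head -n 2 "$sample" | wc -w)
echo "words in first two lines: $first_two"

rm -f "$sample"
```

Some implementations pad the number with leading spaces, so compare the results numerically rather than as strings.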
split
split a file into pieces
split takes a file and cuts it into smaller pieces. This is really handy when your input file is massive; cutting the job down into smaller pieces gives you the chance to parallelize the work appropriately.
# cut contacts.csv into 100-line pieces named contact-aa, contact-ab, and so on
split -l 100 contacts.csv contact-
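To see what split produces, and that nothing is lost along the way, you can split a file and then stitch the pieces back together. This sketch assumes a hypothetical 250-line stand-in for contacts.csv, created in a scratch directory:

```shell
# work in a scratch directory with a hypothetical 250-line input file
workdir=$(mktemp -d)
seq 1 250 > "$workdir/contacts.csv"

# cut it into pieces of at most 100 lines: contact-aa, contact-ab, contact-ac
split -l 100 "$workdir/contacts.csv" "$workdir/contact-"
pieces=$(ls "$workdir"/contact-* | wc -l)
echo "pieces: $pieces"

# the last piece holds the remaining 50 lines
last_lines=$(wc -l < "$workdir/contact-ac")
echo "lines in last piece: $last_lines"

# the shell expands contact-* in sorted order, so cat rebuilds the original
cat "$workdir"/contact-* | cmp -s - "$workdir/contacts.csv" && rebuilt=yes || rebuilt=no
echo "rebuilt matches original: $rebuilt"

rm -rf "$workdir"
```

Each piece can then be handed to a separate worker, and the results concatenated in the same order.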
sort
sort lines of text files
The sort command will allow you to sort a text file by any column, in a couple of different ways.
# sort a csv by the 5th column, alpha
sort -t"," -k5,5 contacts.csv
# sort a csv by the 3rd column, numerically
sort -t"," -k3n,3 contacts.csv
# sort a csv by the 8th column, numerically, in reverse
sort -t"," -k8nr,8 contacts.csv
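The n modifier matters more than it might seem: without it, numbers are compared as text, character by character. The sketch below shows the difference on a made-up three-column csv (name, city, age):

```shell
# hypothetical three-column csv: name,city,age
data=$(mktemp)
printf 'carol,Perth,9\nalice,Sydney,30\nbob,Brisbane,100\n' > "$data"

# 2nd column, alphabetical
by_city=$(sort -t"," -k2,2 "$data")
echo "$by_city"

# 3rd column, numeric: 9 < 30 < 100
by_age=$(sort -t"," -k3n,3 "$data")
echo "$by_age"

# 3rd column, plain text comparison: "100" sorts before "30", which sorts before "9"
by_age_text=$(sort -t"," -k3,3 "$data")
echo "$by_age_text"

rm -f "$data"
```

Writing the key as -k3,3 rather than -k3 limits the comparison to that one column instead of running to the end of the line.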
uniq
report or omit repeated lines
# show a unique list of names
# (uniq only collapses adjacent duplicate lines, so sort the input first)
sort names | uniq
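uniq compares each line only with the one directly before it, which is why it is almost always paired with sort. A minimal sketch, assuming a hypothetical list of names whose duplicates are not adjacent:

```shell
# hypothetical list of names with non-adjacent duplicates
names=$(mktemp)
printf 'bob\nalice\nbob\ncarol\nalice\nbob\n' > "$names"

# on unsorted input uniq removes nothing here, as no duplicates are adjacent
unsorted_count=$(uniq "$names" | wc -l)
echo "unsorted: $unsorted_count lines"

# sorting first groups the duplicates so uniq can collapse them
unique_names=$(sort "$names" | uniq)
echo "$unique_names"

# -d prints only the names that occur more than once
repeated=$(sort "$names" | uniq -d)
echo "$repeated"

rm -f "$names"
```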
cut
remove sections from each line of files
Cutting columns from your file can be useful if you need to trim information from your data source prior to moving to the next phase of your pipeline.
# keep only the fifth column (cut prints the fields you select and drops the rest)
cut -d, -f 5 contacts.csv
# keep only columns 2 through 4
cut -d, -f 2-4 contacts.csv
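The -f option names the fields that cut prints, so dropping a column means selecting everything else. GNU cut also has a --complement option that inverts the selection; it is an extension and may not exist in other implementations. This sketch uses a made-up five-column csv:

```shell
# hypothetical five-column csv
data=$(mktemp)
printf 'id,name,email,city,phone\n1,alice,a@example.com,Sydney,555-0100\n' > "$data"

# -f selects: print only columns 2 to 4
kept=$(cut -d, -f 2-4 "$data")
echo "$kept"

# --complement (a GNU extension) inverts the selection: drop column 5, keep the rest
dropped=$(cut -d, --complement -f 5 "$data")
echo "$dropped"

# portable alternative: list the columns to keep explicitly
portable=$(cut -d, -f 1-4 "$data")
echo "$portable"

rm -f "$data"
```

Keep in mind that cut splits on every delimiter it sees, so it is only safe on csv files without quoted fields that contain commas.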
paste
merge lines of files
The paste command takes multiple files and joins their corresponding lines side by side.
# colours.txt
blue
red
orange
# sports.txt
swimming
cricket
golf
These values can be pasted together:
paste -d ',' colours.txt sports.txt
The output of which would look like this:
blue,swimming
red,cricket
orange,golf
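paste has two other tricks worth knowing: -s joins the lines of a single file into one row, and a file name of - stands for standard input. The sketch below recreates the two sample files in a scratch directory to show both:

```shell
# recreate the two sample files in a scratch directory
workdir=$(mktemp -d)
printf 'blue\nred\norange\n' > "$workdir/colours.txt"
printf 'swimming\ncricket\ngolf\n' > "$workdir/sports.txt"

# -s (serial) joins all the lines of one file into a single row
row=$(paste -s -d ',' "$workdir/colours.txt")
echo "$row"

# "-" stands for standard input, so paste can sit in the middle of a pipeline
numbered=$(seq 1 3 | paste -d ',' - "$workdir/colours.txt" "$workdir/sports.txt")
echo "$numbered"

rm -rf "$workdir"
```

Without -d, paste separates the columns with a tab.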
join
join lines of two files on a common field
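The join command will run a fairly basic INNER JOIN between two files. One column from each file is chosen as the join field, and only the lines whose values match in both files are kept, leaving you with the coinciding set. Both files need to be sorted on their join field beforehand.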
# join contacts (col 5) on accounts (col 4)
join -t"," -1 5 -2 4 contacts.csv accounts.csv
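join expects both inputs to be sorted on the join field, so in practice a sort usually comes first. A minimal worked example, assuming two small made-up files that share an account id (column 2 of contacts, column 1 of accounts):

```shell
# hypothetical files sharing an account id: contacts has it in column 2, accounts in column 1
workdir=$(mktemp -d)
printf 'bob,20\nalice,10\ndave,40\n' > "$workdir/contacts.csv"
printf '10,Acme\n20,Globex\n30,Initech\n' > "$workdir/accounts.csv"

# join expects both inputs sorted on the join field, so sort each one first
sort -t"," -k2,2 "$workdir/contacts.csv" > "$workdir/contacts.sorted"
sort -t"," -k1,1 "$workdir/accounts.csv" > "$workdir/accounts.sorted"

# output is the join field, then the remaining fields of file 1, then those of file 2
joined=$(join -t"," -1 2 -2 1 "$workdir/contacts.sorted" "$workdir/accounts.sorted")
echo "$joined"

rm -rf "$workdir"
```

dave (account 40) and Initech (account 30) have no partner in the other file, so neither appears in the output. That is the inner join behaviour described above.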
grep, sed, and awk
Each of these commands really needs its own article. They are full programming tools in their own right.
All of these are excellent tools to allow you to build complex processing pipelines from your console.
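As a small taste of how the pieces fit together, here is a sketch of a pipeline that answers "which cities appear most often?" for a made-up csv of name,city rows. awk appears only as a final reformatting step and is not explored any further here:

```shell
# hypothetical csv of name,city rows
data=$(mktemp)
printf 'alice,Sydney\nbob,Perth\ncarol,Sydney\ndave,Brisbane\nerin,Sydney\nfrank,Perth\n' > "$data"

# take the city column, group identical cities, count them,
# order by count (highest first), keep the top 2, and print as city,count
top=$(cut -d, -f 2 "$data" | sort | uniq -c | sort -rn | head -n 2 | awk '{ print $2 "," $1 }')
echo "$top"

rm -f "$data"
```

Each stage does one small job, which makes it easy to build the pipeline up one command at a time and inspect the output at every step.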