I've recently had the opportunity (or misfortune, depending on your outlook) to do some work in sed, the stream editor. I am working on files too large to read into memory reliably or quickly, so I turned to some old-fashioned Unix power tools.
I have been working on a gross amalgam of crufty old software, mucking about on an in-house test framework. Ideally I'd be fixing the root cause rather than patching over some serious flaws, but requirements beyond my control necessitate otherwise. I needed to extract the field names specified in a SQL query from files containing both the query and the resultant data.
To give you an idea of what the files look like:
----- INPUT -----
SELECT
FIELDNAME1, FIELDNAME2, FIELDNAME3, FIELDNAME4,
...
FIELDNAME42, FIELDNAME43, FIELDNAME44, FIELDNAME45
FROM TABLE_NAME
ORDER BY
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45
----- OUTPUT -----
344 344 34407 3440704 NULL NULL 0 0.9799 ...
355 7.355 7.337 7.337 6887.39747 993389.556 NULL 0 ...
344 344 34407 3440705 NULL NULL 0 76.586 ...
846 4.846 4.69 4.69 70367.377465 7769893.60437 NULL 0 ...
... ... ... ... ... ... ... ... ...
With files sometimes reaching over a million lines in length, it was a natural fit to pipe each line through various stages of processing to extract and transform the data into the desired format. By operating at the line level, the whole process is pretty light on memory and fast.
sed -n '/SELECT/,/FROM/p' ./filename.txt
This defines the range I'll be working in by the two patterns (start,end). Sed won't bother applying any rules or editing operations to lines outside this range. The -n tells sed not to print every line it "sees", and the p flag tells it to print the lines matched by the range pattern.
sed -e '/SELECT/d' -e '/FROM/d'
Delete the lines containing the words SELECT and FROM; those would muck up the formatting I was aiming for.
sed 's/^[ \t]*//g'
This strips leading whitespace. It isn't strictly necessary for how the data is used later in the pipeline, but it made the intermediate output much nicer to read while developing and debugging the script.
sed 's/,//g'
This strips commas. Again, not strictly necessary, but a sight better to work on.
If you pipe each of those steps end to end over the block of data above, you get the following:
FIELDNAME1 FIELDNAME2 FIELDNAME3 FIELDNAME4
...
FIELDNAME42 FIELDNAME43 FIELDNAME44 FIELDNAME45
Which is nearly what I'm looking for. The goal is to extract the field names into a table header for the data (again, external requirements, not my idea!). I struggled for a while over how best to capture all of the field names; initially I had split them each onto their own line and wrestled with AWK to transpose the whole thing into one line. Not for nothing, though, am I reading The UNIX Programming Environment: I realized I could simply call tr to replace the newlines with a single space, which gives the following:
FIELDNAME1 FIELDNAME2 FIELDNAME3 FIELDNAME4 ... FIELDNAME42 FIELDNAME43 FIELDNAME44 FIELDNAME45 bash-$>
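Note the shell prompt stuck to the end of the output: tr replaces every newline, including the final one. It's easy to see in isolation:

bash-$> printf 'A\nB\nC\n' | tr '\n' ' '
A B C bash-$>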
This is slightly idealized; the regular expressions as written proved a bit too fragile to cope with the variability seen in some of the data. It isn't much of a stretch, though, to see how this can be extended to account for things like SQL's ROUND(fieldname,4) as a field name, or keywords like FROM that fall on the same line as a field name.
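As a sketch, such extensions might look something like this (untested, and the patterns are illustrative rather than lifted from the real script):

sed 's/[[:space:]]*FROM.*$//'               # drop a FROM clause sharing a line with field names
sed 's/ROUND(\([^,()]*\),[0-9][0-9]*)/\1/g' # unwrap ROUND(FIELDNAME,4) into FIELDNAME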
I had some difficulty figuring out how to gracefully combine the table data with this new header line. Obviously I knew about cat, but I was trying to avoid temp files and named pipes, and it took a bit of digging to find a pretty neat trick: the shell's process substitution. (Technically it's built on top of temporary files or named pipes, but in using it I can ignore this fact.) I think this might be my favorite new thing.
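The canonical illustration is comparing the output of two commands without any intermediate files (the file names here are just for show):

diff <(sort file1.txt) <(sort file2.txt)

Applied to the problem at hand, the whole thing becomes: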
# the echo forces a final newline, since tr replaced them all
cat \
<(sed -n '/SELECT/,/FROM/p' filename.txt \
| sed -e '/SELECT/d' -e '/FROM/d' \
| sed 's/^[ \t]*//g' \
| sed 's/,//g' \
| tr '\n' ' '; echo "") \
<(table-data-from-another-process) \
| column -t # dead simple text tables!
Which feels to me pretty Unix-y, leveraging a variety of small tools to accomplish the job. The result looks like this:
FIELDNAME1 FIELDNAME2 FIELDNAME3 FIELDNAME4 ... FIELDNAME42 FIELDNAME43 FIELDNAME44 FIELDNAME45
344 344 34407 3440704 NULL NULL 0 0.9799 ...
355 7.355 7.337 7.337 6887.39747 993389.556 NULL 0 ...
344 344 34407 3440705 NULL NULL 0 76.586 ...
846 4.846 4.69 4.69 70367.377465 7769893.60437 NULL 0 ...
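column -t deserves a quick aside: it does all of the alignment work. A toy example (exact spacing may vary by implementation):

bash-$> printf 'a bb ccc\n1111 2 3\n' | column -t
a     bb  ccc
1111  2   3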
Several days later, I found that the above breaks down on more complex SQL queries and I was forced to re-evaluate the whole thing. What I settled on instead is the following AWK script to split out the headers, keeping the calls to cat, tr, and column -t.
#!/usr/bin/awk -f
BEGIN {
    IGNORECASE = 1      # gawk extension: match SELECT/FROM in any case
    PAREN_COUNTER = 0   # tracks how deep we are inside parentheses
}
/SELECT/,/FROM/ {
    sub(/SELECT/, "")   # strip the keywords themselves
    sub(/FROM.*/, "")   # and anything trailing a FROM on the same line
    split($0, characters, "")   # gawk: empty separator splits into characters
    for (i = 1; i <= length(characters); i++) {
        if (characters[i] == "(") {
            PAREN_COUNTER++
        }
        else if (characters[i] == ")") {
            PAREN_COUNTER--
        }
        if (characters[i] == "," && PAREN_COUNTER == 0) {
            # a comma outside parentheses separates fields:
            # drop it and start a new line
            printf("\n")
        }
        else {
            printf("%s", characters[i])
        }
    }
}
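Saved as, say, extract-headers.awk (the name is mine) and made executable, it slots into the earlier pipeline in place of the sed chain:

cat \
<(./extract-headers.awk filename.txt | tr '\n' ' '; echo "") \
<(table-data-from-another-process) \
| column -t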
It works on nested parentheses and is ultimately more readable. I suppose it's a win, albeit a little boring to spend a day "fixing" an issue without addressing the root cause.
This one could probably also be filed under "problems I shouldn't be solving", but it was a quick hack to get it working. I had a need to capture the colorized output of git diff --color-words, which nicely captures the differences in tabular data like that shown above. The colorization in the terminal relies on ANSI terminal escape codes, which are interpreted by the terminal according to various system or user level preferences. You may have seen them if you've ever played with changing the colors of your bash prompt: \033[31m and \033[39m, for example.
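They're easy to see in action; 31 switches the foreground color to red and 39 switches it back to the default:

bash-$> printf 'this is \033[31mred\033[39m, and this is not\n'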
In retrospect, sed was not the best tool for this job, but it was pretty painless to iteratively develop the regular expressions needed to capture and format the text. I may or may not look into porting this to something more robust in the future.
Thankfully, the output of git diff is pretty small, so I didn't need to handle too much of the actual range of ANSI escape codes. The full formatting code is this:
git diff --color-words file1 file2 \
| sed 's#\x1b\[1m\(.*\)\x1b\[m#<span class="bold">\1</span>#g' \
| sed 's#\(.*\)\x1b\[m$#\1#g' \
| sed 's#\x1b\[31m\([^\x1b\[m]*\)#<span class="red">\1</span>#g' \
| sed 's#\x1b\[32m\([^\x1b\[m]*\)#<span class="green">\1</span>#g' \
| sed 's#\x1b\[m##g' \
| sed 's#\x1b\[36m##g'
This was actually done the day before the SQL query extraction, which may be obvious in my (mis-)handling of sed. I figured: it's a stream editor, so what's the harm in one more pipeline stage? I'm certain there is a less hacky way to go about this, but working is good enough for me.
Some points worth mentioning:
- sed accepts nearly any character as the delimiter for the s command. The default / would conflict with the closing HTML tags, so I opt for # here.
- \( regex \) captures a group, which can then be referenced in the replacement with \1.
- I have since found out that I should have used &, which refers to the entire matched text rather than a numbered back-reference - but it works like this.
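For example, a toy demonstration of & (not from the pipeline above):

bash-$> echo foo | sed 's#o#<&>#g'
f<o><o>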
After a few trial runs, git diff proved to be too slow for the volumes of data I was throwing at it; the sed hackery above, however, performed admirably. With the benefit of several days' hindsight, I can say I would probably not use sed in the first place, and if I did, I would write the whole thing as a single sed command. There's no real readability lost, and the entire process is one logical block of functionality rather than six separate stages.
git diff --color-words file1 file2 \
| sed 's#\x1b\[1m\(.*\)\x1b\[m#<span class="bold">\1</span>#g;
s#\(.*\)\x1b\[m$#\1#g;
s#\x1b\[31m\([^\x1b\[m]*\)#<span class="red">\1</span>#g;
s#\x1b\[32m\([^\x1b\[m]*\)#<span class="green">\1</span>#g;
s#\x1b\[m##g;
s#\x1b\[36m##g'
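As a quick smoke test, assuming the sed stage above is saved on its own as ansi2span.sh (a name I'm making up), a hand-rolled escape sequence comes out as expected; note the \x1b escapes require GNU sed:

bash-$> printf '\033[31mold\033[m\033[32mnew\033[m\n' | ./ansi2span.sh
<span class="red">old</span><span class="green">new</span>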
One thing I found in working through these issues was how much I enjoy working iteratively on the command line, despite the class of problems I'm stuck solving. The shell is a REPL with a fantastic standard library readily available, putting me in mind of Python or Elisp. There are undoubtedly some idiosyncrasies involved in treating everything as a file, or in standardizing on streams as the common interface, but the foundations are solid.