Text Hackery with Sed

2015-04-23

I've recently had the opportunity (or misfortune depending on your outlook) to do some work in sed, the stream editor. I am working on files too large to reliably or quickly read into memory so I turned to some old-fashioned Unix power tools.

SQL Queries to Table Header

I have been working on a gross amalgam of crufty old software, mucking about on an in-house test framework. Ideally, I wouldn't be patching over some serious flaws and instead be fixing the root cause but requirements beyond my control necessitate otherwise. I needed to extract the field names specified in a SQL query from files containing both the query and the resultant data.

To give you an idea what the files look like:

    ----- INPUT -----
    SELECT
        FIELDNAME1, FIELDNAME2, FIELDNAME3, FIELDNAME4,
        ...
        FIELDNAME42, FIELDNAME43, FIELDNAME44, FIELDNAME45
    FROM TABLE_NAME
        ORDER BY
          1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
          24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45
    ----- OUTPUT -----

     344    344  34407  3440704  NULL          NULL           0     0.9799  ...

     355  7.355  7.337    7.337  6887.39747    993389.556     NULL       0  ...

     344    344  34407  3440705  NULL          NULL           0     76.586  ...

     846  4.846   4.69     4.69  70367.377465  7769893.60437  NULL       0  ...

     ...    ...    ...      ...  ...           ...            ...      ...  ...

With files sometimes reaching over a million lines in length, it was a natual fit to pipe each line through various stages of processing to extract and transform the data into the desired format. By operating at the line level the whole process is pretty light on memory and fast.

sed -n '/SELECT/,/FROM/p' ./filename.txt

This defines the range I'll be working in by the two patterns (start,end). Sed won't bother applying any rules or editing operations to those lines outside this range. The -n tells sed to not print every line it "sees" and the p flag tells it to print those lines matched by the range pattern.

sed -e '/SELECT/d' -e '/FROM/d'

Delete the lines containing words SELECT and FROM, those would muck up the formatting I was aiming for.

sed 's/^[ \t]*//g'

This strips leading white space, which isn't strictly necessary for how this data is later used in the pipeline, but made developing and debugging the script much nicer to read.

sed 's/,//g'

Strips commas, again, not strictly necessary but a sight better to work on.

If you pipe each of those steps end to end on the block of data above you get the following:

    FIELDNAME1 FIELDNAME2 FIELDNAME3 FIELDNAME4
    ...
    FIELDNAME42 FIELDNAME43 FIELDNAME44 FIELDNAME45

Which is nearly what I'm looking for. The goal is to extract the field names into a table header for the data (again, external requirements, not my idea!). I struggled for a while on how best to capture all of the field names, initially I had split them each onto their own line and struggled with AWK to transpose the whole thing into one line. Not for nothing though am I reading The UNIX Programming Environment, I realized I could simply call tr to replace the newlines with a single space, which gives the following:

    FIELDNAME1 FIELDNAME2 FIELDNAME3 FIELDNAME4 ... FIELDNAME42 FIELDNAME43 FIELDNAME44 FIELDNAME45 bash-$>

This is slightly idealized, the regular expressions proved a bit too fragile as written to cope with the variability seen in some of the data. It isn't much of a stretch though to see how this can be extended to account for things like SQL's ROUND(fieldname,4) as a field name or keywords like FROM that fall on the same line as a field name.

I had some difficulty figuring out how to gracefully combine the table data with this new header line. Obviously I knew about cat but I was trying to avoid temp files and named pipes and it took a bit of digging to find a pretty neat trick. The shell's process substitution allows for the following (technically it's built on top of temporary files or named pipes, but in using it I can ignore this fact). I think this might be my favorite new thing ¹.

      cat \
        <(sed -n '/SELECT/,/FROM/p' filename.txt \
        | sed -e '/SELECT/d' -e '/FROM/d' \
        | sed 's/^[ \t]*//g' \
        | sed 's/,//g' \
        | tr '\n' ' '; echo "") \ # the echo forces a newline
        <(table-data-from-another-process) \
      | column -t # dead simple text tables!

Which feels to me pretty Unix-y, leveraging a variety of small tools to accomplish the job. The result looks like this:

    FIELDNAME1  FIELDNAME2  FIELDNAME3  FIELDNAME4  ...           FIELDNAME42    FIELDNAME43  FIELDNAME44  FIELDNAME45
    344         344         34407       3440704     NULL          NULL           0            0.9799       ...
    355         7.355       7.337       7.337       6887.39747    993389.556     NULL         0            ...
    344         344         34407       3440705     NULL          NULL           0            76.586       ...
    846         4.846       4.69        4.69        70367.377465  7769893.60437  NULL         0            ...

Addendum

Several days later, I've since found that the above breaks down on more complex SQL queries and was forced to re-evaluate the whole thing. What I settled on instead is the following AWK script to split out the headers, keeping the calls to cat, tr, and column -t.

        #!/usr/bin/awk -f

        BEGIN {
            IGNORECASE = 1
            PAREN_COUNTER = 0
        }

        /SELECT/,/FROM/ {
            sub(/SELECT/, "")
            sub(/FROM.*/, "")
            split($0, characters, "")

            for (i=1; i<=length(characters); i++) {

                if (characters[i] == "(") {
                    PAREN_COUNTER++
                }
                else if (characters[i] == ")") {
                    PAREN_COUNTER--
                }

                if (characters[i] == "," && PAREN_COUNTER == 0) {
                    printf("%s\n", characters[i])
                }
                else {
                    printf("%s", characters[i])
                }
            }
        }

It works on nested parentheses and ultimately is more readable. I suppose it's a win; albeit a little boring to spend a day "fixing" an issue without addressing the root cause.

ANSI Escape Codes as HTML

This one could probably also be filed under "problems I shouldn't be solving" but it was a quick hack to get it working. I had a need to capture the colorized output of git diff --color-words, which can capture nicely the differences in tabular data like that shown above. The colorization in the terminal relies on ANSI terminal escape codes, which are interpretered by the terminal according to various system or user level preferences. You may have seen them if you've ever played with changing the colors of your bash prompt: \033[31m\033[39m\033 .

In retrospect, sed was not the best tool for this job but it was pretty painless to iteratively develop the regular expressions needed to capture and format the text. I may or may not look into porting this to something more robust in the future.

Limited Scope

Thankfully, the output of git diff is pretty small, so I didn't need to handle too much of the actual range of ANSI escape codes. The full formatting code is this:

      git diff --color-words file1 file2 \
        | sed 's#\x1b\[1m\(.*\)\x1b\[m#<span class="bold">\1</span>#g' \
        | sed 's#\(.*\)\x1b\[m$#\1#g' \
        | sed 's#\x1b\[31m\([^\x1b\[m]*\)#<span class="red">\1</span>#g' \
        | sed 's#\x1b\[32m\([^\x1b\[m]*\)#<span class="green">\1</span>#g' \
        | sed 's#\x1b\[m##g' \
        | sed 's#\x1b\[36m##g'

This was actually done the day before the SQL query extraction, which may be obvious in my (mis-)handling of sed. I figured, it's a stream editor so what's the harm in one more pipeline stage? I'm certain there is a less hack-y way about this but working is good enough for me.

Some points worth mentioning:

sed allows for delimiters other than /, which would conflict with the closing HTML tags, so I opt for # here
sed captures text with \( regex \) which can then be referenced with \1. I have since found out that I should have used \&, which refers to the matching regex group instead of a back-reference - but it works like this
I had lingering escape characters after I had finished, which is why the last stages simply delete

Addendum

After a few trial runs git diff proved to be too slow for the volumes of data I was throwing at it, the above sed hackery however performed admirably. With the benefit of several days hindsight, I can say I would probably not use sed in the first place, and if I did, write the whole thing as a single sed command. There's no real readability lost and the entire process is one logical block of functionality, rather than six separate stages.

      git diff --color-words file1 file2 \
        | sed 's#\x1b\[1m\(.*\)\x1b\[m#<span class="bold">\1</span>#g;
               s#\(.*\)\x1b\[m$#\1#g;
               s#\x1b\[31m\([^\x1b\[m]*\)#<span class="red">\1</span>#g;
               s#\x1b\[32m\([^\x1b\[m]*\)#<span class="green">\1</span>#g;
               s#\x1b\[m##g;
               s#\x1b\[36m##g'

The Shell as a REPL

One thing I found in working through these issues was how much I enjoy working iteratively on the command line; despite the class of problems I'm stuck solving. The shell is a REPL with a fantastic standard library of functionality readily available, putting me in mind of Python or Elisp. There are undoubtedly some idiosyncrasies involved in treating everything as a file, or standardizing on streams as the common interface but the foundations are solid.

Interesting fact, the shell automatically parallelizes jobs that have been substituted like this