Monitoring Spam Activity

2016-02-23

Automating some of the less glamorous aspects of managing a server. There is always more to do in keeping up internet services.

Recap

I noticed recently that some manner of botnet was posting gratuitous amounts of spam to a pastebin that I operate. The fix was to blacklist the entire subnet's address range.

Not So Fast

iptables do accomplish what I stated, with an important caveat that I did not address in the last post. They do not persist after rebooting. I discovered this fact after patching my system for the latest glibc vulnerability. I was scratching my head when I realized the same IP addresses I had just blocked were posting again, and only because I was manually rechecking the log files.

The solution is to use the iptables-save and iptables-restore commands during the boot sequence. Debian has a simple wrapper for this, iptables-persistent.

Issues with the current state of things

The biggest annoyance was not that I had to install a separate package for handling iptables but that I only discovered the issue after manually checking the log files. I set about writing a simple report on recent usage that will handle this for me.

A "Good Enough" Solution

I am not yet comfortable enough blanket identifying bad actors at the subnet level to fully automate the process. Instead I have settled on reporting suspicious activity.

For reference, a sample of Nginx's access.log looks as follows:

188.143.232.24 - - [14/Feb/2016:07:30:45 +0000] "POST / HTTP/1.1" 302 287 "http://glue.nprescott.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
188.143.232.24 - - [14/Feb/2016:07:30:46 +0000] "GET /content/974fee HTTP/1.1" 200 1639 "http://glue.nprescott.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
188.143.232.24 - - [14/Feb/2016:07:30:46 +0000] "POST / HTTP/1.1" 302 287 "http://glue.nprescott.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
188.143.232.24 - - [14/Feb/2016:07:30:46 +0000] "GET /content/16805f HTTP/1.1" 200 1338 "http://glue.nprescott.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
188.143.232.24 - - [14/Feb/2016:07:30:47 +0000] "POST / HTTP/1.1" 302 287 "http://glue.nprescott.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
188.143.232.24 - - [14/Feb/2016:07:30:47 +0000] "GET /content/6008f9 HTTP/1.1" 200 1500 "http://glue.nprescott.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
188.143.232.11 - - [14/Feb/2016:07:31:46 +0000] "GET / HTTP/1.1" 200 1214 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
188.143.232.11 - - [14/Feb/2016:07:31:46 +0000] "POST / HTTP/1.1" 302 287 "http://glue.nprescott.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

I approximated my manual process to the following steps:

look for "similar" IP addresses
check number of POST requests per day
check number of requests in recent days
filter IP address for suspicious activity (spam)

The first three are straight-forward to automate and I deigned to leave the final step a manual one, provided I send a notification for unusual activity.

I am using AWK for this, specifically GAWK (or NAWK) due to the strftime call, which is not available in MAWK and some others.

#!/usr/bin/gawk -f

 BEGIN {
     TODAY = strftime("%d/%b/%Y")
 }

 /POST.*glue.nprescott.com/ {
     gsub("\\[", "", $4)
     split($4, DATE_ARR, ":")
     DATE = DATE_ARR[1]

     split($1, NETWORK, ".")
     CURRENT_NETWORK = NETWORK[1] "." NETWORK[2] "." NETWORK[3]
     POSTS[DATE ":" CURRENT_NETWORK] += 1
 }

 END {
     for (TUPLE in POSTS) {
         split(TUPLE, DATE_NETWORK, ":")
         if (DATE_NETWORK[1] == TODAY) {
             TOTAL_TODAY += POSTS[TUPLE]
         }
         print(DATE_NETWORK[2], POSTS[TUPLE], DATE_NETWORK[1])
     }

     if (TOTAL_TODAY > 500) {
         exit 1
     }
 }

This simply counts POST requests per day, per IP subnetwork, ignoring host identifiers, and prints the subnetwork, number of requests, and date. I call this with a shell script, running on a crontab:

   #!/bin/bash

    TEMPFILE=/tmp/log-watch.txt

    gawk -f parse-log.awk /var/log/nginx/access.log \
        | sort -k3,3r -k2,2rn \
        | column -t > $TEMPFILE

    if [ "${PIPESTATUS[0]}" -ne 0 ]; then
        mail -s "log watch report" prescott.nolan@gmail.com <$TEMPFILE
    fi

The interesting points to note are the flags to sort -k3,3r and -k2,2rn which indicate a sort on two indices, the primary on the third column in reverse order and the secondary on the second column on reverse numeric order. The result is a report sorted by top requests per subnetwork per day:

188.143.232  1010  14/Feb/2016
188.143.232  2795  13/Feb/2016
222.186.31   1     13/Feb/2016
188.143.232  2753  12/Feb/2016
221.178.182  3     12/Feb/2016
176.94.194   1     12/Feb/2016

I have a check in the AWK script to exit 1 in the event of 500+ requests per day, but it is part of a pipeline of tasks in producing the report. I use the PIPESTATUS array to check GAWK's exit code and if there are more than 500 requests I email the report to myself to investigate.

The Enemy of The Good¹

While not a perfect solution, I have found this one sufficient for my needs. It is a quick hack that will probably never be rewritten or added to unless I decide the final report needs to look better. If I had decided to make the whole thing "perfect" in identifying and blacklisting without human intervention I would probably still be writing it.