
Yak Shaving and Data Munging (with J)

2020-03-30

Hot on the heels of my last post about a small project in J, I thought it might be interesting to follow along with Allen Downey's Think Stats, which was written to use a number of libraries from the author as well as NumPy and matplotlib. Almost immediately I found a bit of a side quest in just getting set up to follow along.

Background

Chapter one deals with getting set up with the necessary libraries and data. The dataset in use is the National Survey of Family Growth from the CDC.

Provided with the book are libraries to parse and filter the data automatically, along with some descriptive text on the recoding and post-processing applied to the collection. The data itself is an enormous text file with hundreds of columns in a fixed-width layout. There are no delimiters or headers to indicate which field is which or even where the fields begin and end.

An example of the data - the first 6 lines of a file that is 13,593 lines long:

           1 1     6 1     11093 1084     9 039 9   0  1 813             1093 13837                        1                5                                                                        116610931166 9201093                111             3    1   12  11         5391 110933316108432411211     2995 1212 69544441116122222 2 2224693215    000000000000000000000000000000000000003410.38939935294273869.3496019830486 6448.2711117047512 91231
           1 2     6 1     11166 1157     9 039 9   0  2 714             1166  6542112  2  05 1 4  5       51               1   41  4  201 20                                                        1166109311661093116611661231        111             3    1   14  11         5391 211663925115738501211 2 432 8701414 69544441116122222 2 2224693215    000000000000000000000000000000000000003410.38939935294273869.3496019830486 6448.2711117047512 91231
           2 1     5 35    11156 1147     03939 9   0  1 9 2 2 2 0 1 1 4 1156  7524                        51               5                5551156      0                5551156      0            12041156120411531156                5          55         4  5 5511         5391 111561433114713585165     2995 5555 98420205611 71232 3 5321002315    00000000000000000000000000000000000000 7226.301739584323 8567.54910985691812999.5422643859022121231
           2 2     6 1     11198 1189     03939 9   0  2 7 0             1198  332513   3  05 1 4  5       551    1205  72 15                                                                        12041156120411561198                    4       5         4 3115511 2 32    1391 211981783118917085165 3 44299505353 98420205611 71232 3 5321002315    00000000000000000000000000000000000000 7226.301739584323 8567.54910985691812999.5422643859022121231
           2 3     6 1     11204 1195     03939 9   0  2 6 3             1204  272513   2  05 1 4  152     551    1221 172 15                                                                        1204115612041198120412041231            4       5         4 55 11   4 42    1391 312041833119517585165 2 44299535555 98420205611 71232 3 5321002315    00000000000000000000000000000000000000 7226.301739584323 8567.54910985691812999.5422643859022121231
           6 1     6 1     11074 1065     03838 9   0  1 8 9             1074 15727                        1                1   11  1   21  2                                                        11121074111210321074                15  1       1   2    1   11  11         5381 110742700106526251211     2  9 2323 75040401118122222 3 3222167315    00000000000000000000000000000000000000 4870.926434970875  5325.19699923372 8874.4407992229951231231

To my knowledge, J does not include a facility for reading fixed-width formats into an array, so I figured I'd do it myself.

The Task

Collect all of the Column Widths

The canonical reference for the CDC data is a "code book" available online as a single huge HTML page where each field garners a description, width and type (raw vs. computed). For example:

PRIORSMK(92-92)

Variable Type : Raw

BE-3 : Please look at Card 17. In the 6 months before you found out you were pregnant this (nth) time, how many cigarettes did you smoke a day, on the average?

value label   Total
. INAPPLICABLE   9581
0 NONE   2994
1 ABOUT ONE CIGARETTE A DAY OR LESS   86
2 JUST A FEW CIGARETTES A DAY (2-4)   178
3 ABOUT HALF A PACK A DAY (5-14)   382
4 ABOUT A PACK A DAY (15-24)   296
5 ABOUT 1 1/2 PACKS A DAY (25-34)   49
6 ABOUT 2 PACKS A DAY (35-44)   24
7 MORE THAN 2 PACKS A DAY (45 OR MORE)   2
8 REFUSED   1
  Total   13593

Universe : Applicable if this is a completed pregnancy ending in January 1997 or later (cmbabdob or cmotpreg ge cmjan97) and it did not end in induced abortion (BC-1 PREGEND1 and PREGEND2 did not include code 3)

In the above case, the only piece of information I need in order to parse the fixed-width file is in the first row, PRIORSMK(92-92), and really it is only the second number I'm interested in. In this particular case the field is 1 character wide, hence from character 92 to character 92. Other columns can be wider.
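
Since the fields are contiguous, these end positions are all that's needed - the widths themselves (if they're ever wanted) are just successive differences of the end positions. A toy sketch of my own in J, with a few hypothetical end positions:

   ends=. 12 14 16 17

   ends - 0 , }: ends
12 2 2 1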

In a small stroke of good luck, the HTML is very regular and is pretty easy to scrape directly from the browser's developer tools. In this case each heading is contained within a <b> (bold) tag inside a <p> (paragraph) tag. These field definitions are the only instance of that particular pattern, so we can scoop them all up with JavaScript without any filtering:

>> document.querySelectorAll('p>b')
NodeList(243) [ b, b, b, b, b, b, b, b, b, b, … ]

Even better would be to strip the extraneous text along the way - easy enough with a regular expression:

>> "PRIORSMK(92-92)".replace(/.*\d+-(\d+).*/, (s1, s2)=>s2)
92

Finally, because I'm dumping this all into J I may as well collect it in a format that is a valid J array with this gross one-liner:

>> Array.from(
    document.querySelectorAll('p>b')
).map(i => i.innerText.replace(
    /.*\d+-(\d+).*/, (s1, s2)=>s2)
).join(" ")

"12 14 16 17 18 19 20 21 22 23 27 28 32 33 37 39 41 43 45 47 49 50 51 52 53 54 55 56 58 60 61 62 64 66 67 68 70 72 73 77 81 83 84 85 86 87 89 90 91 92 93 94 95 97 98 99 100 101 102 104 105 107 108 109 110 114 118 121 122 123 124 125 126 129 130 133 134 137 138 141 142 143 144 148 152 155 156 157 158 159 160 162 163 165 166 168 169 171 172 173 174 178 182 185 186 187 188 189 190 191 192 193 194 195 196 197 201 205 209 213 217 221 225 229 233 234 235 236 238 240 242 244 245 246 247 248 249 250 253 254 255 256 258 259 260 261 262 263 264 266 268 269 270 271 272 273 274 276 277 279 283 287 291 295 296 297 298 299 301 303 304 305 308 309 310 311 312 313 317 319 321 322 323 325 327 328 329 330 331 333 335 336 337 340 341 342 343 344 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 404 422 440 441 443 447"

Parsing a Fixed-Width File

With all of the necessary field boundary information collected, it's time to turn to the J console, where I'd like to ingest both it and the file and end up with a navigable set of data. First, I assign the big array I found and check just how many fields there are:

   rawIndices=. 12 14 16 17 18 19 20 21 22 23 27 28 32 33 37 39 41 43 45 47 49 50 51 52 53 54 55 56 58 60 61 62 64 66 67 68 70 72 73 77 81 83 84 85 86 87 89 90 91 92 93 94 95 97 98 99 100 101 102 104 105 107 108 109 110 114 118 121 122 123 124 125 126 129 130 133 134 137 138 141 142 143 144 148 152 155 156 157 158 159 160 162 163 165 166 168 169 171 172 173 174 178 182 185 186 187 188 189 190 191 192 193 194 195 196 197 201 205 209 213 217 221 225 229 233 234 235 236 238 240 242 244 245 246 247 248 249 250 253 254 255 256 258 259 260 261 262 263 264 266 268 269 270 271 272 273 274 276 277 279 283 287 291 295 296 297 298 299 301 303 304 305 308 309 310 311 312 313 317 319 321 322 323 325 327 328 329 330 331 333 335 336 337 340 341 342 343 344 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 404 422 440 441 443 447

   #rawIndices
243

A brief look at the raw data shows:

   rawBytes=. freads '2002FemPreg.dat'

   $rawBytes
6089664

   rawData=. >cutLF rawBytes

   $rawData
13593 447

It's just 5.8 megabytes of text with no real structure. The fixed-width format means things are line-oriented, so I use cutLF ("cut line feed") to split the string into lines and > ("open") to stack them, giving a 13,593-row table of 447 characters each.
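
As a tiny sanity check of my own on those two verbs (using LF, the standard library's line feed constant), cutLF boxes each line and > stacks the boxes into a character table:

   >cutLF 'abc',LF,'def',LF
abc
def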

J provides a helpful conjunction ;. ("cut") to partition an array. The J Primer discusses it somewhat, but largely in the context of cutting on a sentinel value. There is this, though:

Sometimes the partition information is separate from the data. Instead of frets in the data, the partition information can be provided in a left argument to the derived verb. The partition information is boolean data where a 1 indicates the start (with 1 or _1) or end (with 2 or _2) of the partitions.
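
Before pointing that at the real data, a toy example of my own makes the behavior clear - the boolean left argument marks where each box begins:

   1 0 0 1 0 1 0 0 <;.1 'abcdefgh'
┌───┬──┬───┐
│abc│de│fgh│
└───┴──┴───┘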

With that in mind, the goal becomes turning an array of indices into a boolean array of partitioning points. To begin, I'll create all numbers up to the maximum index value:

   someIndex=. 10

   i. someIndex
0 1 2 3 4 5 6 7 8 9

   maxIndex=. >./rawIndices

   maxIndex
447

With a generated series of numbers up to 447, we want to check, for each position, whether it appears among the scraped indices - if it doesn't, it falls inside a field (and thus isn't a partition point). For this I'll use e. ("member in") to return a boolean array marking which of those numbers occur in rawIndices:

   (i. maxIndex) e. rawIndices
0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 0 1 1 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 0 0 1 0 0 0 1 0 0 1 1 1 1 1 1 0 ...
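
Summing that mask is a quick check of my own on how many partition starts we have so far:

   +/ (i. maxIndex) e. rawIndices
242

That's one short of the 243 fields - the final end position, 447, lands just past the last 0-based position (446), so it never marks a start.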

This almost works, but the start of the first record isn't marked - that would be index "0", which isn't present in the array we scraped. We can just prepend (,) a 0, but this is also a good time to try to make a tacit verb out of all of this:

   fmarkers=. i.@(>./) e. 0&,

That ends up being a "fork" of three verbs:

If you think of them as f, g, and h when they are applied to an array y, the result is: (f y) g (h y). If this is troubling to you, that is okay.
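
As a quick sanity check of my own, the tacit verb can be compared with the explicit phrase it abbreviates using -: ("match"):

   (fmarkers rawIndices) -: (i. >./rawIndices) e. 0 , rawIndices
1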

With our partitioning mask in hand we are ready to "cut" the string into discrete pieces of information. Here I'm using < ("box") to enclose each entry, whether it is empty or not. The rank modifier "1 applies the cut to each record (row) separately:

   data=. (fmarkers rawIndices) <;.1"1 rawData

   data
   ...
├────────────┼──┼──┼─┼─┼─┼─┼─┼─┼─┼────┼─┼────┼─┼────┼──┼──┼──┼──┼──┼──┼─┼─┼─┼─┼─┼─┼─┼──┼──┼─┼─┼──┼──┼─┼─┼──┼──┼─┼────┼────┼──┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼──┼─┼──┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼───┼─┼───┼─┼───┼─┼───┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼──┼─...
│          92│ 1│  │ │ │ │6│ │1│ │    │1│ 913│ │ 904│  │  │ 9│ 0│39│ 9│ │ │ │0│ │ │2│ 6│ 5│ │ │  │  │ │ │  │  │ │ 913│ 322│20│ │ │ │ │  │ │ │ │ │ │ │  │ │ │ │ │ │  │ │  │ │ │ │    │    │   │ │ │ │ │ │   │ │   │ │   │ │   │ │ │ │    │    │   │ │ │ │ │ │  │ ...
├────────────┼──┼──┼─┼─┼─┼─┼─┼─┼─┼────┼─┼────┼─┼────┼──┼──┼──┼──┼──┼──┼─┼─┼─┼─┼─┼─┼─┼──┼──┼─┼─┼──┼──┼─┼─┼──┼──┼─┼────┼────┼──┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼──┼─┼──┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼───┼─┼───┼─┼───┼─┼───┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼──┼─...
│          92│ 2│  │ │ │ │6│ │1│ │    │1│ 950│ │ 941│  │  │ 9│ 0│39│ 9│ │ │ │0│ │ │1│ 6│ 5│ │ │  │  │ │ │  │  │ │ 950│ 285│23│ │ │ │ │  │ │ │ │ │ │ │  │ │ │ │ │ │  │ │  │ │ │ │    │    │   │ │ │ │ │ │   │ │   │ │   │ │   │ │ │ │    │    │   │ │ │ │ │ │  │ ...
├────────────┼──┼──┼─┼─┼─┼─┼─┼─┼─┼────┼─┼────┼─┼────┼──┼──┼──┼──┼──┼──┼─┼─┼─┼─┼─┼─┼─┼──┼──┼─┼─┼──┼──┼─┼─┼──┼──┼─┼────┼────┼──┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼──┼─┼──┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼───┼─┼───┼─┼───┼─┼───┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼──┼─...
│          92│ 3│  │ │ │ │5│ │1│ │    │1│1069│ │1060│  │  │ 9│ 0│39│ 9│ │ │ │0│ │ │1│ 6│ 5│ │ │  │  │ │ │  │  │ │1069│ 166│36│ │ │ │ │  │ │ │ │ │ │ │  │ │ │ │ │ │  │ │  │5│1│ │    │    │   │ │ │ │5│ │   │ │   │ │   │ │   │ │ │ │    │    │   │ │ │ │ │ │  │ ...
├────────────┼──┼──┼─┼─┼─┼─┼─┼─┼─┼────┼─┼────┼─┼────┼──┼──┼──┼──┼──┼──┼─┼─┼─┼─┼─┼─┼─┼──┼──┼─┼─┼──┼──┼─┼─┼──┼──┼─┼────┼────┼──┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼──┼─┼──┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼───┼─┼───┼─┼───┼─┼───┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼──┼─...
│          95│ 1│  │ │ │ │3│ │ │ │1135│2│1135│1│1133│17│22│ 2│ 0│ 9│ 2│ │ │ │ │ │ │ │  │  │ │ │  │  │ │ │  │  │ │    │    │  │ │ │ │ │  │ │ │ │ │ │ │  │ │ │ │ │ │  │ │  │ │ │ │    │    │   │ │ │ │ │ │   │ │   │ │   │ │   │ │ │ │    │    │   │ │ │ │ │ │  │ ...
├────────────┼──┼──┼─┼─┼─┼─┼─┼─┼─┼────┼─┼────┼─┼────┼──┼──┼──┼──┼──┼──┼─┼─┼─┼─┼─┼─┼─┼──┼──┼─┼─┼──┼──┼─┼─┼──┼──┼─┼────┼────┼──┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼──┼─┼──┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼───┼─┼───┼─┼───┼─┼───┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼──┼─...
│          95│ 2│  │ │ │ │6│ │1│ │    │1│1146│ │1137│  │  │ 9│ 0│39│ 9│ │ │ │0│ │ │1│ 7│ 6│ │ │  │  │ │ │  │  │ │1146│  88│23│ │ │ │ │  │ │ │ │ │ │ │  │ │ │ │ │ │  │ │  │1│ │ │    │    │   │ │ │ │1│ │  2│3│  0│ │  4│1│  4│ │ │ │    │    │   │ │ │ │ │ │  │ ...
├────────────┼──┼──┼─┼─┼─┼─┼─┼─┼─┼────┼─┼────┼─┼────┼──┼──┼──┼──┼──┼──┼─┼─┼─┼─┼─┼─┼─┼──┼──┼─┼─┼──┼──┼─┼─┼──┼──┼─┼────┼────┼──┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼──┼─┼──┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼───┼─┼───┼─┼───┼─┼───┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼──┼─...
│         101│ 1│  │ │ │ │6│ │1│ │    │1│1192│ │1183│  │  │ 9│ 0│39│ 9│ │ │ │0│ │ │1│ 8│ 3│ │ │  │  │ │ │  │  │ │1192│  43│99│1│4│ │ │ 5│ │ │1│5│ │5│  │ │ │5│ │ │  │ │  │1│ │ │    │    │   │ │ │ │1│ │ 10│1│ 10│ │ 12│1│ 12│ │ │ │    │    │   │ │ │ │ │ │  │ ...
   ...
   
   $data
13593 243

And like that, we've got 13,593 rows of 243 columns.
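
As a small preview (and assuming, on my part, that a field's codebook end position uniquely identifies its column), individual fields can already be pulled out with dyadic i. ("index of") and {"1:

   NB. PRIORSMK ends at character 92
   priorsmk=. (rawIndices i. 92) {"1 data

   $priorsmk
13593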

Now I'm Ready to Get Started

I'm not unhappy with the result, but it is a little comical getting here and realizing I'm only just getting set up to do what I had intended in the first place. I do feel like I am picking up more of the language along the way, though.

The entire J program:

rawIndices=. 12 14 16 17 18 19 20 21 22 23 27 28 32 33 37 39 41 43 45 47 49 50 51 52 53 54 55 56 58 60 61 62 64 66 67 68 70 72 73 77 81 83 84 85 86 87 89 90 91 92 93 94 95 97 98 99 100 101 102 104 105 107 108 109 110 114 118 121 122 123 124 125 126 129 130 133 134 137 138 141 142 143 144 148 152 155 156 157 158 159 160 162 163 165 166 168 169 171 172 173 174 178 182 185 186 187 188 189 190 191 192 193 194 195 196 197 201 205 209 213 217 221 225 229 233 234 235 236 238 240 242 244 245 246 247 248 249 250 253 254 255 256 258 259 260 261 262 263 264 266 268 269 270 271 272 273 274 276 277 279 283 287 291 295 296 297 298 299 301 303 304 305 308 309 310 311 312 313 317 319 321 322 323 325 327 328 329 330 331 333 335 336 337 340 341 342 343 344 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 404 422 440 441 443 447
rawBytes=. freads '2002FemPreg.dat'
rawData=. >cutLF rawBytes

fmarkers=. i.@(>./) e. 0&,
boxedData=. (fmarkers rawIndices) <;.1"1 rawData

Somewhat disconcertingly, if you ignore the rawIndices array (since it is "data" more than "code"), the J program is very nearly the same length as the "one-liner" used to scrape the CDC website.

It should be noted that the boxed values are still non-numeric. I haven't yet figured out what an appropriate placeholder value is (I'm hoping the book touches on it), but a conversion step will be necessary. The following seems to work, with the __ shown being J's "negative infinity":

   numericData=. (__".>"1) boxedData

   numericData
  1 1 __ __ __ __ 6 __  1 __   __ 1 1093 __ 1084 __ __  9  0 39  9 __ __ __  0 __ __  1  8 13 __ __ __ __ __ __ __ __ __ 1093 138 37 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __  1 __ __ __   __  __ __ __ __  5 __  __ __ __ __  __ __ __ __ _...
  1 2 __ __ __ __ 6 __  1 __   __ 1 1166 __ 1157 __ __  9  0 39  9 __ __ __  0 __ __  2  7 14 __ __ __ __ __ __ __ __ __ 1166  65 42  1  1  2 __  2 __ __  0  5 __  1  4 __ __  5 __ __ __ __ __  5  1 __ __   __  __ __ __ __  1 __   4  1  4 __  20  1 20 __ _...
  2 1 __ __ __ __ 5 __  3  5   __ 1 1156 __ 1147 __ __  0 39 39  9 __ __ __  0 __ __  1  9  2 __  2  2  0 __  1  1  4 __ 1156  75 24 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __  5  1 __ __   __  __ __ __ __  5 __  __ __ __ __  __ __ __  5  ...
  2 2 __ __ __ __ 6 __  1 __   __ 1 1198 __ 1189 __ __  0 39 39  9 __ __ __  0 __ __  2  7  0 __ __ __ __ __ __ __ __ __ 1198  33 25  1  3 __ __  3 __ __  0  5 __  1  4 __ __  5 __ __ __ __ __  5  5  1 __ 1205   7  2 __  1  5 __  __ __ __ __  __ __ __ __ _...
  2 3 __ __ __ __ 6 __  1 __   __ 1 1204 __ 1195 __ __  0 39 39  9 __ __ __  0 __ __  2  6  3 __ __ __ __ __ __ __ __ __ 1204  27 25  1  3 __ __  2 __ __  0  5 __  1  4 __ __  1  5  2 __ __ __  5  5  1 __ 1221  17  2 __  1  5 __  __ __ __ __  __ __ __ __ _...
  6 1 __ __ __ __ 6 __  1 __   __ 1 1074 __ 1065 __ __  0 38 38  9 __ __ __  0 __ __  1  8  9 __ __ __ __ __ __ __ __ __ 1074 157 27 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __  1 __ __ __   __  __ __ __ __  1 __   1  1  1 __   2  1  2 __ _...
  6 2 __ __ __ __ 6 __  1 __   __ 1 1096 __ 1087 __ __  0 40 40  9 __ __ __  0 __ __  2  9  9 __ __ __ __ __ __ __ __ __ 1096 135 29 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __  1 __ __ __   __  __ __ __ __  1 __   2  1  2 __   4  1  4 __ _...
  6 3 __ __ __ __ 6 __  1 __   __ 1 1112 __ 1102 __ __  0 42 42 10 __ __ __  0 __ __  2  8  6 __ __ __ __ __ __ __ __ __ 1112 119 30 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __  1 __ __ __   __  __ __ __ __  1 __   2  1  2 __   8  1  8 __ _...
  7 1 __ __ __ __ 5 __  1 __   __ 1 1095 __ 1086 __ __  9  0 39  9 __ __ __  0 __ __  1  7  9 __ __ __ __ __ __ __ __ __ 1095 138 31 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __  1 __ __ __   __  __ __ __ __  5 __  __ __ __ __  __ __ __ __ _...
  ...
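
For what it's worth, the heavy lifting in that conversion is dyadic ". ("numbers"): the left argument supplies the value used for anything that doesn't parse as a number. A tiny illustration of my own:

   __ ". '12 34 x 5'
12 34 __ 5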