Hot on the heels of my last post about a small project in J, I thought it might be interesting to follow along with Allen Downey's Think Stats, which was written to use a number of the author's own libraries as well as NumPy and matplotlib. Almost immediately I've found a bit of a side quest in just getting set up to follow along.
Chapter one deals with getting set up with the necessary libraries and data. The dataset in use is the National Survey of Family Growth from the CDC.
Provided with the book are libraries to parse and filter the data automatically, along with some descriptive text on the recoding done during collection and post-processing. The data itself is an enormous text file with hundreds of columns in a fixed-width report. There are no delimiters or headers to indicate which field is which, or even where fields begin and end.
An example of the data, here the first 6 lines of a file that is 13,593 lines long:
1 1 6 1 11093 1084 9 039 9 0 1 813 1093 13837 1 5 116610931166 9201093 111 3 1 12 11 5391 110933316108432411211 2995 1212 69544441116122222 2 2224693215 000000000000000000000000000000000000003410.38939935294273869.3496019830486 6448.2711117047512 91231
1 2 6 1 11166 1157 9 039 9 0 2 714 1166 6542112 2 05 1 4 5 51 1 41 4 201 20 1166109311661093116611661231 111 3 1 14 11 5391 211663925115738501211 2 432 8701414 69544441116122222 2 2224693215 000000000000000000000000000000000000003410.38939935294273869.3496019830486 6448.2711117047512 91231
2 1 5 35 11156 1147 03939 9 0 1 9 2 2 2 0 1 1 4 1156 7524 51 5 5551156 0 5551156 0 12041156120411531156 5 55 4 5 5511 5391 111561433114713585165 2995 5555 98420205611 71232 3 5321002315 00000000000000000000000000000000000000 7226.301739584323 8567.54910985691812999.5422643859022121231
2 2 6 1 11198 1189 03939 9 0 2 7 0 1198 332513 3 05 1 4 5 551 1205 72 15 12041156120411561198 4 5 4 3115511 2 32 1391 211981783118917085165 3 44299505353 98420205611 71232 3 5321002315 00000000000000000000000000000000000000 7226.301739584323 8567.54910985691812999.5422643859022121231
2 3 6 1 11204 1195 03939 9 0 2 6 3 1204 272513 2 05 1 4 152 551 1221 172 15 1204115612041198120412041231 4 5 4 55 11 4 42 1391 312041833119517585165 2 44299535555 98420205611 71232 3 5321002315 00000000000000000000000000000000000000 7226.301739584323 8567.54910985691812999.5422643859022121231
6 1 6 1 11074 1065 03838 9 0 1 8 9 1074 15727 1 1 11 1 21 2 11121074111210321074 15 1 1 2 1 11 11 5381 110742700106526251211 2 9 2323 75040401118122222 3 3222167315 00000000000000000000000000000000000000 4870.926434970875 5325.19699923372 8874.4407992229951231231
To my knowledge, J does not include a facility for reading fixed-width formats into an array so I figured I'd do it myself.
The canonical reference for the CDC data is a "code book" available online as a single huge HTML page where each field garners a description, width and type (raw vs. computed). For example:
PRIORSMK(92-92)
Variable Type : Raw
BE-3 : Please look at Card 17. In the 6 months before you found out you were pregnant this (nth) time, how many cigarettes did you smoke a day, on the average?
value | label | Total
---|---|---
. | INAPPLICABLE | 9581
0 | NONE | 2994
1 | ABOUT ONE CIGARETTE A DAY OR LESS | 86
2 | JUST A FEW CIGARETTES A DAY (2-4) | 178
3 | ABOUT HALF A PACK A DAY (5-14) | 382
4 | ABOUT A PACK A DAY (15-24) | 296
5 | ABOUT 1 1/2 PACKS A DAY (25-34) | 49
6 | ABOUT 2 PACKS A DAY (35-44) | 24
7 | MORE THAN 2 PACKS A DAY (45 OR MORE) | 2
8 | REFUSED | 1
Total | | 13593
Universe : Applicable if this is a completed pregnancy ending in January 1997 or later (cmbabdob or cmotpreg ge cmjan97) and it did not end in induced abortion (BC-1 PREGEND1 and PREGEND2 did not include code 3)
In the above case, the only piece of information I need to parse the fixed-width file is in the first row, PRIORSMK(92-92), and really it is only the second number I'm interested in. In this particular case the field is one character wide, hence from character 92 to character 92. Other fields can be wider.
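To make the slice arithmetic concrete: the codebook gives 1-based, inclusive column spans, so a field labeled (start-end) corresponds to the 0-based, half-open slice [start-1:end]. A small Python sketch, with record contents invented purely for illustration:

```python
# A fake 447-character record with a '3' planted in column 92 (1-based)
record = "X" * 91 + "3" + "X" * 355

# PRIORSMK(92-92): 1-based inclusive span from the codebook
start, end = 92, 92

# Convert to a 0-based half-open slice
field = record[start - 1:end]
# field == "3", a one-character field
```

Because the fields abut one another with no gaps, the end of one span is enough to recover the start of the next, which is why only the second number of each span matters here.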
In a small stroke of good luck, the HTML is very regular and is pretty easy to scrape directly from the browser's developer tools. In this case each heading is contained within a <b> (bold) tag inside of a <p> (paragraph) tag. These field definitions are the only instance of that particular pattern, so we can just scoop them all up without filtering using JavaScript:
>> document.querySelectorAll('p>b')
NodeList(243) [ b, b, b, b, b, b, b, b, b, b, … ]
Even better would be to strip the extraneous text along the way - easy enough with a regular expression:
>> "PRIORSMK(92-92)".replace(/.*\d+-(\d+).*/, (s1, s2)=>s2)
92
Finally, because I'm dumping this all into J I may as well collect it in a format that is a valid J array with this gross one-liner:
>> Array.from(
document.querySelectorAll('p>b')
).map(i => i.innerText.replace(
/.*\d+-(\d+).*/, (s1, s2)=>s2)
).join(" ")
"12 14 16 17 18 19 20 21 22 23 27 28 32 33 37 39 41 43 45 47 49 50 51 52 53 54 55 56 58 60 61 62 64 66 67 68 70 72 73 77 81 83 84 85 86 87 89 90 91 92 93 94 95 97 98 99 100 101 102 104 105 107 108 109 110 114 118 121 122 123 124 125 126 129 130 133 134 137 138 141 142 143 144 148 152 155 156 157 158 159 160 162 163 165 166 168 169 171 172 173 174 178 182 185 186 187 188 189 190 191 192 193 194 195 196 197 201 205 209 213 217 221 225 229 233 234 235 236 238 240 242 244 245 246 247 248 249 250 253 254 255 256 258 259 260 261 262 263 264 266 268 269 270 271 272 273 274 276 277 279 283 287 291 295 296 297 298 299 301 303 304 305 308 309 310 311 312 313 317 319 321 322 323 325 327 328 329 330 331 333 335 336 337 340 341 342 343 344 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 404 422 440 441 443 447"
With all of the necessary field length information in hand, it's time to turn to the J console, where I'd like to ingest both it and the data file and end up with a navigable set of data. First, I assign the big array I scraped and check just how many fields there are:
rawIndices=. 12 14 16 17 18 19 20 21 22 23 27 28 32 33 37 39 41 43 45 47 49 50 51 52 53 54 55 56 58 60 61 62 64 66 67 68 70 72 73 77 81 83 84 85 86 87 89 90 91 92 93 94 95 97 98 99 100 101 102 104 105 107 108 109 110 114 118 121 122 123 124 125 126 129 130 133 134 137 138 141 142 143 144 148 152 155 156 157 158 159 160 162 163 165 166 168 169 171 172 173 174 178 182 185 186 187 188 189 190 191 192 193 194 195 196 197 201 205 209 213 217 221 225 229 233 234 235 236 238 240 242 244 245 246 247 248 249 250 253 254 255 256 258 259 260 261 262 263 264 266 268 269 270 271 272 273 274 276 277 279 283 287 291 295 296 297 298 299 301 303 304 305 308 309 310 311 312 313 317 319 321 322 323 325 327 328 329 330 331 333 335 336 337 340 341 342 343 344 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 404 422 440 441 443 447
#rawIndices
243
Having a brief look at the raw data shows:
rawBytes=. freads '2002FemPreg.dat'
$rawBytes
6089664
rawData=. >cutLF rawBytes
$rawData
13593 447
It's just 5.8 megabytes of text data with no real structure. The fixed-width format means things are line-oriented, so I use cutLF (cut line feed) to split out the lines into a 13,593-row table of a single textual column of 447 characters.
J provides a helpful conjunction ;. ("cut") to partition an array. The J Primer discusses it some, but largely in the context of cutting along a sort of sentinel value. There is this, though:
Sometimes the partition information is separate from the data. Instead of frets in the data, the partition information can be provided in a left argument to the derived verb. The partition information is boolean data where a 1 indicates the start (with 1 or _1) or end (with 2 or _2) of the partitions.
With that in mind, the goal becomes turning an array of indices into a binary array of partitioning points. To begin, I'll generate all numbers up to the maximum index value:
someIndex=. 10
i. someIndex
0 1 2 3 4 5 6 7 8 9
maxIndex=. >./rawIndices
maxIndex
447
With a generated series of numbers up to 447, we want to check whether each position appears among the indices; if it doesn't, it falls between fields and is thus not a partition point. For this I'll use e. ("member in") to return a boolean array marking which of those numbers are present in rawIndices:
(i. maxIndex) e. rawIndices
0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 0 1 1 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 0 0 1 0 0 0 1 0 0 1 1 1 1 1 1 0 ...
This almost works, but the first index isn't the start of the first record; that would be index 0, which is not present in the array we scraped. We can just prepend (,) a 0, but this is also a good time to try and make a tacit verb from all of this:
fmarkers=. i.@(>./) e. 0&,
That ends up being a "fork" of three verbs:
i.@(>./)
e.
0&,
If you think of them as f, g, and h when they are applied to an array y, the result is: (f y) g (h y). If this is troubling to you, that is okay.
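For comparison, here is a rough Python analog of what fmarkers computes; the function name and the toy indices are mine, not from the book:

```python
def fmarkers(indices):
    """Mimic the J verb  fmarkers=. i.@(>./) e. 0&,
    Mark position 0 and every 1-based field end as a partition start."""
    limit = max(indices)                # i.@(>./)  generates 0 .. limit-1
    members = set([0] + list(indices))  # 0&,  prepends the missing first start
    return [1 if i in members else 0 for i in range(limit)]

mask = fmarkers([12, 14, 16])
# 1s land at positions 0, 12 and 14; the final index (16) falls
# outside range(16), just as 447 falls outside i. 447 in the J version
```

Note that the last index marks the end of the record rather than the start of a new field, so dropping it is exactly what we want.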
With our partitioning mask at hand we are ready to "cut" the string into discrete pieces of information. Here I'm using < ("box") to enclose each entry regardless of whether it is empty or not. The rank modifier "1 serves to apply the cut to each record:
data=. (fmarkers rawIndices) <;.1"1 rawData
data
...
├────────────┼──┼──┼─┼─┼─┼─┼─┼─┼─┼────┼─┼────┼─┼────┼──┼──┼──┼──┼──┼──┼─┼─┼─┼─┼─┼─┼─┼──┼──┼─┼─┼──┼──┼─┼─┼──┼──┼─┼────┼────┼──┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼──┼─┼──┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼───┼─┼───┼─┼───┼─┼───┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼──┼─...
│ 92│ 1│ │ │ │ │6│ │1│ │ │1│ 913│ │ 904│ │ │ 9│ 0│39│ 9│ │ │ │0│ │ │2│ 6│ 5│ │ │ │ │ │ │ │ │ │ 913│ 322│20│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ ...
├────────────┼──┼──┼─┼─┼─┼─┼─┼─┼─┼────┼─┼────┼─┼────┼──┼──┼──┼──┼──┼──┼─┼─┼─┼─┼─┼─┼─┼──┼──┼─┼─┼──┼──┼─┼─┼──┼──┼─┼────┼────┼──┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼──┼─┼──┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼───┼─┼───┼─┼───┼─┼───┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼──┼─...
│ 92│ 2│ │ │ │ │6│ │1│ │ │1│ 950│ │ 941│ │ │ 9│ 0│39│ 9│ │ │ │0│ │ │1│ 6│ 5│ │ │ │ │ │ │ │ │ │ 950│ 285│23│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ ...
├────────────┼──┼──┼─┼─┼─┼─┼─┼─┼─┼────┼─┼────┼─┼────┼──┼──┼──┼──┼──┼──┼─┼─┼─┼─┼─┼─┼─┼──┼──┼─┼─┼──┼──┼─┼─┼──┼──┼─┼────┼────┼──┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼──┼─┼──┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼───┼─┼───┼─┼───┼─┼───┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼──┼─...
│ 92│ 3│ │ │ │ │5│ │1│ │ │1│1069│ │1060│ │ │ 9│ 0│39│ 9│ │ │ │0│ │ │1│ 6│ 5│ │ │ │ │ │ │ │ │ │1069│ 166│36│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │5│1│ │ │ │ │ │ │ │5│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ ...
├────────────┼──┼──┼─┼─┼─┼─┼─┼─┼─┼────┼─┼────┼─┼────┼──┼──┼──┼──┼──┼──┼─┼─┼─┼─┼─┼─┼─┼──┼──┼─┼─┼──┼──┼─┼─┼──┼──┼─┼────┼────┼──┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼──┼─┼──┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼───┼─┼───┼─┼───┼─┼───┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼──┼─...
│ 95│ 1│ │ │ │ │3│ │ │ │1135│2│1135│1│1133│17│22│ 2│ 0│ 9│ 2│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ ...
├────────────┼──┼──┼─┼─┼─┼─┼─┼─┼─┼────┼─┼────┼─┼────┼──┼──┼──┼──┼──┼──┼─┼─┼─┼─┼─┼─┼─┼──┼──┼─┼─┼──┼──┼─┼─┼──┼──┼─┼────┼────┼──┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼──┼─┼──┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼───┼─┼───┼─┼───┼─┼───┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼──┼─...
│ 95│ 2│ │ │ │ │6│ │1│ │ │1│1146│ │1137│ │ │ 9│ 0│39│ 9│ │ │ │0│ │ │1│ 7│ 6│ │ │ │ │ │ │ │ │ │1146│ 88│23│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │1│ │ │ │ │ │ │ │ │1│ │ 2│3│ 0│ │ 4│1│ 4│ │ │ │ │ │ │ │ │ │ │ │ │ ...
├────────────┼──┼──┼─┼─┼─┼─┼─┼─┼─┼────┼─┼────┼─┼────┼──┼──┼──┼──┼──┼──┼─┼─┼─┼─┼─┼─┼─┼──┼──┼─┼─┼──┼──┼─┼─┼──┼──┼─┼────┼────┼──┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼─┼──┼─┼─┼─┼─┼─┼──┼─┼──┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼───┼─┼───┼─┼───┼─┼───┼─┼─┼─┼────┼────┼───┼─┼─┼─┼─┼─┼──┼─...
│ 101│ 1│ │ │ │ │6│ │1│ │ │1│1192│ │1183│ │ │ 9│ 0│39│ 9│ │ │ │0│ │ │1│ 8│ 3│ │ │ │ │ │ │ │ │ │1192│ 43│99│1│4│ │ │ 5│ │ │1│5│ │5│ │ │ │5│ │ │ │ │ │1│ │ │ │ │ │ │ │ │1│ │ 10│1│ 10│ │ 12│1│ 12│ │ │ │ │ │ │ │ │ │ │ │ │ ...
...
$data
13593 243
And like that, we've got 13,593 rows of 243 columns.
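The cut itself has a straightforward Python analog: slice each line wherever the mask holds a 1. The helper below is my own sketch, not part of the book's code:

```python
def cut_fields(line, mask):
    """Split line at every position where mask is 1; each 1 opens a new
    partition that runs to the next 1 (or the end), like J's <;.1"""
    starts = [i for i, m in enumerate(mask) if m == 1]
    ends = starts[1:] + [len(line)]
    return [line[a:b] for a, b in zip(starts, ends)]

mask = [1, 0, 0, 0, 1, 0, 0, 0]
parts = cut_fields("20021093", mask)
# parts == ['2002', '1093']
```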
I'm not unhappy with the result, but it is a little comical getting here and realizing I'm only just getting set up to do what I had intended in the first place. I do feel like I'm picking up more of the language along the way, though.
The entire J program:
rawIndices=. 12 14 16 17 18 19 20 21 22 23 27 28 32 33 37 39 41 43 45 47 49 50 51 52 53 54 55 56 58 60 61 62 64 66 67 68 70 72 73 77 81 83 84 85 86 87 89 90 91 92 93 94 95 97 98 99 100 101 102 104 105 107 108 109 110 114 118 121 122 123 124 125 126 129 130 133 134 137 138 141 142 143 144 148 152 155 156 157 158 159 160 162 163 165 166 168 169 171 172 173 174 178 182 185 186 187 188 189 190 191 192 193 194 195 196 197 201 205 209 213 217 221 225 229 233 234 235 236 238 240 242 244 245 246 247 248 249 250 253 254 255 256 258 259 260 261 262 263 264 266 268 269 270 271 272 273 274 276 277 279 283 287 291 295 296 297 298 299 301 303 304 305 308 309 310 311 312 313 317 319 321 322 323 325 327 328 329 330 331 333 335 336 337 340 341 342 343 344 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 404 422 440 441 443 447
rawBytes=. freads '2002FemPreg.dat'
rawData=. >cutLF rawBytes
fmarkers=. i.@(>./) e. 0&,
boxedData=. (fmarkers rawIndices) <;.1"1 rawData
Somewhat disconcertingly, if you ignore the rawIndices array (since it is "data" more than "code"), the J program is very nearly the same length as the "one-liner" used to scrape the CDC website.
It should be noted that the boxed values are still non-numeric. I haven't yet figured out what an appropriate placeholder value is (I'm hoping the book touches on it) but there will be a necessary step to convert. The following seems to work, with the __ shown being "negative infinity":
numericData=. (__".>"1) boxedData
numericData
1 1 __ __ __ __ 6 __ 1 __ __ 1 1093 __ 1084 __ __ 9 0 39 9 __ __ __ 0 __ __ 1 8 13 __ __ __ __ __ __ __ __ __ 1093 138 37 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ 1 __ __ __ __ __ __ __ __ 5 __ __ __ __ __ __ __ __ __ _...
1 2 __ __ __ __ 6 __ 1 __ __ 1 1166 __ 1157 __ __ 9 0 39 9 __ __ __ 0 __ __ 2 7 14 __ __ __ __ __ __ __ __ __ 1166 65 42 1 1 2 __ 2 __ __ 0 5 __ 1 4 __ __ 5 __ __ __ __ __ 5 1 __ __ __ __ __ __ __ 1 __ 4 1 4 __ 20 1 20 __ _...
2 1 __ __ __ __ 5 __ 3 5 __ 1 1156 __ 1147 __ __ 0 39 39 9 __ __ __ 0 __ __ 1 9 2 __ 2 2 0 __ 1 1 4 __ 1156 75 24 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ 5 1 __ __ __ __ __ __ __ 5 __ __ __ __ __ __ __ __ 5 ...
2 2 __ __ __ __ 6 __ 1 __ __ 1 1198 __ 1189 __ __ 0 39 39 9 __ __ __ 0 __ __ 2 7 0 __ __ __ __ __ __ __ __ __ 1198 33 25 1 3 __ __ 3 __ __ 0 5 __ 1 4 __ __ 5 __ __ __ __ __ 5 5 1 __ 1205 7 2 __ 1 5 __ __ __ __ __ __ __ __ __ _...
2 3 __ __ __ __ 6 __ 1 __ __ 1 1204 __ 1195 __ __ 0 39 39 9 __ __ __ 0 __ __ 2 6 3 __ __ __ __ __ __ __ __ __ 1204 27 25 1 3 __ __ 2 __ __ 0 5 __ 1 4 __ __ 1 5 2 __ __ __ 5 5 1 __ 1221 17 2 __ 1 5 __ __ __ __ __ __ __ __ __ _...
6 1 __ __ __ __ 6 __ 1 __ __ 1 1074 __ 1065 __ __ 0 38 38 9 __ __ __ 0 __ __ 1 8 9 __ __ __ __ __ __ __ __ __ 1074 157 27 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ 1 __ __ __ __ __ __ __ __ 1 __ 1 1 1 __ 2 1 2 __ _...
6 2 __ __ __ __ 6 __ 1 __ __ 1 1096 __ 1087 __ __ 0 40 40 9 __ __ __ 0 __ __ 2 9 9 __ __ __ __ __ __ __ __ __ 1096 135 29 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ 1 __ __ __ __ __ __ __ __ 1 __ 2 1 2 __ 4 1 4 __ _...
6 3 __ __ __ __ 6 __ 1 __ __ 1 1112 __ 1102 __ __ 0 42 42 10 __ __ __ 0 __ __ 2 8 6 __ __ __ __ __ __ __ __ __ 1112 119 30 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ 1 __ __ __ __ __ __ __ __ 1 __ 2 1 2 __ 8 1 8 __ _...
7 1 __ __ __ __ 5 __ 1 __ __ 1 1095 __ 1086 __ __ 9 0 39 9 __ __ __ 0 __ __ 1 7 9 __ __ __ __ __ __ __ __ __ 1095 138 31 __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ 1 __ __ __ __ __ __ __ __ 5 __ __ __ __ __ __ __ __ __ _...
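In Python terms, the conversion step might look like the sketch below, using negative infinity as the fill for blank fields the way the __ fill works in J (the helper name is my own invention):

```python
def to_number(field, fill=float("-inf")):
    """Convert one fixed-width field to a number, falling back to the
    fill value for blank (all-space) or unparseable fields."""
    field = field.strip()
    if not field:
        return fill
    try:
        return float(field)
    except ValueError:
        return fill

nums = [to_number(f) for f in ["  92", "    ", "1093"]]
# nums == [92.0, -inf, 1093.0]
```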
...