Profiling Python

2016-12-07

I've always known about profiling in Python, but never bothered to dig too deeply into it. My programs are typically fast enough and it would be more work to speed them up. I've decided to try and change that though and dip my toes into the world of profiling and see if I can't divine some architectural insights into a project of my own - Static.

What is Static?

Quick recap, Static is a my own static site generator, used to generate this site from a collection of Markdown files and some very basic HTML templates. The whole thing is available to browse on GitHub. High-level: it uses Mistune and Jinja and doesn't do anything fancy. This means it recreates the entire site from scratch every time it is run, reading .md files, parsing, generating HTML and then templating - progressively more every time I write another post.

All of this means there is a foreseeable future in which things become unbearably slow because I've simply written too much and the generator is too naive. That's not yet the case, this site is regenerated in about ⁵⁄_10ths of a second. But it could happen!

How Do You Profile Python?

The basics of profiling seem pretty well covered between two projects that I looked at, cProfile and line_profiler and it would seem the Python community has more or less standardized between these two. Fundamentally, they watch a program run and then either breakdown or aggregate the associated execution times for each function or line of code. I began with cProfile because it is built into the standard library and seemed approachable.

cProfile

The profiler is only any good if you actually exercise the various code paths through the program, I debated and fiddled with running against my test suite originally, but the run times are less impactful there due to the limited body of data being worked with. Instead, I've opted to simply run the profiler against my actual blog instance with all of the post data that entails.

The command necessary to invoke cProfile is as follows:

$ python -m cProfile -s tottime ../static/static.py 
         511796 function calls (488533 primitive calls) in 0.557 seconds


|    ncalls | tottime | percall | cumtime | percall | filename:lineno(function)                      |
|-----------+---------+---------+---------+---------+------------------------------------------------|
|       275 |   0.154 |   0.001 |   0.155 |   0.001 | {built-in method io.open}                      |
|     57279 |   0.050 |   0.000 |   0.050 |   0.000 | {method 'match' of '_sre.SRE_Pattern' objects} |
|        78 |   0.017 |   0.000 |   0.017 |   0.000 | {built-in method builtins.compile}             |
| 2666/2423 |   0.015 |   0.000 |   0.054 |   0.000 | mistune.py:550(manipulate)                     |
|     79503 |   0.014 |   0.000 |   0.014 |   0.000 | {built-in method builtins.getattr}             |
|     15865 |   0.012 |   0.000 |   0.026 |   0.000 | nodes.py:164(iter_child_nodes)                 |
|       201 |   0.010 |   0.000 |   0.011 |   0.000 | sre_compile.py:248(_optimize_charset)          |
|      3288 |   0.009 |   0.000 |   0.023 |   0.000 | lexer.py:594(tokeniter)                        |
|     51080 |   0.009 |   0.000 |   0.009 |   0.000 | {built-in method builtins.isinstance}          |
|     26671 |   0.008 |   0.000 |   0.011 |   0.000 | nodes.py:148(iter_fields)                      |
|        57 |   0.007 |   0.000 |   0.007 |   0.000 | {built-in method marshal.loads}                |
|      3552 |   0.007 |   0.000 |   0.007 |   0.000 | {method 'sub' of '_sre.SRE_Pattern' objects}   |
|  6435/154 |   0.007 |   0.000 |   0.078 |   0.001 | visitor.py:34(visit)                           |
| 1534/1076 |   0.006 |   0.000 |   0.034 |   0.000 | mistune.py:216(manipulate)                     |

Where, above, I've sorted on total time taken. Reading the reported results is a bit of a struggle but if you squint you can see at least one problem is hinted at. The single most expensive operation is in file I/O. I might have guessed as much, in this kind of program, but I've no context on whether that number is too high.

The same profile, sorted on cumulative time results in the following (abbreviated):


| ncalls | tottime | percall | cumtime | percall | filename:lineno(function)          |
|--------+---------+---------+---------+---------+------------------------------------|
|  151/1 |   0.006 |   0.000 |   0.577 |   0.577 | {built-in method builtins.exec}    |
|      1 |   0.000 |   0.000 |   0.577 |   0.577 | static.py:18(<module>)             |
|      1 |   0.001 |   0.001 |   0.388 |   0.388 | static.py:106(create_posts)        |
|     77 |   0.000 |   0.000 |   0.196 |   0.003 | environment.py:791(get_template)   |
|     77 |   0.001 |   0.000 |   0.195 |   0.003 | environment.py:768(_load_template) |
|     37 |   0.000 |   0.000 |   0.186 |   0.005 | static.py:43(templatize_post)      |
|     77 |   0.000 |   0.000 |   0.184 |   0.002 | loaders.py:99(load)                |
|     77 |   0.001 |   0.000 |   0.173 |   0.002 | environment.py:525(compile)        |
|    275 |   0.161 |   0.001 |   0.162 |   0.001 | {built-in method io.open}          |
|     37 |   0.000 |   0.000 |   0.114 |   0.003 | mistune.py:971(__call__)           |
|     37 |   0.000 |   0.000 |   0.114 |   0.003 | mistune.py:981(parse)              |
|     37 |   0.001 |   0.000 |   0.109 |   0.003 | mistune.py:1023(output)            |

The cumulative measure means the various parts are grouped so as to try and highlight which portions of the code are the hottest. In my case it's a bit lost because there is effectively a single code path at the top level (create_posts). The interesting portion is how many times get_template is being called and how relatively expensive it is (the same for _load_template, the both of which comprise most of templatize_post). I am, at this point, suspicious of my file access patterns because the profile seems to indicate 37 calls to template a post, but 77 calls to get the template, and 275 calls to open a file. For reference, at the time of profiling, I have 37 posts saved as Markdown files.

Line by Line Profiler

At this point, I think a diversion into a line-by-line profiler may be interesting. Rather than diving into my code to spot the mistakes based on cProfile alone, I thought I would assess the other tool in my profiling arsenal. Kernprof is a third-party package that aims to demystify profiling, and I think the results are pretty good. Despite the fact that I couldn't coerce pip into managing a download of line_profiler I pulled the source off of GitHub and installed the one missing dependency (Cython) in a few minutes without issue.

Quoting the project README here:

kernprof will create an instance of LineProfiler and insert it into the __builtins__ namespace with the name profile. It has been written to be used as a decorator, so in your script, you decorate the functions you want to profile with @profile.

I wrapped my "top-level" function seen above (create_posts) with the @profile decorator and launched kernprof with the following:

$ kernprof -l -v ../static/static.py
Wrote profile results to static.py.lprof
Timer unit: 1e-06 s

Total time: 0.560446 s
File: ../static/static.py
Function: create_posts at line 105

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   105                                               @profile
   106                                               def create_posts(self):
   107         1            5      5.0      0.0          posts_dir = self.config['posts_dir']
   108         1            1      1.0      0.0          output_dir = self.config['output_dir']
   109
   110        38         1074     28.3      0.2          for directory, filename in self.collect_posts(posts_dir):
   111        37          440     11.9      0.1              post_file = os.path.join(directory, filename)
   112        37         6604    178.5      1.2              raw_text = read_file(post_file)
   113        37          476     12.9      0.1              header_string, body = self.split_post(raw_text)
   114        37         8421    227.6      1.5              post = self.parse_post_parts(header_string, body)
   115        37       156343   4225.5     27.9              post['body'] = self.markdown(post['body'])
   116        37       306538   8284.8     54.7              templated_html = self.templatize_post(post)
   117        37          558     15.1      0.1              relative_dir = re.sub(posts_dir, '', directory, count=1)
   118
   119                                                       # in case I want to directly specify the generated URI/filename (as
   120                                                       # in the case of an index) without having to title it "index"
   121        37           33      0.9      0.0              if 'altname' in post:
   122         2            1      0.5      0.0                  _filename = post['altname']
   123                                                       else:
   124        35         1814     51.8      0.3                  _filename = '{}.html'.format(slugify(post['title']))
   125
   126        37          285      7.7      0.1              post['path'] = os.path.join(relative_dir, _filename)
   127        37          243      6.6      0.0              full_path = os.path.join(output_dir, relative_dir, _filename)
   128        37        77410   2092.2     13.8              write_output_file(templated_html, full_path)
   129
   130        37           86      2.3      0.0              if 'date' in post:
   131        36           72      2.0      0.0                  self.all_posts.append(post)
   132
   133         1           42     42.0      0.0          self.all_posts = self.sort_posts_by_date(self.all_posts)

Which, coming from cProfile as viewed above, is almost magical. Somewhat comfortingly, most of the body of create_posts is called 37 times, once for each post being created. This belies a problem hinted at in the output from cProfile above, but I am at least not doing anything totally asinine.

I will admit some surprise to find that templating the post body is almost exactly 2x the time taken to parse the Markdown of the post body. Mistune is really spot on when making claims about the speed of performance.

In case you haven't spotted my mistake, and one that I am apparently paying quite dearly for - for each of the 37 posts, I am reloading the posts template which never changes. Obviously this should be loaded once within the class instead.

Applying the following patch:

diff --git a/static.py b/static.py
index f540e48..6480f21 100755
--- a/static.py
+++ b/static.py
@@ -36,0 +37,2 @@ class StaticGenerator():
+        self._env = Environment(loader=FileSystemLoader(self.config['templates_dir']))
+        self._post_template = self._env.get_template('post.html')
@@ -43,4 +45,2 @@ class StaticGenerator():
-    def templatize_post(self, post_obj, template_name='post.html'):
-        env = Environment(loader=FileSystemLoader(self.config['templates_dir']))
-        template = env.get_template(template_name)
-        return template.render(post=post_obj)
+    def templatize_post(self, post_obj):
+        return self._post_template.render(post=post_obj)

Results in the following improvement.

$ kernprof -l -v ../static/static.py
Wrote profile results to static.py.lprof
Timer unit: 1e-06 s

Total time: 0.257144 s
File: ../static/static.py
Function: create_posts at line 105

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   115        37       155833   4211.7     60.6              post['body'] = self.markdown(post['body'])
   116        37        11963    323.3      4.7              templated_html = self.templatize_post(post)

$ python -m cProfile -s cumulative ../static/static.py 
         287730 function calls (280991 primitive calls) in 0.411 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     79/1    0.007    0.000    0.411    0.411 {built-in method builtins.exec}
        1    0.000    0.000    0.411    0.411 static.py:18(<module>)
        1    0.001    0.001    0.214    0.214 static.py:105(create_posts)
      167    0.147    0.001    0.147    0.001 {built-in method io.open}
       37    0.000    0.000    0.121    0.003 mistune.py:971(__call__)
       37    0.000    0.000    0.121    0.003 mistune.py:981(parse)
       37    0.001    0.000    0.117    0.003 mistune.py:1023(output)
    76/10    0.000    0.000    0.090    0.009 <frozen importlib._bootstrap>:966(_find_and_load)
    76/10    0.000    0.000    0.090    0.009 <frozen importlib._bootstrap>:939(_find_and_load_unlocked)
    73/10    0.000    0.000    0.088    0.009 <frozen importlib._bootstrap>:659(_load_unlocked)
    57/10    0.000    0.000    0.088    0.009 <frozen importlib._bootstrap_external>:656(exec_module)
    94/10    0.000    0.000    0.085    0.008 <frozen importlib._bootstrap>:214(_call_with_frames_removed)
        1    0.000    0.000    0.078    0.078 static.py:60(copy_media)
...

Thoughts

It is easy to imagine this process scaling up with more "real world" projects along similar lines to my experience here. I am now left eyeing other potential areas for improvement based on the above information, like how 35% of the time is spent in file IO. I'll be weighing the potential benefits of adding a file cache to prevent regeneration of unchanged posts along with the complexity overhead that would necessarily accompany such a solution. For now, I am just really pleased I managed to claw back a 25% performance improvement in an evening.