I've always known about profiling in Python, but never bothered to dig too deeply into it. My programs are typically fast enough, and speeding them up would usually be more work than it's worth. I've decided to change that, though, and dip my toes into the world of profiling to see if I can't divine some architectural insights into a project of my own: Static.
Quick recap: Static is my own static site generator, used to generate this site from a collection of Markdown files and some very basic HTML templates. The whole thing is available to browse on GitHub. High-level: it uses Mistune and Jinja and doesn't do anything fancy. This means it recreates the entire site from scratch every time it is run, reading `.md` files, parsing, generating HTML, and then templating, progressively more every time I write another post.
All of this means there is a foreseeable future in which things become unbearably slow because I've simply written too much and the generator is too naive. That's not yet the case; this site is regenerated in about half a second. But it could happen!
The basics of profiling seem pretty well covered between the two projects I looked at, cProfile and line_profiler, and the Python community appears to have more or less standardized on them. Fundamentally, they watch a program run and then break down or aggregate the execution time spent in each function or line of code. I began with cProfile because it is built into the standard library and seemed approachable.
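As an aside, cProfile can also be driven from code through the pstats module if a whole-script run is too coarse. A minimal sketch; `main()` here is a hypothetical stand-in for whatever you want to measure:

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
main()                      # run the code under measurement (hypothetical entry point)
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('tottime').print_stats(15)   # top 15 entries by total time
```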
The profiler is only any good if you actually exercise the various code paths through the program. I originally debated and fiddled with running it against my test suite, but the run times there are less meaningful due to the limited body of data being worked with. Instead, I've opted to simply run the profiler against my actual blog instance, with all of the post data that entails.
The command necessary to invoke cProfile is as follows:
$ python -m cProfile -s tottime ../static/static.py
511796 function calls (488533 primitive calls) in 0.557 seconds
| ncalls | tottime | percall | cumtime | percall | filename:lineno(function) |
|-----------+---------+---------+---------+---------+------------------------------------------------|
| 275 | 0.154 | 0.001 | 0.155 | 0.001 | {built-in method io.open} |
| 57279 | 0.050 | 0.000 | 0.050 | 0.000 | {method 'match' of '_sre.SRE_Pattern' objects} |
| 78 | 0.017 | 0.000 | 0.017 | 0.000 | {built-in method builtins.compile} |
| 2666/2423 | 0.015 | 0.000 | 0.054 | 0.000 | mistune.py:550(manipulate) |
| 79503 | 0.014 | 0.000 | 0.014 | 0.000 | {built-in method builtins.getattr} |
| 15865 | 0.012 | 0.000 | 0.026 | 0.000 | nodes.py:164(iter_child_nodes) |
| 201 | 0.010 | 0.000 | 0.011 | 0.000 | sre_compile.py:248(_optimize_charset) |
| 3288 | 0.009 | 0.000 | 0.023 | 0.000 | lexer.py:594(tokeniter) |
| 51080 | 0.009 | 0.000 | 0.009 | 0.000 | {built-in method builtins.isinstance} |
| 26671 | 0.008 | 0.000 | 0.011 | 0.000 | nodes.py:148(iter_fields) |
| 57 | 0.007 | 0.000 | 0.007 | 0.000 | {built-in method marshal.loads} |
| 3552 | 0.007 | 0.000 | 0.007 | 0.000 | {method 'sub' of '_sre.SRE_Pattern' objects} |
| 6435/154 | 0.007 | 0.000 | 0.078 | 0.001 | visitor.py:34(visit) |
| 1534/1076 | 0.006 | 0.000 | 0.034 | 0.000 | mistune.py:216(manipulate) |
Above, I've sorted on total time taken. Reading the reported results is a bit of a struggle, but if you squint you can see at least one problem hinted at: the single most expensive operation is file I/O. I might have guessed as much for this kind of program, but I have no context for whether that number is too high.
The same profile, sorted on cumulative time, results in the following (abbreviated):
| ncalls | tottime | percall | cumtime | percall | filename:lineno(function) |
|--------+---------+---------+---------+---------+------------------------------------|
| 151/1 | 0.006 | 0.000 | 0.577 | 0.577 | {built-in method builtins.exec} |
| 1 | 0.000 | 0.000 | 0.577 | 0.577 | static.py:18(<module>) |
| 1 | 0.001 | 0.001 | 0.388 | 0.388 | static.py:106(create_posts) |
| 77 | 0.000 | 0.000 | 0.196 | 0.003 | environment.py:791(get_template) |
| 77 | 0.001 | 0.000 | 0.195 | 0.003 | environment.py:768(_load_template) |
| 37 | 0.000 | 0.000 | 0.186 | 0.005 | static.py:43(templatize_post) |
| 77 | 0.000 | 0.000 | 0.184 | 0.002 | loaders.py:99(load) |
| 77 | 0.001 | 0.000 | 0.173 | 0.002 | environment.py:525(compile) |
| 275 | 0.161 | 0.001 | 0.162 | 0.001 | {built-in method io.open} |
| 37 | 0.000 | 0.000 | 0.114 | 0.003 | mistune.py:971(__call__) |
| 37 | 0.000 | 0.000 | 0.114 | 0.003 | mistune.py:981(parse) |
| 37 | 0.001 | 0.000 | 0.109 | 0.003 | mistune.py:1023(output) |
The cumulative measure groups the various parts so as to highlight which portions of the code are the hottest. In my case that's a bit lost because there is effectively a single code path at the top level (`create_posts`). The interesting portion is how many times `get_template` is being called and how relatively expensive it is (the same for `_load_template`, both of which comprise most of `templatize_post`). I am, at this point, suspicious of my file access patterns because the profile seems to indicate 37 calls to template a post, but 77 calls to get the template, and 275 calls to open a file. For reference, at the time of profiling, I have 37 posts saved as Markdown files.
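One way to chase down those file opens, which I didn't actually do above but which cProfile supports: dump the stats to a file with `-o` and ask pstats which functions are responsible. A sketch (the `static.prof` filename is my own invention):

```python
# First dump the profile: python -m cProfile -o static.prof ../static/static.py
import pstats

stats = pstats.Stats('static.prof')
stats.print_callers('io.open')   # which callers account for the 275 opens?
```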
At this point, I think a diversion into a line-by-line profiler may be interesting. Rather than diving into my code to spot the mistakes based on cProfile alone, I thought I would assess the other tool in my profiling arsenal. Kernprof is a third-party package that aims to demystify profiling, and I think the results are pretty good. Although I couldn't coerce pip into managing a download of line_profiler, I pulled the source off of GitHub and installed the one missing dependency (Cython) in a few minutes without issue. Quoting the project README here:

> kernprof will create an instance of LineProfiler and insert it into the `__builtins__` namespace with the name `profile`. It has been written to be used as a decorator, so in your script, you decorate the functions you want to profile with `@profile`.
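One practical wrinkle, my own note rather than the README's: once a function is decorated, running the script outside kernprof raises a `NameError`, since `profile` no longer exists. A common workaround is a no-op fallback:

```python
# If kernprof isn't driving the script, 'profile' won't be in builtins;
# fall back to a do-nothing decorator so the script still runs normally.
try:
    profile
except NameError:
    def profile(func):
        return func
```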
I wrapped my "top-level" function seen above (`create_posts`) with the `@profile` decorator and launched kernprof with the following:
$ kernprof -l -v ../static/static.py
Wrote profile results to static.py.lprof
Timer unit: 1e-06 s
Total time: 0.560446 s
File: ../static/static.py
Function: create_posts at line 105
Line # Hits Time Per Hit % Time Line Contents
==============================================================
105 @profile
106 def create_posts(self):
107 1 5 5.0 0.0 posts_dir = self.config['posts_dir']
108 1 1 1.0 0.0 output_dir = self.config['output_dir']
109
110 38 1074 28.3 0.2 for directory, filename in self.collect_posts(posts_dir):
111 37 440 11.9 0.1 post_file = os.path.join(directory, filename)
112 37 6604 178.5 1.2 raw_text = read_file(post_file)
113 37 476 12.9 0.1 header_string, body = self.split_post(raw_text)
114 37 8421 227.6 1.5 post = self.parse_post_parts(header_string, body)
115 37 156343 4225.5 27.9 post['body'] = self.markdown(post['body'])
116 37 306538 8284.8 54.7 templated_html = self.templatize_post(post)
117 37 558 15.1 0.1 relative_dir = re.sub(posts_dir, '', directory, count=1)
118
119 # in case I want to directly specify the generated URI/filename (as
120 # in the case of an index) without having to title it "index"
121 37 33 0.9 0.0 if 'altname' in post:
122 2 1 0.5 0.0 _filename = post['altname']
123 else:
124 35 1814 51.8 0.3 _filename = '{}.html'.format(slugify(post['title']))
125
126 37 285 7.7 0.1 post['path'] = os.path.join(relative_dir, _filename)
127 37 243 6.6 0.0 full_path = os.path.join(output_dir, relative_dir, _filename)
128 37 77410 2092.2 13.8 write_output_file(templated_html, full_path)
129
130 37 86 2.3 0.0 if 'date' in post:
131 36 72 2.0 0.0 self.all_posts.append(post)
132
133 1 42 42.0 0.0 self.all_posts = self.sort_posts_by_date(self.all_posts)
Which, coming from cProfile as viewed above, is almost magical. Somewhat comfortingly, most of the body of `create_posts` is hit 37 times, once for each post being created. This confirms the problem hinted at in the output from cProfile above, but I am at least not doing anything totally asinine.
I will admit some surprise at finding that templating the post body takes almost exactly twice as long as parsing its Markdown. Mistune's claims about performance appear to be spot on.
In case you haven't spotted my mistake, one that I am apparently paying quite dearly for: for each of the 37 posts, I am reloading the post template, which never changes. Obviously this should be loaded once within the class instead.
Applying the following patch:
diff --git a/static.py b/static.py
index f540e48..6480f21 100755
--- a/static.py
+++ b/static.py
@@ -36,0 +37,2 @@ class StaticGenerator():
+ self._env = Environment(loader=FileSystemLoader(self.config['templates_dir']))
+ self._post_template = self._env.get_template('post.html')
@@ -43,4 +45,2 @@ class StaticGenerator():
- def templatize_post(self, post_obj, template_name='post.html'):
- env = Environment(loader=FileSystemLoader(self.config['templates_dir']))
- template = env.get_template(template_name)
- return template.render(post=post_obj)
+ def templatize_post(self, post_obj):
+ return self._post_template.render(post=post_obj)
Results in the following improvement:
$ kernprof -l -v ../static/static.py
Wrote profile results to static.py.lprof
Timer unit: 1e-06 s
Total time: 0.257144 s
File: ../static/static.py
Function: create_posts at line 105
Line # Hits Time Per Hit % Time Line Contents
==============================================================
115 37 155833 4211.7 60.6 post['body'] = self.markdown(post['body'])
116 37 11963 323.3 4.7 templated_html = self.templatize_post(post)
$ python -m cProfile -s cumulative ../static/static.py
287730 function calls (280991 primitive calls) in 0.411 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
79/1 0.007 0.000 0.411 0.411 {built-in method builtins.exec}
1 0.000 0.000 0.411 0.411 static.py:18(<module>)
1 0.001 0.001 0.214 0.214 static.py:105(create_posts)
167 0.147 0.001 0.147 0.001 {built-in method io.open}
37 0.000 0.000 0.121 0.003 mistune.py:971(__call__)
37 0.000 0.000 0.121 0.003 mistune.py:981(parse)
37 0.001 0.000 0.117 0.003 mistune.py:1023(output)
76/10 0.000 0.000 0.090 0.009 <frozen importlib._bootstrap>:966(_find_and_load)
76/10 0.000 0.000 0.090 0.009 <frozen importlib._bootstrap>:939(_find_and_load_unlocked)
73/10 0.000 0.000 0.088 0.009 <frozen importlib._bootstrap>:659(_load_unlocked)
57/10 0.000 0.000 0.088 0.009 <frozen importlib._bootstrap_external>:656(exec_module)
94/10 0.000 0.000 0.085 0.008 <frozen importlib._bootstrap>:214(_call_with_frames_removed)
1 0.000 0.000 0.078 0.078 static.py:60(copy_media)
...
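A side note on the design choice, since it isn't obvious from the patch: with a single shared Environment, even calling `get_template` on every post would be reasonably cheap, because Jinja2 caches compiled templates per environment. A sketch of that alternative, under the assumption the templates live in a `templates/` directory:

```python
from jinja2 import Environment, FileSystemLoader

# One environment for the program's lifetime; get_template() hits
# Jinja2's compiled-template cache on every call after the first.
env = Environment(loader=FileSystemLoader('templates'))

def templatize_post(post_obj, template_name='post.html'):
    return env.get_template(template_name).render(post=post_obj)
```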
It is easy to imagine this process scaling up to more "real world" projects along the same lines as my experience here. I am now left eyeing other potential areas for improvement based on the above information, like the roughly 35% of total time still spent in file I/O. I'll be weighing the potential benefits of adding a file cache to prevent regeneration of unchanged posts against the complexity overhead that would necessarily accompany such a solution (a rough sketch of the idea follows below). For now, I am just really pleased I managed to claw back a 25% performance improvement in an evening.
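If I do pursue the cache, the simplest version I can imagine is a modification-time check; a hypothetical, untested sketch:

```python
import os

def needs_rebuild(source_path, output_path):
    """Rebuild only when the output is missing or older than its source.

    A sketch of the idea; a real version would also have to notice
    template changes, which touch every generated page.
    """
    if not os.path.exists(output_path):
        return True
    return os.path.getmtime(source_path) > os.path.getmtime(output_path)
```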