benchmarks and lxml

Martijn Faassen · 2005-01-24 · via Secret Weblog

The recent cElementTree release is causing some waves in the Python/XML community. It started when Uche Ogbuji posted "The Python Community has too many deceptive XML benchmarks" to his blog.

The effbot was not amused, as could be witnessed by his comment on it, and the blog entries:

http://online.effbot.org/2005_01_01_archive.htm#sigh
http://online.effbot.org/2005_01_01_archive.htm#faking-it
http://online.effbot.org/2005_01_01_archive.htm#faking-it-2
http://online.effbot.org/2005_01_01_archive.htm#faking-it-3

The problem is that Uche unwittingly introduced a benchmark that is rather.. deceptive. He has been testing the time taken by the whole program, including startup and shutdown of the Python interpreter, module importing, and the like, instead of the part where XML processing takes place. Unless you're writing command line scripts or classic CGI web applications, Python startup time is hardly relevant, and shouldn't be part of the measurement.

A while back while developing lxml.etree I was curious what benchmark Fredrik was using. I couldn't find the information on the web, but he told me when I mailed him about it. He was using the simple, obvious strategy which I myself had already been using:

.. imports ..
start = time.time() # time.clock() on windows
.. do the actual work ..
end = time.time()
print end - start
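
For readers on Python 3, the same idea can be sketched as a small helper. This is only an illustration: time_call and do_work are hypothetical names that do not appear in Fredrik's setup, and time.perf_counter() is swapped in for time.time() because it is the clock the standard library recommends for measuring short intervals.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds).

    Only the call itself is timed; interpreter startup and module
    imports happen before the clock starts.
    """
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def do_work(n):
    # Hypothetical stand-in for the XML processing being benchmarked.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    total, seconds = time_call(do_work, 100_000)
    print(f"{seconds:.4f}s")
```

Wrapping only the call keeps startup and import cost out of the number, which is the whole point of the strategy.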

To measure approximate memory usage, he puts in a pause in the program before and after the processing, and checks the process overview on his machine manually.
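
A rough programmatic alternative to watching the process overview by hand is to ask the operating system for the peak resident set size before and after the work. This is not what Fredrik does; it is a sketch that assumes a Unix system (the resource module does not exist on Windows), and measure_peak_growth and build_list are hypothetical names. The unit of ru_maxrss also depends on the platform: kilobytes on Linux, bytes on macOS.

```python
import resource


def peak_rss():
    """Return this process's peak resident set size so far.

    Units are platform dependent: kilobytes on Linux, bytes on macOS.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def measure_peak_growth(func, *args):
    """Run func and return (result, growth of the peak RSS during the call)."""
    before = peak_rss()
    result = func(*args)
    after = peak_rss()
    return result, after - before


def build_list(n):
    # Hypothetical stand-in for parsing a document into memory.
    return list(range(n))


if __name__ == "__main__":
    data, growth = measure_peak_growth(build_list, 1_000_000)
    print(len(data), growth)
```

Because this reports a peak rather than current usage, it only shows how much the high-water mark rose during the call, so like the manual approach it is approximate.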

I've replicated his results with cElementTree and ElementTree fairly well, though my machine is a bit different in its performance characteristics due to platform differences. See other blog entries for more info on this.

For fun, I thought I'd try Uche's benchmark against lxml.etree on this machine. I've also tested it against cElementTree (an older version, I can't keep up with Fredrik's releases; hm, no __version__ string I can find, so don't know what 0.9.x version it is.. reminds me to add one to lxml when the time comes for a release..).

Here's Uche's program adjusted for etree. As you can see, only the import statement needs to change:

import lxml.etree as ElementTree

tree = ElementTree.parse("ot.xml")
for v in tree.findall("//v"):
    text = v.text
    if text.find(u'begat') != -1:
        print text
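
The same loop can be tried on Python 3 without ot.xml at hand. The sketch below makes several assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml.etree, the SAMPLE document is invented and only mimics the v elements the program looks for, and find_begat is a hypothetical helper that collects the matches instead of printing them.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the real ot.xml is much larger.
SAMPLE = """<doc>
  <v>Adam begat Seth</v>
  <v>In the beginning</v>
  <v>Seth begat Enos</v>
  <v/>
</doc>"""


def find_begat(xml_text):
    """Return the text of every v element that contains 'begat'."""
    root = ET.fromstring(xml_text)
    matches = []
    for v in root.iter("v"):
        text = v.text
        # An empty element has text None, so guard before searching.
        if text and "begat" in text:
            matches.append(text)
    return matches


if __name__ == "__main__":
    for line in find_begat(SAMPLE):
        print(line)
```

Returning a list rather than printing inside the loop also makes it easier to time the search separately from the output.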

I've also rewritten it to use xpath instead:

from lxml import etree as ElementTree

tree = ElementTree.parse("ot.xml")
for text in tree.xpath("//v[contains(., 'begat')]/text()"):
    print text

Since this program is printing stuff, and printing overhead can be large, I've tried a number of tests:

A. Unix 'time' command, print to stdout on Gnome terminal
B. Unix 'time' command, redirect output to file
C. time.time(), print to stdout on Gnome terminal
D. time.time(), redirect output to file

Here are the results:

                  A      B      C      D
-----------------------------------------
cElementTree      1.06s  0.32s  0.9s   0.23s
lxml.etree        1.2s   0.43s  1.1s   0.36s
lxml.etree xpath  0.53s  0.25s  0.42s  0.17s

As you can see from the results, the type of terminal you're printing to matters a lot. In the xpath tests, more than half of the time is spent printing to the terminal (0.53s versus 0.25s, and 0.42s versus 0.17s), and for the other tests the overhead is even larger.
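
The printing overhead can be estimated from the table by comparing each terminal run with its redirected counterpart. The sketch below does only that arithmetic on the numbers reported above; terminal_share and shares are hypothetical helper names, and the result is a rough estimate from single runs, not a careful measurement.

```python
# Timings in seconds, copied from the results table.
# A, C: output printed to the terminal; B, D: output redirected to a file.
# A, B: measured with the Unix 'time' command; C, D: measured with time.time().
RESULTS = {
    "cElementTree": {"A": 1.06, "B": 0.32, "C": 0.9, "D": 0.23},
    "lxml.etree": {"A": 1.2, "B": 0.43, "C": 1.1, "D": 0.36},
    "lxml.etree xpath": {"A": 0.53, "B": 0.25, "C": 0.42, "D": 0.17},
}


def terminal_share(to_terminal, to_file):
    """Fraction of the terminal run that goes away when output is redirected."""
    return (to_terminal - to_file) / to_terminal


def shares(results):
    """Map each test name to (share under 'time', share under time.time())."""
    return {
        name: (terminal_share(t["A"], t["B"]), terminal_share(t["C"], t["D"]))
        for name, t in results.items()
    }


if __name__ == "__main__":
    for name, (unix_time, in_process) in shares(RESULTS).items():
        print(f"{name:18s}{unix_time:6.0%}{in_process:6.0%}")
```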

Also note that at last I can claim a minor victory over cElementTree on my machine on this particular test! lxml.etree, when using xpath to do the task set, is faster than this version of cElementTree. Of course, most of the credit goes to libxml2's blazingly fast xpath implementation.

All this shows benchmarks are nice as there are so many to choose from.
