Tuesday, December 4, 2012

Python: performance comparison of itertools dropwhile and takewhile against simple generators

While wandering around the internet, I stumbled upon this thread, which discusses whether dropwhile and takewhile should be deprecated and later removed from the itertools library. According to Hettinger, who wrote itertools, using dropwhile and takewhile leads to less readable code, since everything they are used for can be implemented with generators. While reading, I thought: 'OK, maybe they are less readable, but isn't the purpose of itertools to provide a toolbox for looping efficiently, using mainly pure C code? Even if generators are more readable, which is debatable, there's no way generators can be faster than pure C code.' Since efficiency wasn't mentioned in the discussion, I decided to profile generators against dropwhile and takewhile myself.

The discussion mentioned above illustrates dropwhile/takewhile with the following use case: iterate over text delimited by start and end markers. I timed a generator-based and an itertools-based implementation of that use case.

The result was quite surprising: in this case the generator version is roughly 30-40% faster than the dropwhile/takewhile version. I tested the code with both Python 2.7.3 and Python 3.3.0 and got similar results (Python 3.3.0 being slower for both functions). Here is the timing code:

import timeit
from itertools import dropwhile, takewhile

FILE_PATH = "test_data/data/text_with_start_end_markers.txt"
# Read the whole file up front so that disk I/O is not part of the timing.
with open(FILE_PATH) as data_file:
    FILE = data_file.readlines()
START_MARKER = 'start_marker'
END_MARKER = 'end_marker'

def iter_block_generator(lines, start_marker, end_marker):
  lines = iter(lines)
  for line in lines:
    if line.startswith(start_marker):
      yield line
      break
  for line in lines:
    if line.startswith(end_marker):
      return
    yield line

def iter_block_itertools(lines, start_marker, end_marker):
  return takewhile(lambda x: not x.startswith(end_marker),
                   dropwhile(lambda x: not x.startswith(start_marker),
                             lines)
                  )


print("check that both solutions return the same result:")
join_using_itertools = \
        "".join(iter_block_itertools(FILE,
                                     START_MARKER,
                                     END_MARKER
                                    )
                )
join_using_generator = \
        "".join(iter_block_generator(FILE,
                                     START_MARKER,
                                     END_MARKER
                                     )
                )
assert join_using_itertools == join_using_generator
# Statements passed to timeit as strings; they are evaluated with the names
# imported by the setup string below.
iter_block_generator_func = \
        "''.join(iter_block_generator(FILE, START_MARKER, END_MARKER))"

iter_block_itertools_func = \
        "''.join(iter_block_itertools(FILE, START_MARKER, END_MARKER))"


for function in (iter_block_generator_func, iter_block_itertools_func):
    print(function)
    print(timeit.repeat(
        function,
        repeat=1,
        number=5,
        setup="from __main__ import "
              "iter_block_generator,"
              "iter_block_itertools,"
              "FILE,START_MARKER,END_MARKER"
    ))
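
To make the use case concrete without the data file, here is a minimal sketch that runs both approaches on a small in-memory list. The sample lines and the names SAMPLE_LINES, block_with_generator and block_with_itertools are made up for illustration; the two functions mirror the ones in the listing above.

```python
from itertools import dropwhile, takewhile

# Hypothetical sample input; the real data file from the post is not reproduced.
SAMPLE_LINES = [
    "preamble\n",
    "start_marker\n",
    "first\n",
    "second\n",
    "end_marker\n",
    "trailer\n",
]


def block_with_generator(lines, start_marker, end_marker):
    lines = iter(lines)
    # Skip everything up to the start marker, which is itself yielded.
    for line in lines:
        if line.startswith(start_marker):
            yield line
            break
    # Yield lines until the end marker, which is not yielded.
    for line in lines:
        if line.startswith(end_marker):
            return
        yield line


def block_with_itertools(lines, start_marker, end_marker):
    return takewhile(lambda x: not x.startswith(end_marker),
                     dropwhile(lambda x: not x.startswith(start_marker),
                               lines))


expected = ["start_marker\n", "first\n", "second\n"]
generator_result = list(
    block_with_generator(SAMPLE_LINES, "start_marker", "end_marker"))
itertools_result = list(
    block_with_itertools(SAMPLE_LINES, "start_marker", "end_marker"))
print(generator_result == itertools_result == expected)
```

Both versions include the start marker line and stop just before the end marker line, which is why the timing script can compare their joined output directly.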
       
I'm probably missing something, so let me know if you can find a good reason why the generators come out faster.
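
One hypothesis that could be tested (an assumption on my part, not something established above): the itertools version calls a Python-level lambda for every line, while the generator evaluates startswith inline, so the per-line function call may account for some of the gap. The sketch below adds a third variant, a generator that routes each test through a lambda, so the three can be compared on the same input. The synthetic data, its sizes, and the names block_inline, block_lambda, block_itertools and best_time are all hypothetical. It takes the minimum over several repeats rather than a single run, since a single run is more exposed to noise.

```python
import timeit
from itertools import dropwhile, takewhile

# Synthetic input (an assumption): 1000 lines before the block, 10000 inside.
LINES = (["noise\n"] * 1000 + ["start_marker\n"]
         + ["payload\n"] * 10000 + ["end_marker\n"])


def block_inline(lines, start, end):
    """Generator as in the post: the predicate is evaluated inline."""
    lines = iter(lines)
    for line in lines:
        if line.startswith(start):
            yield line
            break
    for line in lines:
        if line.startswith(end):
            return
        yield line


def block_lambda(lines, start, end):
    """Same generator, but each test goes through a lambda call."""
    not_start = lambda x: not x.startswith(start)
    not_end = lambda x: not x.startswith(end)
    lines = iter(lines)
    for line in lines:
        if not not_start(line):
            yield line
            break
    for line in lines:
        if not not_end(line):
            return
        yield line


def block_itertools(lines, start, end):
    return takewhile(lambda x: not x.startswith(end),
                     dropwhile(lambda x: not x.startswith(start), lines))


def best_time(func, number=10, repeat=3):
    """Return the fastest of `repeat` runs, each executing `number` joins."""
    run = lambda: "".join(func(LINES, "start_marker", "end_marker"))
    return min(timeit.repeat(run, repeat=repeat, number=number))


variants = (block_inline, block_lambda, block_itertools)

# All variants must produce the same text before timing them is meaningful.
outputs = set("".join(f(LINES, "start_marker", "end_marker")) for f in variants)
assert len(outputs) == 1

timings = dict((f.__name__, best_time(f)) for f in variants)
for name in sorted(timings):
    print("%-16s %.4f s" % (name, timings[name]))
```

If block_lambda lands close to block_itertools, that would point to call overhead as a factor; if it stays close to block_inline, the explanation lies elsewhere. The outcome will depend on the interpreter version and on how the lines are distributed around the markers.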