Sunday, September 7, 2014

Wrapping my head around python asyncio

Trying to understand Python's asyncio is a challenge.  First, I personally don't know which is more difficult: multi-threaded programming or event-driven programming.  Multi-threaded programming has the difficulty of finding and eliminating race conditions, deadlocks, and livelocks.  Event-driven programming has the difficulty of a non-intuitive flow of control and many layers of abstraction and indirection.  So where do we even start?  If you just start reading the official documentation on asyncio, you probably won't get too far.  Reading PEP 3156 won't get you much farther, though I do recommend studying both.

My main motivation for learning asyncio is probably a little unusual.  I wanted to write something like pexpect, without using pexpect.  In a nutshell, I wanted to interact with a child subprocess, perhaps more than once.  Python's subprocess module doesn't quite let you do this, even though Popen.communicate() may seem to.  The problem is that it is a "one-shot" communication: you feed it one string and then you are done.  But what if you need to answer multiple prompts from your child process?
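To see what I mean, here is a small sketch of the one-shot limitation ('child.py' and the input are placeholders, not from any real project):

import subprocess

p = subprocess.Popen(['python', 'child.py'],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
# communicate() writes everything, closes stdin, and waits for the child to exit
out, err = p.communicate(input=b'first answer\n')
# There is no second communicate(); the child has already seen EOF on stdin,
# so a second prompt can never be answered this way.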

So where can we start?  I'm learning this too, so as I go, I'll introduce more examples.  Let's begin with a small example of calling a subprocess using asyncio.  I won't explain it in detail in this post.

I will, however, briefly explain what coroutines are.  In a nutshell, a coroutine is a way to factor out code that uses yield.  The reason this needs special support is that yield is "contagious": the very presence of yield in a function turns that function into a generator.  So what do you do when you realize that some code you have that uses yield could be factored out into its own function?  That is what the new "yield from" expression is for, and its use is how you can spot a coroutine.
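As a tiny illustration (the names here are made up for the example), suppose you want to move part of a generator's body into a helper.  A plain call won't yield anything, and a plain yield in the helper just creates another generator; yield from is what delegates to it:

def step(n):
    # helper generator: the factored-out code that itself yields
    yield n
    yield n * 2

def countdown():
    for i in (3, 2, 1):
        yield from step(i)  # delegate to the helper generator

print(list(countdown()))  # [3, 6, 2, 4, 1, 2]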


subproc_shell.py

"""
Took the example from the tulip project and modified it to make it more like pexpect
"""

import asyncio
import os
from asyncio.subprocess import PIPE, STDOUT
import re


@asyncio.coroutine
def send_input(writer, input, que, regex):
    """
    The coroutine that will send its input to the input stream (usually stdin)

    :param writer: The stream where input will go (usually stdin)
    :param input: A sequence of bytes (not strings)
    :param que: an asyncio.Queue used to check if we have what we need
    :param regex: a re.compile() object used to see if an item from the que matches
    :return: None
    """
    input.reverse()  # We have to reverse because we pop() from the end
    try:
        while input:
            item = yield from que.get()
            #print("Pulled from queue:", repr(item))
            if item is None:
                break
            m = regex.match(item.decode())
            if m:
                line = input.pop()
                #print('sending {} bytes'.format(len(line)))
                writer.write(line)
                d = writer.drain()
                if d:
                    # writer.drain() returns a generator
                    yield from d
                    #print('resume writing')
        writer.close()  # close stdin only after all the input has been sent
    except asyncio.QueueEmpty:
        pass
    except BrokenPipeError:
        print('stdin: broken pipe error')
    except ConnectionResetError:
        print('stdin: connection reset error')
    except Exception as ex:
        print(ex)


@asyncio.coroutine
def log_errors(reader):
    while True:
        line = yield from reader.read(512)
        if not line:
            break
        print('ERROR', repr(line))


@asyncio.coroutine
def read_stdout(stdout, que):
    """
    The coroutine that reads non-blocking from a reader stream

    :param stdout: the stream we will read from
    :param que: an asyncio.Queue object we put lines into
    """
    while True:
        line = yield from stdout.read(512)  # use this instead of readline() so we don't pause on a newline
        print('Received from child:', repr(line))
        que.put_nowait(line)  # put the line into the que, so it can be read by send_input()
        if not line:
            que.put_nowait(None)  # A sentinel so that when send_input() pulls this from the que, it will stop
            break


@asyncio.coroutine
def start(cmd, inp=None, queue=None, shell=True, wait=True, **kwargs):
    """
    Kicks off the subprocess

    :param cmd: str of the command to run
    :param inp: a list of bytes objects to feed to the child's stdin
    :param queue: an optional asyncio.Queue (one is created if not given)
    :param shell: if True, run cmd through the shell
    :param wait: if True, wait for the child to exit and print its exit code
    :param kwargs: extra keyword arguments for the subprocess creation function
    :return: the process object if wait is False
    """
    kwargs['stdout'] = PIPE
    kwargs['stderr'] = STDOUT
    if inp is None and 'stdin' not in kwargs:
        kwargs['stdin'] = None
    else:
        kwargs['stdin'] = PIPE

    fnc = asyncio.create_subprocess_shell if shell else asyncio.create_subprocess_exec
    proc = yield from fnc(cmd, **kwargs)

    q = queue or asyncio.Queue()  # Stores our output from read_stdout and pops off (maybe) from send_input
    regex = re.compile("Reset counter")

    tasks = []
    if proc.stdout is not None:
        tasks.append(read_stdout(proc.stdout, q))
    else:
        print('No stdout')
    if inp is not None:
        tasks.append(send_input(proc.stdin, inp, q, regex))
    else:
        print('No stdin')

    # stderr is redirected to stdout above, so proc.stderr is always None here;
    # this disabled block shows how a separate stderr would be logged
    if 0:
        if proc.stderr is not None:
            tasks.append(log_errors(proc.stderr))
        else:
            print('No stderr')

    if tasks:
        # feed stdin while consuming stdout to avoid hang
        # when stdin pipe is full
        yield from asyncio.wait(tasks)

    if wait:
        exitcode = yield from proc.wait()
        print("exit code: %s" % exitcode)
    else:
        return proc


def main():
    if os.name == 'nt':
        # subprocess pipes on Windows require the proactor event loop
        loop = asyncio.ProactorEventLoop()
        asyncio.set_event_loop(loop)
    else:
        loop = asyncio.get_event_loop()
    loop.run_until_complete(start('c:\\python34\\python.exe dummy.py',
                                  inp=[str(x).encode() for x in (3, 3, 0)]))
    loop.close()


if __name__ == '__main__':
    main()
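I don't show dummy.py here, but a minimal stand-in that would exercise the "Reset counter" regex might look like the sketch below.  This is purely an assumption about the child; note that it reads single bytes, since the parent sends b'3', b'3', b'0' without trailing newlines:

# dummy.py -- hypothetical child process (an assumption, not shown in this post).
# Prompts with "Reset counter", reads one byte at a time, and exits on "0" or EOF.
import sys

while True:
    sys.stdout.write("Reset counter> ")
    sys.stdout.flush()
    ch = sys.stdin.read(1)
    if not ch or ch == "0":
        break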

OpenStack clojure tool

So, I'm at Red Hat now :)  I'm a Quality Engineer working on OpenStack, which is a new direction for me.  This is the first time I have worked on a 100% software project.  Well, that's not entirely true, I guess: I did spend 18 months designing an automation framework from scratch.  Nevertheless, this is new and interesting.

That being said, OpenStack is written almost entirely in Python.  Namely Python 2.7.  Ugh.

Why the moaning?  I used to be a big rah-rah Python guy.  A former co-worker of mine even jokingly wondered if Guido van Rossum was paying me money to try to switch our company over to Python (which, by the way, I pretty much single-handedly did).  However, over the years, I have come to find many pain points with the language that have considerably dimmed my enjoyment of it.  Now, hopefully I won't get any flames.  Python isn't a bad language, and it has quite a few interesting features.  I just find myself longing for some things Python lacks.  And indeed, with some really interesting newer systems programming languages (compiled or JIT'ed) like Go, Julia, and even Swift, I really wonder how much wind will be taken out of Python's sails in the next 5 years.  And that doesn't even factor in the non-systems programming languages like Clojure, Elixir, or even TypeScript and LiveScript (a Haskell-like variant of JavaScript).

Duck Typing ain't enough
I think Python 3 came to this realization with its function argument annotations, so Python 3 doesn't suffer from this problem as badly as Python 2 does.  Type hinting is the way to go: it allows the developer to rapidly prototype an idea, and then, for performance or documentation reasons, add types to the arguments and return value later.  It would be nicer if Python were like Julia (or Typed Clojure) and allowed even locals to be optionally typed.
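For instance, Python 3's function annotations (PEP 3107) give you a place to record the intended types; the interpreter does not enforce them, but readers and tools can use them (scale is a made-up example):

def scale(vec: list, factor: float) -> list:
    # The annotations document intent; Python itself does not check them
    return [x * factor for x in vec]

print(scale([1, 2, 3], 2.0))   # [2.0, 4.0, 6.0]
print(scale.__annotations__)   # {'vec': list, 'factor': float, 'return': list} (order may vary)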

When code bases grow to many tens of thousands of lines or more, you just look at a function and wonder, "OK, what kind of value am I supposed to pass in?"  Some of you may be saying that's what a good docstring is for.  I would agree, except that we all know the first thing to bit-rot is documentation.

Moreover, duck typing can lead to unintended problems.  Perhaps you want to pass in an object that supports the method quack().  Unfortunately, the user happens to pass in a BadDoctor object, and your function happily calls its quack() method for you.
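Fleshing out that example (Duck and make_it_quack are made-up names):

class Duck:
    def quack(self):
        return "Quack!"

class BadDoctor:
    def quack(self):
        return "These miracle pills will cure anything!"

def make_it_quack(duck):
    # Duck typing: anything with a quack() method is accepted,
    # whether or not it is actually a duck
    print(duck.quack())

make_it_quack(Duck())       # fine
make_it_quack(BadDoctor())  # also "works", but probably not what you intended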

Hard to make constants (immutables)
Basically, if you truly want to make an immutable object in Python, you'll need to subclass int, tuple, str, or some other immutable built-in type.  And it's a little odd to do so: it's one of the few places where implementing __new__() is required.  I often tell people that Python's __init__ is not the constructor; __new__ is.  It is __new__ that actually allocates the memory for the object, and __init__ that initializes the allocated memory.  If you have an immutable object, you have to give it a value as soon as it is created.
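A minimal sketch of that pattern (Point is a made-up example).  The value must be supplied in __new__, because by the time __init__ runs, the tuple's contents are already fixed:

class Point(tuple):
    def __new__(cls, x, y):
        # tuple is immutable, so the contents must be set at allocation time
        return super().__new__(cls, (x, y))

    @property
    def x(self):
        return self[0]

    @property
    def y(self):
        return self[1]

p = Point(3, 4)
print(p.x, p.y)   # 3 4
# p[0] = 5 would raise TypeError: 'Point' object does not support item assignment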

Another way to make "read-only" objects is to use a setter property.  It's not fool-proof, but it does let you make a mostly read-only object.  You could also reimplement __getattr__ and __setattr__ for the class and have them vet what you are trying to access.  And lastly, you could write a C(++) extension module for the data structure, since C does have const.  But really, would you want to do that?
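Here's a sketch of the __setattr__ approach (ReadOnly is a made-up name); as noted, it's not fool-proof, since object.__setattr__ can still bypass it:

class ReadOnly:
    def __init__(self, value):
        # bypass our own __setattr__ to set the initial value
        object.__setattr__(self, 'value', value)

    def __setattr__(self, name, val):
        raise AttributeError("%s is read-only" % name)

r = ReadOnly(42)
print(r.value)    # 42
# r.value = 99 would raise AttributeError: value is read-only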

Performance
PyPy aside, Python's performance leaves something to be desired.  It also seems that Guido is totally unfazed by Python's performance and thinks it's good enough.  I was quite startled to learn recently how well the V8 JavaScript engine performs; on many benchmarks it lands, on average, in roughly the same ballpark as PyPy.  But OpenStack requires regular CPython, mainly because of lots of dependencies on modules with C extensions (when you install OpenStack from devstack or packstack, you'll see some source compilation going on).

The Browser as the new VM
Like it or not, the browser is kind of the new VM.  That means JavaScript is becoming as important as C or Java, and just as ubiquitous.  Having an application that can run virtually anywhere, including mobile devices, is not to be scoffed at.  Also, I was surprised to learn the new tricks HTML5 has up its sleeve.  These include a File System API so that you can finally read and write local files (albeit in a sandboxed file system), the WebSocket API, WebGL, and drag-and-drop support, just to mention a few.  Since JavaScript is the de facto language of the browser, that means, for better or worse, learning JavaScript.  There are quite a few Python-to-JavaScript compilers out there, including pyjs and Brython.  However, they are not developed by the core Python team, so I wonder if/when support will end.  And Brython only supports Python 3.

No persistent data structures built-in
So there is pysistence.  But being a non-standard, third-party library, and with the lack of type declarations, users will never be sure whether persistent or non-persistent data types are being used.  Why would we want persistent data structures by default?  This page and this one sum it up pretty well.
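The short version of the argument: Python's mutable containers are shared by reference, so innocent-looking code can mutate data out from under you (make_config and the names below are made up for illustration):

defaults = {'retries': 3, 'timeout': 30}

def make_config(overrides):
    config = defaults          # NOT a copy -- both names refer to one dict
    config.update(overrides)   # silently mutates the module-level defaults
    return config

make_config({'timeout': 5})
print(defaults['timeout'])     # 5 -- the "constant" defaults were corrupted

# A persistent dict's update would instead return a new mapping and leave
# the original untouched, making this class of bug impossible.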

Lack of good concurrency
Python, thanks to the GIL, doesn't really have true parallelism within a single process.  There is multiprocessing, which fires up a new Python interpreter, but it has some limitations (like the arguments needing to be pickle-able, especially on Windows) which can be a real pain.  Also, since a new Python interpreter gets fired up for each process, Python developers can't really laugh at the JVM's large memory consumption once you start firing up 20+ processes.  Hopefully PyPy will solve this problem with Software Transactional Memory.
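A quick sketch of the pickling limitation: whatever you hand to a worker process has to survive a pickle round trip, so lambdas (and many other objects) are out:

import multiprocessing

def square(x):                  # module-level functions pickle fine
    return x * x

if __name__ == '__main__':
    with multiprocessing.Pool(4) as pool:
        print(pool.map(square, range(5)))        # [0, 1, 4, 9, 16]
        # pool.map(lambda x: x * x, range(5))    # raises PicklingError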