root/masterdriverz/use-expand/dev-notes/hacking.rst @ marienz%2540gentoo.org-20070116022654-7pfe6k6cw144kx3r

Revision marienz%2540gentoo.org-20070116022654-7pfe6k6cw144kx3r, 21.3 kB (checked in by Marien Zwart <marienz@…>, 2 years ago)

Minor addition to hacking doc.

Line 
1=========================
2 python code guidelines.
3=========================
4
5lingo:
6
7fex
8  for example
9imo
10  in my opinion; 'my' refering to the author
11
12Note not all of the existing code follows this.
13Doesn't mean existing code is correct however ;)
14
15Stats are all from a sempron 1.6ghz with python 2.4.2
16
17Finally, code _should_ be documented, following epydoc/epytext guidelines
18
19Follow pep8, with following exemptions
20======================================
21
22- <80 char limit is only applicable where it doesn't make the logic
23  ugly. This is not an excuse to have a 200 char if statement (fix
24  your logic). Use common sense.
25- combining imports is ok.
26- use absolute imports
27- _simple_ try/except combined lines are acceptable, but not forced
28  (your call). example::
29
30   try: l.remove(blah)
31   except IndexError: pass
32
33- for comments. 2 spaces trailing is pointless- not needed.
34- classes should be named SomeClass, functions/methods should be named
35  some_func.
36- Exceptions are classes.  Don't raise strings either.
37- avoid __var 'private' attributes unless you absolutely have a reason
38  to hide it, and the class won't be inherited (or that attribute
39  must _not_ be accessed)
40- using string module functions when you could use a string method is
41  evil. Don't do it.
42- use isinstance(str_instance, basestring) unless you _really_ need to
43   know if it's utf8/ascii
44
45Throw self with a NotImplementedError
46=====================================
47
48Reason for this is simple; if you just throw a NotImplementedError, you can't
49tell how the path was hit if derivative classes are involved; thus throw
50NotImplementedError(self, string_name_of_attr)
51
52Far better tbs.
53
54Be aware of what the interpretter is actually doing.
55====================================================
56
57Don't use len(list_instance) when you just want to know if it's
58nonempty/empty::
59
60  l=[1]
61  if l: blah
62  # instead of
63  if len(l): blah
64
65python looks for __nonzero__, then __len__. It's a fair sight faster
66then if you try to be explicit there::
67
68  python -m timeit -s 'l=[]' 'if len(l) > 0: pass'
69  1000000 loops, best of 3: 0.705 usec per loop
70
71  python -m timeit -s 'l=[]' 'if len(l): pass'
72  1000000 loops, best of 3: 0.689 usec per loop
73
74  python -m timeit -s 'l=[]' 'if l: pass'
75  1000000 loops, best of 3: 0.302 usec per loop
76
77don't explicitly use has_key.  rely on the 'in' operator.
78=========================================================
79
80::
81
82  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' 'd.has_key(1999999)'
83  1000000 loops, best of 3: 0.512 usec per loop
84
85  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' '1999999 in d'
86  1000000 loops, best of 3: 0.279 usec per loop
87
88Python interprets the 'in' command by using __contains__ on the
89instance. The interpretter is faster at doing getattr's then actual
90python code is; fex, the code above uses d.__contains__ , if you do
91d.has_key or d.__contains__, it's the same speed. Using 'in' instead
92has the interpretter do the lookup, and is faster.
93
94So... be aware of how the interpretter will execute that code. Python
95code specified attribute access is slower then the interpretter doing
96it on it's own.
97
98If in doubt, python -m timeit is your friend. ;-)
99
100Do not use [] or {} as default args in function/method definitions.
101===================================================================
102
103::
104
105  >>> def f(default=[]):
106  >>>   default.append(1)
107  >>>   return default
108  >>> print f()
109  [1]
110  >>> print f()
111  [1,1]
112
113When the function/class/method is defined, the default args are
114instantiated _then_, not per call. End result is that if it's a
115mutable default arg, you should use None and test; this is exempted if
116you _know_ the code doesn't mangle the default.
117
118Visible curried functions should have documentation.
119====================================================
120
121When using the currying methods (pkgcore.util.currying) for function
122mangling, preserve the documentation via pretty_docs.
123
124If this is exempted, pydoc output for objects isn't incredibly useful.
125
126unit testing.
127=============
128
129All code _should_ have test case functionality.  We use twisted.trial; should
130be running >=2.2 (<2.2 results in false positives in the spawn tests).
131Regressions should be test cased, exempting idiot mistakes (typos fex).
132
133More then willing to look at code that lacks tests, but
134merging/integrating the code requires tests.
135
136One area that is (atm) exempted from this is the ebuild interaction;
137testing that interface is extremely hard, although it _does_ need to
138be implemented.
139
140If tests are missing from code (author didn't write tests initially),
141new tests desired. :)
142
143
144If it's FS related code, it's _usually_ cheaper to try then to ask then try.
145============================================================================
146
147...but you should verify it ;)
148
149
150existing file (but empty to avoid reading overhead)::
151
152  echo > dar
153
154  python -m 'timeit' -s 'import os' 'os.path.exists("dar") and open("dar").read()'
155  10000 loops, best of 3: 36.4 usec per loop
156
157  python -m 'timeit' -s 'import os' $'try:open("dar").read()\nexcept IOError: pass'
158  10000 loops, best of 3: 22 usec per loop
159
160nonexistant file::
161
162  rm foo
163
164  python -m 'timeit' -s 'import os' 'os.path.exists("foo") and open("foo").read()'
165  10000 loops, best of 3: 29.8 usec per loop
166
167  python -m 'timeit' -s 'import os' $'try:open("foo").read()\nexcept IOError: pass'
168  10000 loops, best of 3: 27.7 usec per loop
169
170Bit of a difference, no?
171
172Note that I qualified this with "If it's FS related code"; syscalls
173aren't cheap- if it's not triggering syscalls, next section is
174relevant.
175
176Catching Exceptions in python code (rather then cpython) isn't cheap.
177=====================================================================
178
179stats from python-2.4.2
180
181When an exception is caught::
182
183  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'try: d[1999]\nexcept KeyError: pass'
184  100000 loops, best of 3: 8.7 usec per loop
185
186  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'1999 in d and d[1999]'
187  1000000 loops, best of 3: 0.492 usec per loop
188
189When no exception is caught, overhead of try/except setup::
190
191  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'try: d[0]\nexcept KeyError: pass'
192  1000000 loops, best of 3: 0.532 usec per loop
193
194  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'d[0]'
195  1000000 loops, best of 3: 0.407 usec per loop
196
197
198Not advocating writing code that doesn't protect itself- just be aware
199of what the code is actually doing, and be aware that exceptions in
200python code are costly due to machinery involved.
201
202Another example is when to use or not to use dict's setdefault or get methods:
203
204key exists::
205
206  # Through exception handling
207  python -m timeit -s 'd=dict.fromkeys(range(100))' 'try: x=d[1]' 'except KeyError: x=42'
208  1000000 loops, best of 3: 0.548 usec per loop
209
210  # d.get
211  python -m timeit -s 'd=dict.fromkeys(range(100))' 'x=d.get(1, 42)'
212  1000000 loops, best of 3: 1.01 usec per loop
213
214
215key doesn't exist::
216
217  # Through exception handling
218  python -m timeit -s 'd=dict.fromkeys(range(100))' 'try: x=d[101]' 'except KeyError: x=42'
219  100000 loops, best of 3: 8.8 usec per loop
220
221  # d.get
222  python -m timeit -s 'd=dict.fromkeys(range(100))' 'x=d.get(101, 42)'
223  1000000 loops, best of 3: 1.05 usec per loop
224
225
226Short version? If you know the key is there, get is slower. If you
227don't, get is your friend. IOW, use it instead of doing a containment
228test then accessing the key.
229
230Of course this only considers the case where the default value is
231simple. If it's something more costly "except" will do relatively
232better since it's not constructing the default value if it's not
233needed. So if in doubt and in a performance-critical piece of code:
234benchmark parts of it with timeit instead of assuming "exceptions are
235slow" or "[] is fast".
236
237cpython 'leaks' vars into local namespace for certain constructs
238================================================================
239
240::
241
242  def f(s):
243      while True:
244          try:
245              some_func_that_throws_exception()
246          except Exception, e:
247              # e exists in this namespace now.
248              pass
249          # some other code here...
250
251from the code above, e bled into the f namespace- that's referenced
252memory that isn't used, and will linger until the while loop exits.
253
254Python _does_ bleed variables into the local namespace- be aware of
255this, and explicitly delete references you don't need when dealing in
256large objs, especially dealing with exceptions::
257
258  class c:
259      d = {}
260      for x in range(1000):
261          d[x] = x
262
263Granted the class above is contrived, but the thing to note is that
264c.x is now valid- the x from the for loop bleeds into the class
265namespace and stays put.
266
267Don't leave uneeded vars lingering in class namespace.
268
269Variables that leak from for loops _normally_ isn't an issue, just be
270aware it occurs- especially if the var is referencing a large object
271(thus keeping it in memory).
272
273So... for loops leak, list comps leak, dependant on your except
274clause, can leak also.
275
276Do not go overboard with this though. If your function will exit soon
277do not bother cleaning up variables by hand. If the "leaking" things
278are small do not bother either.
279
280The current code deletes exception instances explicitly much more
281often than it should since this was believed to clean up the traceback
282object. This does not work: the only thing "del e" frees up is the
283exception instance and the arguments passed to its constructor. "del
284e" also takes a small amount of time to run (clearing up all locals
285when the function exits is faster).
286
287Unless you need to generate (and save) a range result, use xrange.
288==================================================================
289
290python -m timeit 'for x in range(10000): pass'
291100 loops, best of 3: 2.01 msec per loop
292
293$ python -m timeit 'for x in xrange(10000): pass'
2941000 loops, best of 3: 1.69 msec per loop
295
296Removals from a list aren't cheap, especially left most.
297========================================================
298
299If you _do_ need to do left most removals, deque module is your friend.
300
301Rightmost ain't all that cheap either, depending on what idiocy folks
302come up with to try and 'help' the interpretter::
303
304  python -m timeit $'l=range(1000);i=0;\nwhile i < len(l):\n\tif l[i]!="asdf":del l[i]\n\telse:i+=1'
305  100 loops, best of 3: 4.12 msec per loop
306
307  python -m timeit $'l=range(1000);\nfor i in xrange(len(l)-1,-1,-1):\n\tif l[i]!="asdf":del l[i]'
308  100 loops, best of 3: 3 msec per loop
309
310  python -m timeit 'l=range(1000);l=[x for x in l if x == "asdf"]'
311  1000 loops, best of 3: 1 msec per loop
312
313Granted, that's worst case, but worst case is usually where folks get bit.
314(best case still is faster for list comp btw).
315
316Related note, don't pop unless you have a reason to.
317
318If you're testing for None specifically, be aware of the 'is' operator.
319=======================================================================
320
321Is avoids the equality protocol, and does a straight ptr comparison::
322
323  python -m timeit '10000000 != None'
324  1000000 loops, best of 3: 0.721 usec per loop
325
326  $ python -m timeit '10000000 is not None'
327  1000000 loops, best of 3: 0.343 usec per loop
328
329
330Note that we're specificially forcing a large int; using 1 under 2.5 is the
331same runtime, reason being that it defaults to an identity check, then a
332comparison; for small ints, python uses singletons, thus identity kicks in.
333
334Deprecated/crappy modules
335=========================
336
337- Don't use types module. Use isinstance (this isn't a speed reason,
338  types sucks).
339- Don't use strings module. There are exceptions, but use string
340  methods when available.
341- Don't use stat module just to get a stat attribute- fex::
342    import stats
343    l=os.stat("asdf")[stat.ST_MODE]
344
345    # can be done as (and a bit cleaner)
346    l=os.stat("asdf").st_mode
347
348
349Know the exceptions that are thrown, and catch just those you're interested in.
350===============================================================================
351
352::
353
354  try:
355      blah
356  except Exception:
357      blah2
358
359^^^ major issue here. It catches SystemExit exceptions (trigger by
360keyboard interupts); meaning this code, which was just crappy
361exception handling now swallows ctrl+c (meaning it now screws with UI
362code).
363
364Catch what you're interested in *only*.
365
366tuples versus lists.
367====================
368
369Former is immutable, latter is mutable.
370
371Latter over-allocates (cpython thing), meaning it takes up more memory
372then is used (this is actually a good thing usually).
373
374If you're generating/storing a lot of sequences that shouldn't be
375modified, use tuples. Cheaper in memory, and folks can reference the
376tuple directly without concerns of it being mutated elsewhere.
377
378Using lists there however would require each consumer to copy the list
379to protect themselves from mutation. So... over-allocation +
380allocating a new list for each consumer.
381
382Bad, mm'kay.
383
384for immutable instances (tuples/strings fex), trying to copy them is dumb
385=========================================================================
386
387fex: copy.copy((1,2,3)) is dumb; nobody makes a mistake that obvious,
388but in larger code folks do (folks even try using [:] to copy a
389string; it returns the same string since it's immutable).
390
391Can't modify them, thus there is no point in trying to make copies of them.
392
393
394__del__ methods mes with garbage collection
395===========================================
396
397__del__ methods have the annoying side affect of blocking garbage
398collection when that instance is involved in a cycle- basically, the
399interpretter doesn't know what __del__ is going to reference, so its
400unknowable (general case) how to break the cycle.
401
402So... if you're using __del__ methods, make sure the instance doesn't
403wind up in a cycle (whether careful data structs, or weakref usage).
404
405(General) python isn't slow, your algorithm is.
406===============================================
407
408::
409
410  l = []
411  for x in data_generator():
412        if x not in l:
413                l.append(x)
414
415That code is _best_ case O(1) (yielding all 0's fex). Worst case is
416O(N^2).
417
418::
419
420  l=set()
421  for x in data_generator():
422      if x not in l:
423          l.add(x)
424
425Best/Worst are now constant (not quite due to potential expansion of
426the set internally, but that's ignorable in this case).
427
428Further, the first loop actually invokes the __eq__ protocol for x for
429each element, which can potentially be *quite* slow if dealing in
430complex objs.
431
432The second loop invokes __hash__ once on x instead.  Seeing the gain?
433
434Technically, second loop still is a bit innefficient::
435
436  l=set(data_generator())
437
438being simpler and faster.
439
440example data for folks who don't see how _bad_ this can get::
441
442  python -m timeit $'l=[]\nfor x in xrange(1000):\n\tif x not in l:l.append(x)'
443  10 loops, best of 3: 74.4 msec per loop
444
445  python -m timeit $'l=set()\nfor x in xrange(1000):\n\tif x not in l:l.add(x)'
446  1000 loops, best of 3: 1.24 msec per loop
447
448  python -m timeit 'l=set(xrange(1000))'
449  1000 loops, best of 3: 278 usec per loop
450
451Bit of a difference, no?
452
453This does _not_ mean that sets are automatically/better everywhere,
454just be aware of what you're doing- for a single search of a range
455(fex), the overhead of is far slower then a linear search. Kind of a
456'duh', but folks do do this sometimes::
457
458  python -m timeit -s 'l=range(50)' $'if 1001 in set(l): pass'
459  100000 loops, best of 3: 12.2 usec per loop
460
461  python -m timeit -s 'l=range(50)' $'if 1001 in l: pass'
462  10000 loops, best of 3: 7.68 usec per loop
463
464What's up with __hash__ and dicts
465=================================
466
467A bunch of things (too many things most likely) in the codebase define
468__hash__. The rule for __hash__ is (quoted from
469http://docs.python.org/ref/customization.html):
470
471 Should return a 32-bit integer usable as a hash value for dictionary
472 operations. The only required property is that objects which compare
473 equal have the same hash value.
474
475Quick/rough explanation for folks who do not know how a "dict" works
476internally:
477
478- Things added to it are dumped in a "bucket" depending on their hash
479  value.
480- To check if something is in the dict it first determines the bucket
481  to check (based on hash value), then does equality checks (__cmp__
482  or __eq__ if there is one, object identity comparison otherwise) for
483  everything in the bucket (if there is anything).
484
485So what does this mean?
486
487- There's no reason at all to define your own __hash__ unless you also
488  define __eq__ or __cmp__. Behaviour of your object in dicts/sets
489  will not change, it will just be slower (since your own __hash__ is
490  almost certainly slower than the default one).
491- If you define __eq__ or __cmp__ and want your object to be usable in
492  a dict you have to define __hash__. If you don't the default
493  __hash__ is used which means your objects act in dicts like only
494  object identity matters *until* you hit a hash collision and your
495  own __eq__ or __cmp__ kicks in.
496- If you do define your own __hash__ it has to produce the same value
497  for objects that compare equal, or you get *really* weird behaviour
498  in dicts/sets ("thing in dict" returning False because the hash
499  values differ while "thing in dict.keys()" returns True because that
500  does not use the hash value, only does equality checks).
501- If the hash value changes after the object was put in a dict you get
502  weird behaviour too ("s=set([thing]); thing.change_hash();thing in s"
503  is False, but "thing in list(s)" is True). So if your objects are
504  mutable they can usually provide __eq__/__cmp__ but not __hash__.
505- Not having many hash "collisions" (same hash value for objects that
506  compare nonequal) is good, but collisions are not illegal. Too many
507  of them just slow down dict/set operations (in a worst case scenario
508  of the same hash for every object dict/set operations become linear
509  searches through the single hash bucket everything ends up in).
510- If you use the hash value directly keep in mind that collisions are
511  legal. Do not use comparisons of hash values as a substitute for
512  comparing objects (implementing __eq__ / __cmp__). Probably the only
513  legitimate use of hash() is to determine an object's hash value
514  based on things used for comparison.
515
516
517__eq__ and __ne__
518=================
519
520From http://docs.python.org/ref/customization.html:
521
522  There are no implied relationships among the comparison operators.
523  The truth of x==y does not imply that x!=y is false. Accordingly,
524  when defining __eq__(), one should also define __ne__() so that the
525  operators will behave as expected.
526
527They really mean that. If you define __eq__ but not __ne__ doing "!="
528on instances compares them by identity. This is surprisingly easy to
529miss, especially since the natural way to write unit tests for classes
530with custom comparisons goes like this::
531
532  self.assertEquals(YourClass(1), YourClass(1))
533  # Repeat for more possible values. Uses == and therefore __eq__,
534  # behaves as expected.
535  self.assertNotEquals(YourClass(1), YourClass(2))
536  # Repeat for more possible values. Uses != and therefore object
537  # identity, so they all pass (all different instances)!
538
539So you end up only testing __eq__ on equal values (it can say
540"identical" for different values without you noticing).
541
542Adding a __ne__ that just does "return not self == other" fixes this.
543
544
545__eq__/__hash__ and subclassing
546===============================
547
548If your class has a custom __eq__ and it might be subclassed you have
549to be very careful about how you "compare" to instances of a subclass.
550Usually you will want to be "different" from those unconditionally::
551
552  def __eq__(self, other):
553      if self.__class is not YourClass or other.__class__ is not YourClass:
554              return False
555          # Your actual code goes here
556
557This might seem like overkill, but is necessary to avoid problems if
558you are subclassed and the subclass does not have a new __eq__. If you
559just do an "isinstance(other, self.__class__)" check you will compare
560equal to instances of a subclass, which is usually not what you want.
561If you just check for "self.__class__ is other.__class__" then
562subclasses that add a new attribute without overriding __eq__ will
563compare equal when they should not (because the new attribute
564differs).
565
566If you subclass something that has an __eq__ you should most likely
567override it (you might get away with not doing so if the class does
568not do the type check demonstrated above). If you add a new attribute
569don't forget to override __hash__ too (that is not critical, but you
570will have unnecessary hash collisions if you forget it).
571
572This is especially important for pkgcore because of
573pkgcore.util.caching. If an instance of a class with a broken __eq__
574is used as argument for the __init__ of a class that uses
575caching.WeakInstMeta it will cause a cached instance to be used when
576it should not. Notice the class with the broken __eq__ does not have
577to be cached itself to trigger this! Getting this wrong can cause fun
578behaviour like atoms showing up in the list of fetchables because the
579restrictions they're in compare equal independent of their "payload".
580
581
582Exception subclassing
583=====================
584
585It is pretty common for an Exception subclass to want to customize the
586return value of str() on an instance. The easiest way to do that is::
587
588  class MyException(Exception):
589
590      """Describe when it is raised here."""
591
592      def __init__(self, stuff):
593          Exception.__init__(self, 'MyException because of %s' % (stuff,))
594
595This is usually easier than defining a custom __str__ (since you do
596not have to store the value of "stuff" as an attribute) and you should
597be calling the base class __init__ anyway.
598
599(This does not mean you should never store things like "stuff" as
600attrs: it can be very useful for code catching the exception to have
601access to it. Use common sense.)
Note: See TracBrowser for help on using the browser.