root/releases/pkgcore/0.1.1/dev-notes/hacking.rst @ marienz%2540gentoo.org-20060919144405-7abaf990f219a7e6

Revision marienz%2540gentoo.org-20060919144405-7abaf990f219a7e6, 20.5 KB (checked in by Marien Zwart <marienz@…>, 2 years ago)

More formatting.

Line 
1=========================
2 python code guidelines.
3=========================
4
5lingo:
6
7fex
8  for example
9imo
10  in my opinion; 'my' refering to the author
11
12Note not all of the existing code follows this.
13Doesn't mean existing code is correct however ;)
14
15Stats are all from a sempron 1.6ghz with python 2.4.2
16
17Finally, code _should_ be documented, following epydoc/epytext guidelines
18
19Follow pep8, with following exemptions
20======================================
21
22- <80 char limit is only applicable where it doesn't make the logic
23  ugly. This is not an excuse to have a 200 char if statement (fix
24  your logic). Use common sense.
25- combining imports is ok.
26- use absolute imports
27- _simple_ try/except combined lines are acceptable, but not forced
28  (your call). example::
29
30   try: l.remove(blah)
31   except IndexError: pass
32
33- for comments. 2 spaces trailing is pointless- not needed.
34- classes should be named SomeClass, functions/methods should be named
35  some_func.
36- Exceptions are classes.  Don't raise strings either.
37- avoid __var 'private' attributes unless you absolutely have a reason
38  to hide it, and the class won't be inherited (or that attribute
39  must _not_ be accessed)
40- using string module functions when you could use a string method is
41  evil. Don't do it.
42- use isinstance(str_instance, basestring) unless you _really_ need to
43   know if it's utf8/ascii
44
45Throw self with a NotImplementedError
46=====================================
47
48Reason for this is simple; if you just throw a NotImplementedError, you can't
49tell how the path was hit if derivative classes are involved; thus throw
50NotImplementedError(self, string_name_of_attr)
51
52Far better tbs.
53
54Be aware of what the interpretter is actually doing.
55====================================================
56
57Don't use len(list_instance) when you just want to know if it's
58nonempty/empty::
59
60  l=[1]
61  if l: blah
62  # instead of
63  if len(l): blah
64
65python looks for __nonzero__, then __len__. It's a fair sight faster
66then if you try to be explicit there::
67
68  python -m timeit -s 'l=[]' 'if len(l) > 0: pass'
69  1000000 loops, best of 3: 0.705 usec per loop
70
71  python -m timeit -s 'l=[]' 'if len(l): pass'
72  1000000 loops, best of 3: 0.689 usec per loop
73
74  python -m timeit -s 'l=[]' 'if l: pass'
75  1000000 loops, best of 3: 0.302 usec per loop
76
77don't explicitly use has_key.  rely on the 'in' operator.
78=========================================================
79
80::
81
82  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' 'd.has_key(1999999)'
83  1000000 loops, best of 3: 0.512 usec per loop
84
85  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' '1999999 in d'
86  1000000 loops, best of 3: 0.279 usec per loop
87
88Python interprets the 'in' command by using __contains__ on the
89instance. The interpretter is faster at doing getattr's then actual
90python code is; fex, the code above uses d.__contains__ , if you do
91d.has_key or d.__contains__, it's the same speed. Using 'in' instead
92has the interpretter do the lookup, and is faster.
93
94So... be aware of how the interpretter will execute that code. Python
95code specified attribute access is slower then the interpretter doing
96it on it's own.
97
98If in doubt, python -m timeit is your friend. ;-)
99
100Do not use [] or {} as default args in function/method definitions.
101===================================================================
102
103::
104
105  >>> def f(default=[]):
106  >>>   default.append(1)
107  >>>   return default
108  >>> print f()
109  [1]
110  >>> print f()
111  [1,1]
112
113When the function/class/method is defined, the default args are
114instantiated _then_, not per call. End result is that if it's a
115mutable default arg, you should use None and test; this is exempted if
116you _know_ the code doesn't mangle the default.
117
118Visible curried functions should have documentation.
119====================================================
120
121When using the currying methods (pkgcore.util.currying) for function
122mangling, preserve the documentation via pretty_docs.
123
124If this is exempted, pydoc output for objects isn't incredibly useful.
125
126unit testing.
127=============
128
129All code _should_ have test case functionality.  We use twisted.trial; should
130be running >=2.2 (<2.2 results in false positives in the spawn tests).
131Regressions should be test cased, exempting idiot mistakes (typos fex).
132
133More then willing to look at code that lacks tests, but
134merging/integrating the code requires tests.
135
136One area that is (atm) exempted from this is the ebuild interaction;
137testing that interface is extremely hard, although it _does_ need to
138be implemented.
139
140If tests are missing from code (author didn't write tests initially),
141new tests desired. :)
142
143
144If it's FS related code, it's _usually_ cheaper to try then to ask then try.
145============================================================================
146
147...but you should verify it ;)
148
149
150existing file (but empty to avoid reading overhead)::
151
152  echo > dar
153
154  python -m 'timeit' -s 'import os' 'os.path.exists("dar") and open("dar").read()'
155  10000 loops, best of 3: 36.4 usec per loop
156
157  python -m 'timeit' -s 'import os' $'try:open("dar").read()\nexcept IOError: pass'
158  10000 loops, best of 3: 22 usec per loop
159
160nonexistant file::
161
162  rm foo
163
164  python -m 'timeit' -s 'import os' 'os.path.exists("foo") and open("foo").read()'
165  10000 loops, best of 3: 29.8 usec per loop
166
167  python -m 'timeit' -s 'import os' $'try:open("foo").read()\nexcept IOError: pass'
168  10000 loops, best of 3: 27.7 usec per loop
169
170Bit of a difference, no?
171
172Note that I qualified this with "If it's FS related code"; syscalls
173aren't cheap- if it's not triggering syscalls, next section is
174relevant.
175
176Catching Exceptions in python code (rather then cpython) isn't cheap.
177=====================================================================
178
179stats from python-2.4.2
180
181When an exception is caught::
182
183  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'try: d[1999]\nexcept KeyError: pass'
184  100000 loops, best of 3: 8.7 usec per loop
185
186  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'1999 in d and d[1999]'
187  1000000 loops, best of 3: 0.492 usec per loop
188
189When no exception is caught, overhead of try/except setup::
190
191  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'try: d[0]\nexcept KeyError: pass'
192  1000000 loops, best of 3: 0.532 usec per loop
193
194  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'d[0]'
195  1000000 loops, best of 3: 0.407 usec per loop
196
197
198Not advocating writing code that doesn't protect itself- just be aware
199of what the code is actually doing, and be aware that exceptions in
200python code are costly due to machinery involved.
201
202Another example is when to use or not to use dict's setdefault or get methods:
203
204key exists::
205
206  # Through exception handling
207  python -m timeit -s 'd=dict.fromkeys(range(100))' 'try: x=d[1]' 'except KeyError: x=42'
208  1000000 loops, best of 3: 0.548 usec per loop
209
210  # d.get
211  python -m timeit -s 'd=dict.fromkeys(range(100))' 'x=d.get(1, 42)'
212  1000000 loops, best of 3: 1.01 usec per loop
213
214
215key doesn't exist::
216
217  # Through exception handling
218  python -m timeit -s 'd=dict.fromkeys(range(100))' 'try: x=d[101]' 'except KeyError: x=42'
219  100000 loops, best of 3: 8.8 usec per loop
220
221  # d.get
222  python -m timeit -s 'd=dict.fromkeys(range(100))' 'x=d.get(101, 42)'
223  1000000 loops, best of 3: 1.05 usec per loop
224
225
226Short version? If you know the key is there, get is slower. If you
227don't, get is your friend. IOW, use it instead of doing a containment
228test then accessing the key.
229
230Of course this only considers the case where the default value is
231simple. If it's something more costly "except" will do relatively
232better since it's not constructing the default value if it's not
233needed. So if in doubt and in a performance-critical piece of code:
234benchmark parts of it with timeit instead of assuming "exceptions are
235slow" or "[] is fast".
236
237cpython 'leaks' vars into local namespace for certain constructs
238================================================================
239
240::
241
242  def f(s):
243      while True:
244          try:
245              some_func_that_throws_exception()
246          except Exception, e:
247              # e exists in this namespace now.
248              pass
249          # some other code here...
250
251from the code above, e bled into the f namespace- that's referenced
252memory that isn't used, and will linger until the while loop exits.
253
254Python _does_ bleed variables into the local namespace- be aware of
255this, and explicitly delete references you don't need when dealing in
256large objs, especially dealing with exceptions::
257
258  class c:
259      d = {}
260      for x in range(1000):
261          d[x] = x
262
263Granted the class above is contrived, but the thing to note is that
264c.x is now valid- the x from the for loop bleeds into the class
265namespace and stays put.
266
267Don't leave uneeded vars lingering in class namespace.
268
269Variables that leak from for loops _normally_ isn't an issue, just be
270aware it occurs- especially if the var is referencing a large object
271(thus keeping it in memory).
272
273So... for loops leak, list comps leak, dependant on your except
274clause, can leak also.
275
276Unless you need to generate (and save) a range result, use xrange.
277==================================================================
278
279python -m timeit 'for x in range(10000): pass'
280100 loops, best of 3: 2.01 msec per loop
281
282$ python -m timeit 'for x in xrange(10000): pass'
2831000 loops, best of 3: 1.69 msec per loop
284
285Removals from a list aren't cheap, especially left most.
286========================================================
287
288If you _do_ need to do left most removals, deque module is your friend.
289
290Rightmost ain't all that cheap either, depending on what idiocy folks
291come up with to try and 'help' the interpretter::
292
293  python -m timeit $'l=range(1000);i=0;\nwhile i < len(l):\n\tif l[i]!="asdf":del l[i]\n\telse:i+=1'
294  100 loops, best of 3: 4.12 msec per loop
295
296  python -m timeit $'l=range(1000);\nfor i in xrange(len(l)-1,-1,-1):\n\tif l[i]!="asdf":del l[i]'
297  100 loops, best of 3: 3 msec per loop
298
299  python -m timeit 'l=range(1000);l=[x for x in l if x == "asdf"]'
300  1000 loops, best of 3: 1 msec per loop
301
302Granted, that's worst case, but worst case is usually where folks get bit.
303(best case still is faster for list comp btw).
304
305Related note, don't pop unless you have a reason to.
306
307If you're testing for None specifically, be aware of the 'is' operator.
308=======================================================================
309
310Is avoids the equality protocol, and does a straight ptr comparison::
311
312  python -m timeit '1 != None'
313  1000000 loops, best of 3: 0.721 usec per loop
314
315  $ python -m timeit '1 is not None'
316  1000000 loops, best of 3: 0.343 usec per loop
317
318Deprecated/crappy modules
319=========================
320
321- Don't use types module. Use isinstance (this isn't a speed reason,
322  types sucks).
323- Don't use strings module. There are exceptions, but use string
324  methods when available.
325- Don't use stat module just to get a stat attribute- fex::
326    import stats
327    l=os.stat("asdf")[stat.ST_MODE]
328
329    # can be done as (and a bit cleaner)
330    l=os.stat("asdf").st_mode
331
332
333Know the exceptions that are thrown, and catch just those you're interested in.
334===============================================================================
335
336::
337
338  try:
339      blah
340  except Exception:
341      blah2
342
343^^^ major issue here. It catches SystemExit exceptions (trigger by
344keyboard interupts); meaning this code, which was just crappy
345exception handling now swallows ctrl+c (meaning it now screws with UI
346code).
347
348Catch what you're interested in *only*.
349
350tuples versus lists.
351====================
352
353Former is immutable, latter is mutable.
354
355Latter over-allocates (cpython thing), meaning it takes up more memory
356then is used (this is actually a good thing usually).
357
358If you're generating/storing a lot of sequences that shouldn't be
359modified, use tuples. Cheaper in memory, and folks can reference the
360tuple directly without concerns of it being mutated elsewhere.
361
362Using lists there however would require each consumer to copy the list
363to protect themselves from mutation. So... over-allocation +
364allocating a new list for each consumer.
365
366Bad, mm'kay.
367
368for immutable instances (tuples/strings fex), trying to copy them is dumb
369=========================================================================
370
371fex: copy.copy((1,2,3)) is dumb; nobody makes a mistake that obvious,
372but in larger code folks do (folks even try using [:] to copy a
373string; it returns the same string since it's immutable).
374
375Can't modify them, thus there is no point in trying to make copies of them.
376
377
378__del__ methods mes with garbage collection
379===========================================
380
381__del__ methods have the annoying side affect of blocking garbage
382collection when that instance is involved in a cycle- basically, the
383interpretter doesn't know what __del__ is going to reference, so its
384unknowable (general case) how to break the cycle.
385
386So... if you're using __del__ methods, make sure the instance doesn't
387wind up in a cycle (whether careful data structs, or weakref usage).
388
389(General) python isn't slow, your algorithm is.
390===============================================
391
392::
393
394  l = []
395  for x in data_generator():
396        if x not in l:
397                l.append(x)
398
399That code is _best_ case O(1) (yielding all 0's fex). Worst case is
400O(N^2).
401
402::
403
404  l=set()
405  for x in data_generator():
406      if x not in l:
407          l.add(x)
408
409Best/Worst are now constant (not quite due to potential expansion of
410the set internally, but that's ignorable in this case).
411
412Further, the first loop actually invokes the __eq__ protocol for x for
413each element, which can potentially be *quite* slow if dealing in
414complex objs.
415
416The second loop invokes __hash__ once on x instead.  Seeing the gain?
417
418Technically, second loop still is a bit innefficient::
419
420  l=set(data_generator())
421
422being simpler and faster.
423
424example data for folks who don't see how _bad_ this can get::
425
426  python -m timeit $'l=[]\nfor x in xrange(1000):\n\tif x not in l:l.append(x)'
427  10 loops, best of 3: 74.4 msec per loop
428
429  python -m timeit $'l=set()\nfor x in xrange(1000):\n\tif x not in l:l.add(x)'
430  1000 loops, best of 3: 1.24 msec per loop
431
432  python -m timeit 'l=set(xrange(1000))'
433  1000 loops, best of 3: 278 usec per loop
434
435Bit of a difference, no?
436
437This does _not_ mean that sets are automatically/better everywhere,
438just be aware of what you're doing- for a single search of a range
439(fex), the overhead of is far slower then a linear search. Kind of a
440'duh', but folks do do this sometimes::
441
442  python -m timeit -s 'l=range(50)' $'if 1001 in set(l): pass'
443  100000 loops, best of 3: 12.2 usec per loop
444
445  python -m timeit -s 'l=range(50)' $'if 1001 in l: pass'
446  10000 loops, best of 3: 7.68 usec per loop
447
448What's up with __hash__ and dicts
449=================================
450
451A bunch of things (too many things most likely) in the codebase define
452__hash__. The rule for __hash__ is (quoted from
453http://docs.python.org/ref/customization.html):
454
455 Should return a 32-bit integer usable as a hash value for dictionary
456 operations. The only required property is that objects which compare
457 equal have the same hash value.
458
459Quick/rough explanation for folks who do not know how a "dict" works
460internally:
461
462- Things added to it are dumped in a "bucket" depending on their hash
463  value.
464- To check if something is in the dict it first determines the bucket
465  to check (based on hash value), then does equality checks (__cmp__
466  or __eq__ if there is one, object identity comparison otherwise) for
467  everything in the bucket (if there is anything).
468
469So what does this mean?
470
471- There's no reason at all to define your own __hash__ unless you also
472  define __eq__ or __cmp__. Behaviour of your object in dicts/sets
473  will not change, it will just be slower (since your own __hash__ is
474  almost certainly slower than the default one).
475- If you define __eq__ or __cmp__ and want your object to be usable in
476  a dict you have to define __hash__. If you don't the default
477  __hash__ is used which means your objects act in dicts like only
478  object identity matters *until* you hit a hash collision and your
479  own __eq__ or __cmp__ kicks in.
480- If you do define your own __hash__ it has to produce the same value
481  for objects that compare equal, or you get *really* weird behaviour
482  in dicts/sets ("thing in dict" returning False because the hash
483  values differ while "thing in dict.keys()" returns True because that
484  does not use the hash value, only does equality checks).
485- If the hash value changes after the object was put in a dict you get
486  weird behaviour too ("s=set([thing]); thing.change_hash();thing in s"
487  is False, but "thing in list(s)" is True). So if your objects are
488  mutable they can usually provide __eq__/__cmp__ but not __hash__.
489- Not having many hash "collisions" (same hash value for objects that
490  compare nonequal) is good, but collisions are not illegal. Too many
491  of them just slow down dict/set operations (in a worst case scenario
492  of the same hash for every object dict/set operations become linear
493  searches through the single hash bucket everything ends up in).
494- If you use the hash value directly keep in mind that collisions are
495  legal. Do not use comparisons of hash values as a substitute for
496  comparing objects (implementing __eq__ / __cmp__). Probably the only
497  legitimate use of hash() is to determine an object's hash value
498  based on things used for comparison.
499
500
501__eq__ and __ne__
502=================
503
504From http://docs.python.org/ref/customization.html:
505
506  There are no implied relationships among the comparison operators.
507  The truth of x==y does not imply that x!=y is false. Accordingly,
508  when defining __eq__(), one should also define __ne__() so that the
509  operators will behave as expected.
510
511They really mean that. If you define __eq__ but not __ne__ doing "!="
512on instances compares them by identity. This is surprisingly easy to
513miss, especially since the natural way to write unit tests for classes
514with custom comparisons goes like this::
515
516  self.assertEquals(YourClass(1), YourClass(1))
517  # Repeat for more possible values. Uses == and therefore __eq__,
518  # behaves as expected.
519  self.assertNotEquals(YourClass(1), YourClass(2))
520  # Repeat for more possible values. Uses != and therefore object
521  # identity, so they all pass (all different instances)!
522
523So you end up only testing __eq__ on equal values (it can say
524"identical" for different values without you noticing).
525
526Adding a __ne__ that just does "return not self == other" fixes this.
527
528
529__eq__/__hash__ and subclassing
530===============================
531
532If your class has a custom __eq__ and it might be subclassed you have
533to be very careful about how you "compare" to instances of a subclass.
534Usually you will want to be "different" from those unconditionally::
535
536  def __eq__(self, other):
537      if self.__class is not YourClass or other.__class__ is not YourClass:
538              return False
539          # Your actual code goes here
540
541This might seem like overkill, but is necessary to avoid problems if
542you are subclassed and the subclass does not have a new __eq__. If you
543just do an "isinstance(other, self.__class__)" check you will compare
544equal to instances of a subclass, which is usually not what you want.
545If you just check for "self.__class__ is other.__class__" then
546subclasses that add a new attribute without overriding __eq__ will
547compare equal when they should not (because the new attribute
548differs).
549
550If you subclass something that has an __eq__ you should most likely
551override it (you might get away with not doing so if the class does
552not do the type check demonstrated above). If you add a new attribute
553don't forget to override __hash__ too (that is not critical, but you
554will have unnecessary hash collisions if you forget it).
555
556This is especially important for pkgcore because of
557pkgcore.util.caching. If an instance of a class with a broken __eq__
558is used as argument for the __init__ of a class that uses
559caching.WeakInstMeta it will cause a cached instance to be used when
560it should not. Notice the class with the broken __eq__ does not have
561to be cached itself to trigger this! Getting this wrong can cause fun
562behaviour like atoms showing up in the list of fetchables because the
563restrictions they're in compare equal independent of their "payload".
564
565
566Exception subclassing
567=====================
568
569It is pretty common for an Exception subclass to want to customize the
570return value of str() on an instance. The easiest way to do that is::
571
572  class MyException(Exception):
573
574      """Describe when it is raised here."""
575
576      def __init__(self, stuff):
577          Exception.__init__(self, 'MyException because of %s' % (stuff,))
578
579This is usually easier than defining a custom __str__ (since you do
580not have to store the value of "stuff" as an attribute) and you should
581be calling the base class __init__ anyway.
582
583(This does not mean you should never store things like "stuff" as
584attrs: it can be very useful for code catching the exception to have
585access to it. Use common sense.)
Note: See TracBrowser for help on using the browser.