root/releases/pkgcore/0.2.14/dev-notes/hacking.rst @ masterdriverz%2540gentoo.org-20070309221744-6u30xw5rb8n13cjj

Revision masterdriverz%2540gentoo.org-20070309221744-6u30xw5rb8n13cjj, 21.5 KB (checked in by Charlie Shepherd <masterdriverz@…>, 22 months ago)

Whitespace fixes

Line 
1========================
2 Python Code Guidelines
3========================
4
5Note that not all of the existing code follows this style guide.
6This doesn't mean existing code is correct.
7
8Stats are all from a sempron 1.6Ghz with python 2.4.2.
9
10Finally, code _should_ be documented, following epydoc/epytext guidelines
11
12Follow pep8, with following exemptions
13======================================
14
15- <80 char limit is only applicable where it doesn't make the logic
16  ugly. This is not an excuse to have a 200 char if statement (fix
17  your logic). Use common sense.
18- Combining imports is ok.
19- Use absolute imports
20- _Simple_ try/except combined lines are acceptable, but not forced
21  (this is your call). example::
22
23   try: l.remove(blah)
24   except IndexError: pass
25
26- For comments, 2 spaces trailing is pointless- not needed.
27- Classes should be named SomeClass, functions/methods should be named
28  some_func.
29- Exceptions are classes.  Don't raise strings.
30- Avoid __var 'private' attributes unless you absolutely have a reason
31  to hide it, and the class won't be inherited (or that attribute
32  must _not_ be accessed)
33- Using string module functions when you could use a string method is
34  evil. Don't do it.
35- Use isinstance(str_instance, basestring) unless you _really_ need to
36   know if it's utf8/ascii
37
38Throw self with a NotImplementedError
39=====================================
40
41The reason for this is simple: if you just throw a NotImplementedError,
42you can't tell how the path was hit if derivative classes are involved;
43thus throw NotImplementedError(self, string_name_of_attr)
44
45This gives far better tracebacks.
46
47Be aware of what the interpreter is actually doing
48==================================================
49
50Don't use len(list_instance) when you just want to know if it's
51nonempty/empty::
52
53  l=[1]
54  if l: blah
55  # instead of
56  if len(l): blah
57
58python looks for __nonzero__, then __len__. It's far faster
59than if you try to be explicit there::
60
61  python -m timeit -s 'l=[]' 'if len(l) > 0: pass'
62  1000000 loops, best of 3: 0.705 usec per loop
63
64  python -m timeit -s 'l=[]' 'if len(l): pass'
65  1000000 loops, best of 3: 0.689 usec per loop
66
67  python -m timeit -s 'l=[]' 'if l: pass'
68  1000000 loops, best of 3: 0.302 usec per loop
69
70Don't explicitly use has_key. Rely on the 'in' operator
71=======================================================
72
73::
74
75  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' 'd.has_key(1999999)'
76  1000000 loops, best of 3: 0.512 usec per loop
77
78  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' '1999999 in d'
79  1000000 loops, best of 3: 0.279 usec per loop
80
81Python interprets the 'in' command by using __contains__ on the
82instance. The interpreter is faster at doing getattr's than actual
83python code is: for example, the code above uses d.__contains__ , if you do
84d.has_key or d.__contains__, it's the same speed. Using 'in' is faster
85because it has the interpreter do the lookup.
86
87So be aware of how the interpreter will execute that code. Python
88code specified attribute access is slower then the interpreter doing
89it on its own.
90
91If you're in doubt, python -m timeit is your friend. ;-)
92
93Do not use [] or {} as default args in function/method definitions
94==================================================================
95
96::
97
98  >>> def f(default=[]):
99  >>>   default.append(1)
100  >>>   return default
101  >>> print f()
102  [1]
103  >>> print f()
104  [1,1]
105
106When the function/class/method is defined, the default args are
107instantiated _then_, not per call. The end result of this is that if it's a
108mutable default arg, you should use None and test for it being None; this is
109exempted if you _know_ the code doesn't mangle the default.
110
111Visible curried functions should have documentation
112===================================================
113
114When using the currying methods (pkgcore.util.currying) for function
115mangling, preserve the documentation via pretty_docs.
116
117If this is exempted, pydoc output for objects isn't incredibly useful.
118
119Unit testing
120============
121
122All code _should_ have test case functionality.  We use twisted.trial - you
123should be running >=2.2 (<2.2 results in false positives in the spawn tests).
124Regressions should be test cased, exempting idiot mistakes (e.g, typos).
125
126We are more than willing to look at code that lacks tests, but
127actually merging the code to integration requires that it has tests.
128
129One area that is (at the moment) exempted from this is the ebuild interaction;
130testing that interface is extremely hard, although it _does_ need to
131be implemented.
132
133If tests are missing from code (due to tests not being written initially),
134new tests are always desired.
135
136
137If it's FS related code, it's _usually_ cheaper to try then to ask then try
138===========================================================================
139
140...but you should verify it ;)
141
142
143existing file (but empty to avoid reading overhead)::
144
145  echo > dar
146
147  python -m 'timeit' -s 'import os' 'os.path.exists("dar") and open("dar").read()'
148  10000 loops, best of 3: 36.4 usec per loop
149
150  python -m 'timeit' -s 'import os' $'try:open("dar").read()\nexcept IOError: pass'
151  10000 loops, best of 3: 22 usec per loop
152
153nonexistant file::
154
155  rm foo
156
157  python -m 'timeit' -s 'import os' 'os.path.exists("foo") and open("foo").read()'
158  10000 loops, best of 3: 29.8 usec per loop
159
160  python -m 'timeit' -s 'import os' $'try:open("foo").read()\nexcept IOError: pass'
161  10000 loops, best of 3: 27.7 usec per loop
162
163As you can see, there is a bit of a difference. :)
164
165Note that this was qualified with "If it's FS related code"; syscalls
166are not cheap- if it's not triggering syscalls, the next section is
167relevant.
168
169Catching Exceptions in python code (rather then cpython) isn't cheap
170====================================================================
171
172stats from python-2.4.2
173
174When an exception is caught::
175
176  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'try: d[1999]\nexcept KeyError: pass'
177  100000 loops, best of 3: 8.7 usec per loop
178
179  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'1999 in d and d[1999]'
180  1000000 loops, best of 3: 0.492 usec per loop
181
182When no exception is caught, overhead of try/except setup::
183
184  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'try: d[0]\nexcept KeyError: pass'
185  1000000 loops, best of 3: 0.532 usec per loop
186
187  python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'d[0]'
188  1000000 loops, best of 3: 0.407 usec per loop
189
190
191This doesn't advocate writing code that doesn't protect itself- just be aware
192of what the code is actually doing, and be aware that exceptions in
193python code are costly due to the machinery involved.
194
195Another example is when to use or not to use dict's setdefault or get methods:
196
197key exists::
198
199  # Through exception handling
200  python -m timeit -s 'd=dict.fromkeys(range(100))' 'try: x=d[1]' 'except KeyError: x=42'
201  1000000 loops, best of 3: 0.548 usec per loop
202
203  # d.get
204  python -m timeit -s 'd=dict.fromkeys(range(100))' 'x=d.get(1, 42)'
205  1000000 loops, best of 3: 1.01 usec per loop
206
207
208key doesn't exist::
209
210  # Through exception handling
211  python -m timeit -s 'd=dict.fromkeys(range(100))' 'try: x=d[101]' 'except KeyError: x=42'
212  100000 loops, best of 3: 8.8 usec per loop
213
214  # d.get
215  python -m timeit -s 'd=dict.fromkeys(range(100))' 'x=d.get(101, 42)'
216  1000000 loops, best of 3: 1.05 usec per loop
217
218
219The short version of this is: if you know the key is there, dict.get()
220is slower. If you don't, get is your friend. In other words, use it
221instead of doing a containment test and then accessing the key.
222
223Of course this only considers the case where the default value is
224simple. If it's something more costly "except" will do relatively
225better since it's not constructing the default value if it's not
226needed. So if in doubt and in a performance-critical piece of code:
227benchmark parts of it with timeit instead of assuming "exceptions are
228slow" or "[] is fast".
229
230cpython 'leaks' vars into local namespace for certain constructs
231================================================================
232
233::
234
235  def f(s):
236      while True:
237          try:
238              some_func_that_throws_exception()
239          except Exception, e:
240              # e exists in this namespace now.
241              pass
242          # some other code here...
243
244From the code above, e bled into the f namespace- that's referenced
245memory that isn't used, and will linger until the while loop exits.
246
247Python _does_ bleed variables into the local namespace- be aware of
248this, and explicitly delete references you don't need when dealing in
249large objs, especially dealing with exceptions::
250
251  class c:
252      d = {}
253      for x in range(1000):
254          d[x] = x
255
256While the class above is contrived, the thing to note is that
257c.x is now valid- the x from the for loop bleeds into the class
258namespace and stays put.
259
260Don't leave uneeded vars lingering in class namespace.
261
262Variables that leak from for loops _normally_ aren't an issue, just be
263aware it does occur- especially if the var is referencing a large object
264(thus keeping it in memory).
265
266So... for loops leak, list comps leak, dependant on your except
267clause they can also leak.
268
269Do not go overboard with this though. If your function will exit soon
270do not bother cleaning up variables by hand. If the "leaking" things
271are small do not bother either.
272
273The current code deletes exception instances explicitly much more
274often than it should since this was believed to clean up the traceback
275object. This does not work: the only thing "del e" frees up is the
276exception instance and the arguments passed to its constructor. "del
277e" also takes a small amount of time to run (clearing up all locals
278when the function exits is faster).
279
280Unless you need to generate (and save) a range result, use xrange
281=================================================================
282
283::
284  python -m timeit 'for x in range(10000): pass'
285  100 loops, best of 3: 2.01 msec per loop
286
287  $ python -m timeit 'for x in xrange(10000): pass'
288  1000 loops, best of 3: 1.69 msec per loop
289
290Removals from a list aren't cheap, especially left most
291=======================================================
292
293If you _do_ need to do left most removals, the deque module is your friend.
294
295Rightmost removals aren't too cheap either, depending on what idiocy people
296come up with to try and 'help' the interpreter::
297
298  python -m timeit $'l=range(1000);i=0;\nwhile i < len(l):\n\tif l[i]!="asdf":del l[i]\n\telse:i+=1'
299  100 loops, best of 3: 4.12 msec per loop
300
301  python -m timeit $'l=range(1000);\nfor i in xrange(len(l)-1,-1,-1):\n\tif l[i]!="asdf":del l[i]'
302  100 loops, best of 3: 3 msec per loop
303
304  python -m timeit 'l=range(1000);l=[x for x in l if x == "asdf"]'
305  1000 loops, best of 3: 1 msec per loop
306
307Granted, that's worst case, but the worst case is usually where people
308get bitten (note the best case still is faster for list comprehension).
309
310On a related note, don't pop() unless you have a reason to.
311
312If you're testing for None specifically, be aware of the 'is' operator
313======================================================================
314
315Is avoids the equality protocol, and does a straight ptr comparison::
316
317  python -m timeit '10000000 != None'
318  1000000 loops, best of 3: 0.721 usec per loop
319
320  $ python -m timeit '10000000 is not None'
321  1000000 loops, best of 3: 0.343 usec per loop
322
323
324Note that we're specificially forcing a large int; using 1 under 2.5 is the
325same runtime, the reason for this is that it defaults to an identity check,
326then a comparison; for small ints, python uses singletons, thus identity kicks in.
327
328Deprecated/crappy modules
329=========================
330
331- Don't use types module. Use isinstance (this isn't a speed reason,
332  types sucks).
333- Don't use strings module. There are exceptions, but use string
334  methods when available.
335- Don't use stat module just to get a stat attribute- e.g.,::
336    import stats
337    l=os.stat("asdf")[stat.ST_MODE]
338
339    # can be done as (and a bit cleaner)
340    l=os.stat("asdf").st_mode
341
342
343Know the exceptions that are thrown, and catch just those you're interested in
344==============================================================================
345
346::
347
348  try:
349      blah
350  except Exception:
351      blah2
352
353There is a major issue here. It catches SystemExit exceptions (triggered by
354keyboard interupts); meaning this code, which was just bad exception handling
355now swallows Ctrl+c (meaning it now screws with UI code).
356
357Catch what you're interested in *only*.
358
359tuples versus lists.
360====================
361
362The former is immutable, while the latter is mutable.
363
364Lists over-allocate (a cpython thing), meaning it takes up more memory
365then is used (this is actually a good thing usually).
366
367If you're generating/storing a lot of sequences that shouldn't be
368modified, use tuples. They're cheaper in memory, and people can reference
369the tuple directly without being concerned about it being mutated elsewhere.
370
371However, using lists there would require each consumer to copy the list
372to protect themselves from mutation. So... over-allocation +
373allocating a new list for each consumer.
374
375Bad, mm'kay.
376
377Don't try to copy immutable instances (e.g. tuples/strings)
378===========================================================
379
380Example: copy.copy((1,2,3)) is dumb; nobody makes a mistake that obvious,
381but in larger code people do (people even try using [:] to copy a
382string; it returns the same string since it's immutable).
383
384You can't modify them, therefore there is no point in trying to make copies of them.
385
386
387__del__ methods mess with garbage collection
388============================================
389
390__del__ methods have the annoying side affect of blocking garbage
391collection when that instance is involved in a cycle- basically, the
392interpreter doesn't know what __del__ is going to reference, so it's
393unknowable (general case) how to break the cycle.
394
395So... if you're using __del__ methods, make sure the instance doesn't
396wind up in a cycle (whether careful data structs, or weakref usage).
397
398A general point: python isn't slow, your algorithm is
399=====================================================
400
401::
402
403  l = []
404  for x in data_generator():
405      if x not in l:
406          l.append(x)
407
408That code is _best_ case O(1) (e.g., yielding all 0's). The worst case is
409O(N^2).
410
411::
412
413  l=set()
414  for x in data_generator():
415      if x not in l:
416          l.add(x)
417
418Best/Worst are now constant (this isn't strictly true due to the potential
419expansion of the set internally, but that's ignorable in this case).
420
421Furthermore, the first loop actually invokes the __eq__ protocol for x for
422each element, which can potentially be *quite* slow if dealing in
423complex objs.
424
425The second loop invokes __hash__ once on x instead.
426
427Technically, the second loop still is a bit innefficient::
428
429  l=set(data_generator())
430
431is simpler and faster.
432
433An example data for people who don't see how _bad_ this can get::
434
435  python -m timeit $'l=[]\nfor x in xrange(1000):\n\tif x not in l:l.append(x)'
436  10 loops, best of 3: 74.4 msec per loop
437
438  python -m timeit $'l=set()\nfor x in xrange(1000):\n\tif x not in l:l.add(x)'
439  1000 loops, best of 3: 1.24 msec per loop
440
441  python -m timeit 'l=set(xrange(1000))'
442  1000 loops, best of 3: 278 usec per loop
443
444The difference here is obvious.
445
446This does _not_ mean that sets are automatically better everywhere,
447just be aware of what you're doing- for a single search of a range,
448the overhead of is far slower then a linear search. While this may
449seem obvious, people do do this sometimes::
450
451  python -m timeit -s 'l=range(50)' $'if 1001 in set(l): pass'
452  100000 loops, best of 3: 12.2 usec per loop
453
454  python -m timeit -s 'l=range(50)' $'if 1001 in l: pass'
455  10000 loops, best of 3: 7.68 usec per loop
456
457What's up with __hash__ and dicts
458=================================
459
460A bunch of things (too many things most likely) in the codebase define
461__hash__. The rule for __hash__ is (quoted from
462http://docs.python.org/ref/customization.html):
463
464 Should return a 32-bit integer usable as a hash value for dictionary
465 operations. The only required property is that objects which compare
466 equal have the same hash value.
467
468Here's a quick rough explanation for people who do not know how a "dict" works
469internally:
470
471- Things added to it are dumped in a "bucket" depending on their hash
472  value.
473- To check if something is in the dict it first determines the bucket
474  to check (based on hash value), then does equality checks (__cmp__
475  or __eq__ if there is one, otherwise object identity comparison) for
476  everything in the bucket (if there is anything).
477
478So what does this mean?
479
480- There's no reason at all to define your own __hash__ unless you also
481  define __eq__ or __cmp__. The behaviour of your object in dicts/sets
482  will not change, it will just be slower (since your own __hash__ is
483  almost certainly slower than the default one).
484- If you define __eq__ or __cmp__ and want your object to be usable in
485  a dict you have to define __hash__. If you don't the default
486  __hash__ is used which means your objects act in dicts like only
487  object identity matters *until* you hit a hash collision and your
488  own __eq__ or __cmp__ kicks in.
489- If you do define your own __hash__ it has to produce the same value
490  for objects that compare equal, or you get *really* weird behaviour
491  in dicts/sets ("thing in dict" returning False because the hash
492  values differ while "thing in dict.keys()" returns True because that
493  does not use the hash value, only equality checks).
494- If the hash value changes after the object was put in a dict you get
495  weird behaviour too ("s=set([thing]); thing.change_hash();thing in s"
496  is False, but "thing in list(s)" is True). So if your objects are
497  mutable they can usually provide __eq__/__cmp__ but not __hash__.
498- Not having many hash "collisions" (same hash value for objects that
499  compare nonequal) is good, but collisions are not illegal. Too many
500  of them just slow down dict/set operations (in a worst case scenario
501  of the same hash for every object dict/set operations become linear
502  searches through the single hash bucket everything ends up in).
503- If you use the hash value directly keep in mind that collisions are
504  legal. Do not use comparisons of hash values as a substitute for
505  comparing objects (implementing __eq__ / __cmp__). Probably the only
506  legitimate use of hash() is to determine an object's hash value
507  based on things used for comparison.
508
509
510__eq__ and __ne__
511=================
512
513From http://docs.python.org/ref/customization.html:
514
515  There are no implied relationships among the comparison operators.
516  The truth of x==y does not imply that x!=y is false. Accordingly,
517  when defining __eq__(), one should also define __ne__() so that the
518  operators will behave as expected.
519
520They really mean that. If you define __eq__ but not __ne__ doing "!="
521on instances compares them by identity. This is surprisingly easy to
522miss, especially since the natural way to write unit tests for classes
523with custom comparisons goes like this::
524
525  self.assertEqual(YourClass(1), YourClass(1))
526  # Repeat for more possible values. Uses == and therefore __eq__,
527  # behaves as expected.
528  self.assertNotEqual(YourClass(1), YourClass(2))
529  # Repeat for more possible values. Uses != and therefore object
530  # identity, so they all pass (all different instances)!
531
532So you end up only testing __eq__ on equal values (it can say
533"identical" for different values without you noticing).
534
535Adding a __ne__ that just does "return not self == other" fixes this.
536
537
538__eq__/__hash__ and subclassing
539===============================
540
541If your class has a custom __eq__ and it might be subclassed you have
542to be very careful about how you "compare" to instances of a subclass.
543Usually you will want to be "different" from those unconditionally::
544
545  def __eq__(self, other):
546      if self.__class is not YourClass or other.__class__ is not YourClass:
547          return False
548      # Your actual code goes here
549
550This might seem like overkill, but it is necessary to avoid problems if
551you are subclassed and the subclass does not have a new __eq__. If you
552just do an "isinstance(other, self.__class__)" check you will compare
553equal to instances of a subclass, which is usually not what you want.
554If you just check for "self.__class__ is other.__class__" then
555subclasses that add a new attribute without overriding __eq__ will
556compare equal when they should not (because the new attribute
557differs).
558
559If you subclass something that has an __eq__ you should most likely
560override it (you might get away with not doing so if the class does
561not do the type check demonstrated above). If you add a new attribute
562don't forget to override __hash__ too (that is not critical, but you
563will have unnecessary hash collisions if you forget it).
564
565This is especially important for pkgcore because of
566pkgcore.util.caching. If an instance of a class with a broken __eq__
567is used as argument for the __init__ of a class that uses
568caching.WeakInstMeta it will cause a cached instance to be used when
569it should not. Notice the class with the broken __eq__ does not have
570to be cached itself to trigger this! Getting this wrong can cause fun
571behaviour like atoms showing up in the list of fetchables because the
572restrictions they're in compare equal independent of their "payload".
573
574
575Exception subclassing
576=====================
577
578It is pretty common for an Exception subclass to want to customize the
579return value of str() on an instance. The easiest way to do that is::
580
581  class MyException(Exception):
582
583      """Describe when it is raised here."""
584
585      def __init__(self, stuff):
586          Exception.__init__(self, 'MyException because of %s' % (stuff,))
587
588This is usually easier than defining a custom __str__ (since you do
589not have to store the value of "stuff" as an attribute) and you should
590be calling the base class __init__ anyway.
591
592(This does not mean you should never store things like "stuff" as
593attrs: it can be very useful for code catching the exception to have
594access to it. Use common sense.)
Note: See TracBrowser for help on using the browser.