| 1 | ========================= |
|---|
| 2 | python code guidelines. |
|---|
| 3 | ========================= |
|---|
| 4 | |
|---|
| 5 | lingo: |
|---|
| 6 | |
|---|
| 7 | fex |
|---|
| 8 | for example |
|---|
| 9 | imo |
|---|
| 10 | in my opinion; 'my' refering to the author |
|---|
| 11 | |
|---|
| 12 | Note not all of the existing code follows this. |
|---|
| 13 | Doesn't mean existing code is correct however ;) |
|---|
| 14 | |
|---|
| 15 | Stats are all from a sempron 1.6ghz with python 2.4.2 |
|---|
| 16 | |
|---|
| 17 | Finally, code _should_ be documented, following epydoc/epytext guidelines |
|---|
| 18 | |
|---|
| 19 | Follow pep8, with following exemptions |
|---|
| 20 | ====================================== |
|---|
| 21 | |
|---|
| 22 | - <80 char limit is only applicable where it doesn't make the logic |
|---|
| 23 | ugly. This is not an excuse to have a 200 char if statement (fix |
|---|
| 24 | your logic). Use common sense. |
|---|
| 25 | - combining imports is ok. |
|---|
| 26 | - use absolute imports |
|---|
| 27 | - _simple_ try/except combined lines are acceptable, but not forced |
|---|
| 28 | (your call). example:: |
|---|
| 29 | |
|---|
| 30 | try: l.remove(blah) |
|---|
| 31 | except IndexError: pass |
|---|
| 32 | |
|---|
| 33 | - for comments. 2 spaces trailing is pointless- not needed. |
|---|
| 34 | - classes should be named SomeClass, functions/methods should be named |
|---|
| 35 | some_func. |
|---|
| 36 | - Exceptions are classes. Don't raise strings either. |
|---|
| 37 | - avoid __var 'private' attributes unless you absolutely have a reason |
|---|
| 38 | to hide it, and the class won't be inherited (or that attribute |
|---|
| 39 | must _not_ be accessed) |
|---|
| 40 | - using string module functions when you could use a string method is |
|---|
| 41 | evil. Don't do it. |
|---|
| 42 | - use isinstance(str_instance, basestring) unless you _really_ need to |
|---|
| 43 | know if it's utf8/ascii |
|---|
| 44 | |
|---|
| 45 | Throw self with a NotImplementedError |
|---|
| 46 | ===================================== |
|---|
| 47 | |
|---|
| 48 | Reason for this is simple; if you just throw a NotImplementedError, you can't |
|---|
| 49 | tell how the path was hit if derivative classes are involved; thus throw |
|---|
| 50 | NotImplementedError(self, string_name_of_attr) |
|---|
| 51 | |
|---|
| 52 | Far better tbs. |
|---|
| 53 | |
|---|
| 54 | Be aware of what the interpretter is actually doing. |
|---|
| 55 | ==================================================== |
|---|
| 56 | |
|---|
| 57 | Don't use len(list_instance) when you just want to know if it's |
|---|
| 58 | nonempty/empty:: |
|---|
| 59 | |
|---|
| 60 | l=[1] |
|---|
| 61 | if l: blah |
|---|
| 62 | # instead of |
|---|
| 63 | if len(l): blah |
|---|
| 64 | |
|---|
| 65 | python looks for __nonzero__, then __len__. It's a fair sight faster |
|---|
| 66 | then if you try to be explicit there:: |
|---|
| 67 | |
|---|
| 68 | python -m timeit -s 'l=[]' 'if len(l) > 0: pass' |
|---|
| 69 | 1000000 loops, best of 3: 0.705 usec per loop |
|---|
| 70 | |
|---|
| 71 | python -m timeit -s 'l=[]' 'if len(l): pass' |
|---|
| 72 | 1000000 loops, best of 3: 0.689 usec per loop |
|---|
| 73 | |
|---|
| 74 | python -m timeit -s 'l=[]' 'if l: pass' |
|---|
| 75 | 1000000 loops, best of 3: 0.302 usec per loop |
|---|
| 76 | |
|---|
| 77 | don't explicitly use has_key. rely on the 'in' operator. |
|---|
| 78 | ========================================================= |
|---|
| 79 | |
|---|
| 80 | :: |
|---|
| 81 | |
|---|
| 82 | python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' 'd.has_key(1999999)' |
|---|
| 83 | 1000000 loops, best of 3: 0.512 usec per loop |
|---|
| 84 | |
|---|
| 85 | python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' '1999999 in d' |
|---|
| 86 | 1000000 loops, best of 3: 0.279 usec per loop |
|---|
| 87 | |
|---|
| 88 | Python interprets the 'in' command by using __contains__ on the |
|---|
| 89 | instance. The interpretter is faster at doing getattr's then actual |
|---|
| 90 | python code is; fex, the code above uses d.__contains__ , if you do |
|---|
| 91 | d.has_key or d.__contains__, it's the same speed. Using 'in' instead |
|---|
| 92 | has the interpretter do the lookup, and is faster. |
|---|
| 93 | |
|---|
| 94 | So... be aware of how the interpretter will execute that code. Python |
|---|
| 95 | code specified attribute access is slower then the interpretter doing |
|---|
| 96 | it on it's own. |
|---|
| 97 | |
|---|
| 98 | If in doubt, python -m timeit is your friend. ;-) |
|---|
| 99 | |
|---|
| 100 | Do not use [] or {} as default args in function/method definitions. |
|---|
| 101 | =================================================================== |
|---|
| 102 | |
|---|
| 103 | :: |
|---|
| 104 | |
|---|
| 105 | >>> def f(default=[]): |
|---|
| 106 | >>> default.append(1) |
|---|
| 107 | >>> return default |
|---|
| 108 | >>> print f() |
|---|
| 109 | [1] |
|---|
| 110 | >>> print f() |
|---|
| 111 | [1,1] |
|---|
| 112 | |
|---|
| 113 | When the function/class/method is defined, the default args are |
|---|
| 114 | instantiated _then_, not per call. End result is that if it's a |
|---|
| 115 | mutable default arg, you should use None and test; this is exempted if |
|---|
| 116 | you _know_ the code doesn't mangle the default. |
|---|
| 117 | |
|---|
| 118 | Visible curried functions should have documentation. |
|---|
| 119 | ==================================================== |
|---|
| 120 | |
|---|
| 121 | When using the currying methods (pkgcore.util.currying) for function |
|---|
| 122 | mangling, preserve the documentation via pretty_docs. |
|---|
| 123 | |
|---|
| 124 | If this is exempted, pydoc output for objects isn't incredibly useful. |
|---|
| 125 | |
|---|
| 126 | unit testing. |
|---|
| 127 | ============= |
|---|
| 128 | |
|---|
| 129 | All code _should_ have test case functionality. We use twisted.trial; should |
|---|
| 130 | be running >=2.2 (<2.2 results in false positives in the spawn tests). |
|---|
| 131 | Regressions should be test cased, exempting idiot mistakes (typos fex). |
|---|
| 132 | |
|---|
| 133 | More then willing to look at code that lacks tests, but |
|---|
| 134 | merging/integrating the code requires tests. |
|---|
| 135 | |
|---|
| 136 | One area that is (atm) exempted from this is the ebuild interaction; |
|---|
| 137 | testing that interface is extremely hard, although it _does_ need to |
|---|
| 138 | be implemented. |
|---|
| 139 | |
|---|
| 140 | If tests are missing from code (author didn't write tests initially), |
|---|
| 141 | new tests desired. :) |
|---|
| 142 | |
|---|
| 143 | |
|---|
| 144 | If it's FS related code, it's _usually_ cheaper to try then to ask then try. |
|---|
| 145 | ============================================================================ |
|---|
| 146 | |
|---|
| 147 | ...but you should verify it ;) |
|---|
| 148 | |
|---|
| 149 | |
|---|
| 150 | existing file (but empty to avoid reading overhead):: |
|---|
| 151 | |
|---|
| 152 | echo > dar |
|---|
| 153 | |
|---|
| 154 | python -m 'timeit' -s 'import os' 'os.path.exists("dar") and open("dar").read()' |
|---|
| 155 | 10000 loops, best of 3: 36.4 usec per loop |
|---|
| 156 | |
|---|
| 157 | python -m 'timeit' -s 'import os' $'try:open("dar").read()\nexcept IOError: pass' |
|---|
| 158 | 10000 loops, best of 3: 22 usec per loop |
|---|
| 159 | |
|---|
| 160 | nonexistant file:: |
|---|
| 161 | |
|---|
| 162 | rm foo |
|---|
| 163 | |
|---|
| 164 | python -m 'timeit' -s 'import os' 'os.path.exists("foo") and open("foo").read()' |
|---|
| 165 | 10000 loops, best of 3: 29.8 usec per loop |
|---|
| 166 | |
|---|
| 167 | python -m 'timeit' -s 'import os' $'try:open("foo").read()\nexcept IOError: pass' |
|---|
| 168 | 10000 loops, best of 3: 27.7 usec per loop |
|---|
| 169 | |
|---|
| 170 | Bit of a difference, no? |
|---|
| 171 | |
|---|
| 172 | Note that I qualified this with "If it's FS related code"; syscalls |
|---|
| 173 | aren't cheap- if it's not triggering syscalls, next section is |
|---|
| 174 | relevant. |
|---|
| 175 | |
|---|
| 176 | Catching Exceptions in python code (rather then cpython) isn't cheap. |
|---|
| 177 | ===================================================================== |
|---|
| 178 | |
|---|
| 179 | stats from python-2.4.2 |
|---|
| 180 | |
|---|
| 181 | When an exception is caught:: |
|---|
| 182 | |
|---|
| 183 | python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'try: d[1999]\nexcept KeyError: pass' |
|---|
| 184 | 100000 loops, best of 3: 8.7 usec per loop |
|---|
| 185 | |
|---|
| 186 | python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'1999 in d and d[1999]' |
|---|
| 187 | 1000000 loops, best of 3: 0.492 usec per loop |
|---|
| 188 | |
|---|
| 189 | When no exception is caught, overhead of try/except setup:: |
|---|
| 190 | |
|---|
| 191 | python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'try: d[0]\nexcept KeyError: pass' |
|---|
| 192 | 1000000 loops, best of 3: 0.532 usec per loop |
|---|
| 193 | |
|---|
| 194 | python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'d[0]' |
|---|
| 195 | 1000000 loops, best of 3: 0.407 usec per loop |
|---|
| 196 | |
|---|
| 197 | |
|---|
| 198 | Not advocating writing code that doesn't protect itself- just be aware |
|---|
| 199 | of what the code is actually doing, and be aware that exceptions in |
|---|
| 200 | python code are costly due to machinery involved. |
|---|
| 201 | |
|---|
| 202 | Another example is when to use or not to use dict's setdefault or get methods: |
|---|
| 203 | |
|---|
| 204 | key exists:: |
|---|
| 205 | |
|---|
| 206 | # Through exception handling |
|---|
| 207 | python -m timeit -s 'd=dict.fromkeys(range(100))' 'try: x=d[1]' 'except KeyError: x=42' |
|---|
| 208 | 1000000 loops, best of 3: 0.548 usec per loop |
|---|
| 209 | |
|---|
| 210 | # d.get |
|---|
| 211 | python -m timeit -s 'd=dict.fromkeys(range(100))' 'x=d.get(1, 42)' |
|---|
| 212 | 1000000 loops, best of 3: 1.01 usec per loop |
|---|
| 213 | |
|---|
| 214 | |
|---|
| 215 | key doesn't exist:: |
|---|
| 216 | |
|---|
| 217 | # Through exception handling |
|---|
| 218 | python -m timeit -s 'd=dict.fromkeys(range(100))' 'try: x=d[101]' 'except KeyError: x=42' |
|---|
| 219 | 100000 loops, best of 3: 8.8 usec per loop |
|---|
| 220 | |
|---|
| 221 | # d.get |
|---|
| 222 | python -m timeit -s 'd=dict.fromkeys(range(100))' 'x=d.get(101, 42)' |
|---|
| 223 | 1000000 loops, best of 3: 1.05 usec per loop |
|---|
| 224 | |
|---|
| 225 | |
|---|
| 226 | Short version? If you know the key is there, get is slower. If you |
|---|
| 227 | don't, get is your friend. IOW, use it instead of doing a containment |
|---|
| 228 | test then accessing the key. |
|---|
| 229 | |
|---|
| 230 | Of course this only considers the case where the default value is |
|---|
| 231 | simple. If it's something more costly "except" will do relatively |
|---|
| 232 | better since it's not constructing the default value if it's not |
|---|
| 233 | needed. So if in doubt and in a performance-critical piece of code: |
|---|
| 234 | benchmark parts of it with timeit instead of assuming "exceptions are |
|---|
| 235 | slow" or "[] is fast". |
|---|
| 236 | |
|---|
| 237 | cpython 'leaks' vars into local namespace for certain constructs |
|---|
| 238 | ================================================================ |
|---|
| 239 | |
|---|
| 240 | :: |
|---|
| 241 | |
|---|
| 242 | def f(s): |
|---|
| 243 | while True: |
|---|
| 244 | try: |
|---|
| 245 | some_func_that_throws_exception() |
|---|
| 246 | except Exception, e: |
|---|
| 247 | # e exists in this namespace now. |
|---|
| 248 | pass |
|---|
| 249 | # some other code here... |
|---|
| 250 | |
|---|
| 251 | from the code above, e bled into the f namespace- that's referenced |
|---|
| 252 | memory that isn't used, and will linger until the while loop exits. |
|---|
| 253 | |
|---|
| 254 | Python _does_ bleed variables into the local namespace- be aware of |
|---|
| 255 | this, and explicitly delete references you don't need when dealing in |
|---|
| 256 | large objs, especially dealing with exceptions:: |
|---|
| 257 | |
|---|
| 258 | class c: |
|---|
| 259 | d = {} |
|---|
| 260 | for x in range(1000): |
|---|
| 261 | d[x] = x |
|---|
| 262 | |
|---|
| 263 | Granted the class above is contrived, but the thing to note is that |
|---|
| 264 | c.x is now valid- the x from the for loop bleeds into the class |
|---|
| 265 | namespace and stays put. |
|---|
| 266 | |
|---|
| 267 | Don't leave uneeded vars lingering in class namespace. |
|---|
| 268 | |
|---|
| 269 | Variables that leak from for loops _normally_ isn't an issue, just be |
|---|
| 270 | aware it occurs- especially if the var is referencing a large object |
|---|
| 271 | (thus keeping it in memory). |
|---|
| 272 | |
|---|
| 273 | So... for loops leak, list comps leak, dependant on your except |
|---|
| 274 | clause, can leak also. |
|---|
| 275 | |
|---|
| 276 | Unless you need to generate (and save) a range result, use xrange. |
|---|
| 277 | ================================================================== |
|---|
| 278 | |
|---|
| 279 | python -m timeit 'for x in range(10000): pass' |
|---|
| 280 | 100 loops, best of 3: 2.01 msec per loop |
|---|
| 281 | |
|---|
| 282 | $ python -m timeit 'for x in xrange(10000): pass' |
|---|
| 283 | 1000 loops, best of 3: 1.69 msec per loop |
|---|
| 284 | |
|---|
| 285 | Removals from a list aren't cheap, especially left most. |
|---|
| 286 | ======================================================== |
|---|
| 287 | |
|---|
| 288 | If you _do_ need to do left most removals, deque module is your friend. |
|---|
| 289 | |
|---|
| 290 | Rightmost ain't all that cheap either, depending on what idiocy folks |
|---|
| 291 | come up with to try and 'help' the interpretter:: |
|---|
| 292 | |
|---|
| 293 | python -m timeit $'l=range(1000);i=0;\nwhile i < len(l):\n\tif l[i]!="asdf":del l[i]\n\telse:i+=1' |
|---|
| 294 | 100 loops, best of 3: 4.12 msec per loop |
|---|
| 295 | |
|---|
| 296 | python -m timeit $'l=range(1000);\nfor i in xrange(len(l)-1,-1,-1):\n\tif l[i]!="asdf":del l[i]' |
|---|
| 297 | 100 loops, best of 3: 3 msec per loop |
|---|
| 298 | |
|---|
| 299 | python -m timeit 'l=range(1000);l=[x for x in l if x == "asdf"]' |
|---|
| 300 | 1000 loops, best of 3: 1 msec per loop |
|---|
| 301 | |
|---|
| 302 | Granted, that's worst case, but worst case is usually where folks get bit. |
|---|
| 303 | (best case still is faster for list comp btw). |
|---|
| 304 | |
|---|
| 305 | Related note, don't pop unless you have a reason to. |
|---|
| 306 | |
|---|
| 307 | If you're testing for None specifically, be aware of the 'is' operator. |
|---|
| 308 | ======================================================================= |
|---|
| 309 | |
|---|
| 310 | Is avoids the equality protocol, and does a straight ptr comparison:: |
|---|
| 311 | |
|---|
| 312 | python -m timeit '1 != None' |
|---|
| 313 | 1000000 loops, best of 3: 0.721 usec per loop |
|---|
| 314 | |
|---|
| 315 | $ python -m timeit '1 is not None' |
|---|
| 316 | 1000000 loops, best of 3: 0.343 usec per loop |
|---|
| 317 | |
|---|
| 318 | Deprecated/crappy modules |
|---|
| 319 | ========================= |
|---|
| 320 | |
|---|
| 321 | - Don't use types module. Use isinstance (this isn't a speed reason, |
|---|
| 322 | types sucks). |
|---|
| 323 | - Don't use strings module. There are exceptions, but use string |
|---|
| 324 | methods when available. |
|---|
| 325 | - Don't use stat module just to get a stat attribute- fex:: |
|---|
| 326 | import stats |
|---|
| 327 | l=os.stat("asdf")[stat.ST_MODE] |
|---|
| 328 | |
|---|
| 329 | # can be done as (and a bit cleaner) |
|---|
| 330 | l=os.stat("asdf").st_mode |
|---|
| 331 | |
|---|
| 332 | |
|---|
| 333 | Know the exceptions that are thrown, and catch just those you're interested in. |
|---|
| 334 | =============================================================================== |
|---|
| 335 | |
|---|
| 336 | :: |
|---|
| 337 | |
|---|
| 338 | try: |
|---|
| 339 | blah |
|---|
| 340 | except Exception: |
|---|
| 341 | blah2 |
|---|
| 342 | |
|---|
| 343 | ^^^ major issue here. It catches SystemExit exceptions (trigger by |
|---|
| 344 | keyboard interupts); meaning this code, which was just crappy |
|---|
| 345 | exception handling now swallows ctrl+c (meaning it now screws with UI |
|---|
| 346 | code). |
|---|
| 347 | |
|---|
| 348 | Catch what you're interested in *only*. |
|---|
| 349 | |
|---|
| 350 | tuples versus lists. |
|---|
| 351 | ==================== |
|---|
| 352 | |
|---|
| 353 | Former is immutable, latter is mutable. |
|---|
| 354 | |
|---|
| 355 | Latter over-allocates (cpython thing), meaning it takes up more memory |
|---|
| 356 | then is used (this is actually a good thing usually). |
|---|
| 357 | |
|---|
| 358 | If you're generating/storing a lot of sequences that shouldn't be |
|---|
| 359 | modified, use tuples. Cheaper in memory, and folks can reference the |
|---|
| 360 | tuple directly without concerns of it being mutated elsewhere. |
|---|
| 361 | |
|---|
| 362 | Using lists there however would require each consumer to copy the list |
|---|
| 363 | to protect themselves from mutation. So... over-allocation + |
|---|
| 364 | allocating a new list for each consumer. |
|---|
| 365 | |
|---|
| 366 | Bad, mm'kay. |
|---|
| 367 | |
|---|
| 368 | for immutable instances (tuples/strings fex), trying to copy them is dumb |
|---|
| 369 | ========================================================================= |
|---|
| 370 | |
|---|
| 371 | fex: copy.copy((1,2,3)) is dumb; nobody makes a mistake that obvious, |
|---|
| 372 | but in larger code folks do (folks even try using [:] to copy a |
|---|
| 373 | string; it returns the same string since it's immutable). |
|---|
| 374 | |
|---|
| 375 | Can't modify them, thus there is no point in trying to make copies of them. |
|---|
| 376 | |
|---|
| 377 | |
|---|
| 378 | __del__ methods mes with garbage collection |
|---|
| 379 | =========================================== |
|---|
| 380 | |
|---|
| 381 | __del__ methods have the annoying side affect of blocking garbage |
|---|
| 382 | collection when that instance is involved in a cycle- basically, the |
|---|
| 383 | interpretter doesn't know what __del__ is going to reference, so its |
|---|
| 384 | unknowable (general case) how to break the cycle. |
|---|
| 385 | |
|---|
| 386 | So... if you're using __del__ methods, make sure the instance doesn't |
|---|
| 387 | wind up in a cycle (whether careful data structs, or weakref usage). |
|---|
| 388 | |
|---|
| 389 | (General) python isn't slow, your algorithm is. |
|---|
| 390 | =============================================== |
|---|
| 391 | |
|---|
| 392 | :: |
|---|
| 393 | |
|---|
| 394 | l = [] |
|---|
| 395 | for x in data_generator(): |
|---|
| 396 | if x not in l: |
|---|
| 397 | l.append(x) |
|---|
| 398 | |
|---|
| 399 | That code is _best_ case O(1) (yielding all 0's fex). Worst case is |
|---|
| 400 | O(N^2). |
|---|
| 401 | |
|---|
| 402 | :: |
|---|
| 403 | |
|---|
| 404 | l=set() |
|---|
| 405 | for x in data_generator(): |
|---|
| 406 | if x not in l: |
|---|
| 407 | l.add(x) |
|---|
| 408 | |
|---|
| 409 | Best/Worst are now constant (not quite due to potential expansion of |
|---|
| 410 | the set internally, but that's ignorable in this case). |
|---|
| 411 | |
|---|
| 412 | Further, the first loop actually invokes the __eq__ protocol for x for |
|---|
| 413 | each element, which can potentially be *quite* slow if dealing in |
|---|
| 414 | complex objs. |
|---|
| 415 | |
|---|
| 416 | The second loop invokes __hash__ once on x instead. Seeing the gain? |
|---|
| 417 | |
|---|
| 418 | Technically, second loop still is a bit innefficient:: |
|---|
| 419 | |
|---|
| 420 | l=set(data_generator()) |
|---|
| 421 | |
|---|
| 422 | being simpler and faster. |
|---|
| 423 | |
|---|
| 424 | example data for folks who don't see how _bad_ this can get:: |
|---|
| 425 | |
|---|
| 426 | python -m timeit $'l=[]\nfor x in xrange(1000):\n\tif x not in l:l.append(x)' |
|---|
| 427 | 10 loops, best of 3: 74.4 msec per loop |
|---|
| 428 | |
|---|
| 429 | python -m timeit $'l=set()\nfor x in xrange(1000):\n\tif x not in l:l.add(x)' |
|---|
| 430 | 1000 loops, best of 3: 1.24 msec per loop |
|---|
| 431 | |
|---|
| 432 | python -m timeit 'l=set(xrange(1000))' |
|---|
| 433 | 1000 loops, best of 3: 278 usec per loop |
|---|
| 434 | |
|---|
| 435 | Bit of a difference, no? |
|---|
| 436 | |
|---|
| 437 | This does _not_ mean that sets are automatically/better everywhere, |
|---|
| 438 | just be aware of what you're doing- for a single search of a range |
|---|
| 439 | (fex), the overhead of is far slower then a linear search. Kind of a |
|---|
| 440 | 'duh', but folks do do this sometimes:: |
|---|
| 441 | |
|---|
| 442 | python -m timeit -s 'l=range(50)' $'if 1001 in set(l): pass' |
|---|
| 443 | 100000 loops, best of 3: 12.2 usec per loop |
|---|
| 444 | |
|---|
| 445 | python -m timeit -s 'l=range(50)' $'if 1001 in l: pass' |
|---|
| 446 | 10000 loops, best of 3: 7.68 usec per loop |
|---|
| 447 | |
|---|
| 448 | What's up with __hash__ and dicts |
|---|
| 449 | ================================= |
|---|
| 450 | |
|---|
| 451 | A bunch of things (too many things most likely) in the codebase define |
|---|
| 452 | __hash__. The rule for __hash__ is (quoted from |
|---|
| 453 | http://docs.python.org/ref/customization.html): |
|---|
| 454 | |
|---|
| 455 | Should return a 32-bit integer usable as a hash value for dictionary |
|---|
| 456 | operations. The only required property is that objects which compare |
|---|
| 457 | equal have the same hash value. |
|---|
| 458 | |
|---|
| 459 | Quick/rough explanation for folks who do not know how a "dict" works |
|---|
| 460 | internally: |
|---|
| 461 | |
|---|
| 462 | - Things added to it are dumped in a "bucket" depending on their hash |
|---|
| 463 | value. |
|---|
| 464 | - To check if something is in the dict it first determines the bucket |
|---|
| 465 | to check (based on hash value), then does equality checks (__cmp__ |
|---|
| 466 | or __eq__ if there is one, object identity comparison otherwise) for |
|---|
| 467 | everything in the bucket (if there is anything). |
|---|
| 468 | |
|---|
| 469 | So what does this mean? |
|---|
| 470 | |
|---|
| 471 | - There's no reason at all to define your own __hash__ unless you also |
|---|
| 472 | define __eq__ or __cmp__. Behaviour of your object in dicts/sets |
|---|
| 473 | will not change, it will just be slower (since your own __hash__ is |
|---|
| 474 | almost certainly slower than the default one). |
|---|
| 475 | - If you define __eq__ or __cmp__ and want your object to be usable in |
|---|
| 476 | a dict you have to define __hash__. If you don't the default |
|---|
| 477 | __hash__ is used which means your objects act in dicts like only |
|---|
| 478 | object identity matters *until* you hit a hash collision and your |
|---|
| 479 | own __eq__ or __cmp__ kicks in. |
|---|
| 480 | - If you do define your own __hash__ it has to produce the same value |
|---|
| 481 | for objects that compare equal, or you get *really* weird behaviour |
|---|
| 482 | in dicts/sets ("thing in dict" returning False because the hash |
|---|
| 483 | values differ while "thing in dict.keys()" returns True because that |
|---|
| 484 | does not use the hash value, only does equality checks). |
|---|
| 485 | - If the hash value changes after the object was put in a dict you get |
|---|
| 486 | weird behaviour too ("s=set([thing]); thing.change_hash();thing in s" |
|---|
| 487 | is False, but "thing in list(s)" is True). So if your objects are |
|---|
| 488 | mutable they can usually provide __eq__/__cmp__ but not __hash__. |
|---|
| 489 | - Not having many hash "collisions" (same hash value for objects that |
|---|
| 490 | compare nonequal) is good, but collisions are not illegal. Too many |
|---|
| 491 | of them just slow down dict/set operations (in a worst case scenario |
|---|
| 492 | of the same hash for every object dict/set operations become linear |
|---|
| 493 | searches through the single hash bucket everything ends up in). |
|---|
| 494 | - If you use the hash value directly keep in mind that collisions are |
|---|
| 495 | legal. Do not use comparisons of hash values as a substitute for |
|---|
| 496 | comparing objects (implementing __eq__ / __cmp__). Probably the only |
|---|
| 497 | legitimate use of hash() is to determine an object's hash value |
|---|
| 498 | based on things used for comparison. |
|---|
| 499 | |
|---|
| 500 | |
|---|
| 501 | __eq__ and __ne__ |
|---|
| 502 | ================= |
|---|
| 503 | |
|---|
| 504 | From http://docs.python.org/ref/customization.html: |
|---|
| 505 | |
|---|
| 506 | There are no implied relationships among the comparison operators. |
|---|
| 507 | The truth of x==y does not imply that x!=y is false. Accordingly, |
|---|
| 508 | when defining __eq__(), one should also define __ne__() so that the |
|---|
| 509 | operators will behave as expected. |
|---|
| 510 | |
|---|
| 511 | They really mean that. If you define __eq__ but not __ne__ doing "!=" |
|---|
| 512 | on instances compares them by identity. This is surprisingly easy to |
|---|
| 513 | miss, especially since the natural way to write unit tests for classes |
|---|
| 514 | with custom comparisons goes like this:: |
|---|
| 515 | |
|---|
| 516 | self.assertEquals(YourClass(1), YourClass(1)) |
|---|
| 517 | # Repeat for more possible values. Uses == and therefore __eq__, |
|---|
| 518 | # behaves as expected. |
|---|
| 519 | self.assertNotEquals(YourClass(1), YourClass(2)) |
|---|
| 520 | # Repeat for more possible values. Uses != and therefore object |
|---|
| 521 | # identity, so they all pass (all different instances)! |
|---|
| 522 | |
|---|
| 523 | So you end up only testing __eq__ on equal values (it can say |
|---|
| 524 | "identical" for different values without you noticing). |
|---|
| 525 | |
|---|
| 526 | Adding a __ne__ that just does "return not self == other" fixes this. |
|---|
| 527 | |
|---|
| 528 | |
|---|
| 529 | __eq__/__hash__ and subclassing |
|---|
| 530 | =============================== |
|---|
| 531 | |
|---|
| 532 | If your class has a custom __eq__ and it might be subclassed you have |
|---|
| 533 | to be very careful about how you "compare" to instances of a subclass. |
|---|
| 534 | Usually you will want to be "different" from those unconditionally:: |
|---|
| 535 | |
|---|
| 536 | def __eq__(self, other): |
|---|
| 537 | if self.__class is not YourClass or other.__class__ is not YourClass: |
|---|
| 538 | return False |
|---|
| 539 | # Your actual code goes here |
|---|
| 540 | |
|---|
| 541 | This might seem like overkill, but is necessary to avoid problems if |
|---|
| 542 | you are subclassed and the subclass does not have a new __eq__. If you |
|---|
| 543 | just do an "isinstance(other, self.__class__)" check you will compare |
|---|
| 544 | equal to instances of a subclass, which is usually not what you want. |
|---|
| 545 | If you just check for "self.__class__ is other.__class__" then |
|---|
| 546 | subclasses that add a new attribute without overriding __eq__ will |
|---|
| 547 | compare equal when they should not (because the new attribute |
|---|
| 548 | differs). |
|---|
| 549 | |
|---|
| 550 | If you subclass something that has an __eq__ you should most likely |
|---|
| 551 | override it (you might get away with not doing so if the class does |
|---|
| 552 | not do the type check demonstrated above). If you add a new attribute |
|---|
| 553 | don't forget to override __hash__ too (that is not critical, but you |
|---|
| 554 | will have unnecessary hash collisions if you forget it). |
|---|
| 555 | |
|---|
| 556 | This is especially important for pkgcore because of |
|---|
| 557 | pkgcore.util.caching. If an instance of a class with a broken __eq__ |
|---|
| 558 | is used as argument for the __init__ of a class that uses |
|---|
| 559 | caching.WeakInstMeta it will cause a cached instance to be used when |
|---|
| 560 | it should not. Notice the class with the broken __eq__ does not have |
|---|
| 561 | to be cached itself to trigger this! Getting this wrong can cause fun |
|---|
| 562 | behaviour like atoms showing up in the list of fetchables because the |
|---|
| 563 | restrictions they're in compare equal independent of their "payload". |
|---|
| 564 | |
|---|
| 565 | |
|---|
| 566 | Exception subclassing |
|---|
| 567 | ===================== |
|---|
| 568 | |
|---|
| 569 | It is pretty common for an Exception subclass to want to customize the |
|---|
| 570 | return value of str() on an instance. The easiest way to do that is:: |
|---|
| 571 | |
|---|
| 572 | class MyException(Exception): |
|---|
| 573 | |
|---|
| 574 | """Describe when it is raised here.""" |
|---|
| 575 | |
|---|
| 576 | def __init__(self, stuff): |
|---|
| 577 | Exception.__init__(self, 'MyException because of %s' % (stuff,)) |
|---|
| 578 | |
|---|
| 579 | This is usually easier than defining a custom __str__ (since you do |
|---|
| 580 | not have to store the value of "stuff" as an attribute) and you should |
|---|
| 581 | be calling the base class __init__ anyway. |
|---|
| 582 | |
|---|
| 583 | (This does not mean you should never store things like "stuff" as |
|---|
| 584 | attrs: it can be very useful for code catching the exception to have |
|---|
| 585 | access to it. Use common sense.) |
|---|