| 1 | ======================== |
|---|
| 2 | Python Code Guidelines |
|---|
| 3 | ======================== |
|---|
| 4 | |
|---|
| 5 | Note that not all of the existing code follows this style guide. |
|---|
| 6 | This doesn't mean existing code is correct. |
|---|
| 7 | |
|---|
| 8 | Stats are all from a sempron 1.6Ghz with python 2.4.2. |
|---|
| 9 | |
|---|
| 10 | Finally, code _should_ be documented, following epydoc/epytext guidelines |
|---|
| 11 | |
|---|
| 12 | Follow pep8, with following exemptions |
|---|
| 13 | ====================================== |
|---|
| 14 | |
|---|
| 15 | - <80 char limit is only applicable where it doesn't make the logic |
|---|
| 16 | ugly. This is not an excuse to have a 200 char if statement (fix |
|---|
| 17 | your logic). Use common sense. |
|---|
| 18 | - Combining imports is ok. |
|---|
| 19 | - Use absolute imports |
|---|
| 20 | - _Simple_ try/except combined lines are acceptable, but not forced |
|---|
| 21 | (this is your call). example:: |
|---|
| 22 | |
|---|
| 23 | try: l.remove(blah) |
|---|
| 24 | except IndexError: pass |
|---|
| 25 | |
|---|
| 26 | - For comments, 2 spaces trailing is pointless- not needed. |
|---|
| 27 | - Classes should be named SomeClass, functions/methods should be named |
|---|
| 28 | some_func. |
|---|
| 29 | - Exceptions are classes. Don't raise strings. |
|---|
| 30 | - Avoid __var 'private' attributes unless you absolutely have a reason |
|---|
| 31 | to hide it, and the class won't be inherited (or that attribute |
|---|
| 32 | must _not_ be accessed) |
|---|
| 33 | - Using string module functions when you could use a string method is |
|---|
| 34 | evil. Don't do it. |
|---|
| 35 | - Use isinstance(str_instance, basestring) unless you _really_ need to |
|---|
| 36 | know if it's utf8/ascii |
|---|
| 37 | |
|---|
| 38 | Throw self with a NotImplementedError |
|---|
| 39 | ===================================== |
|---|
| 40 | |
|---|
| 41 | The reason for this is simple: if you just throw a NotImplementedError, |
|---|
| 42 | you can't tell how the path was hit if derivative classes are involved; |
|---|
| 43 | thus throw NotImplementedError(self, string_name_of_attr) |
|---|
| 44 | |
|---|
| 45 | This gives far better tracebacks. |
|---|
| 46 | |
|---|
| 47 | Be aware of what the interpreter is actually doing |
|---|
| 48 | ================================================== |
|---|
| 49 | |
|---|
| 50 | Don't use len(list_instance) when you just want to know if it's |
|---|
| 51 | nonempty/empty:: |
|---|
| 52 | |
|---|
| 53 | l=[1] |
|---|
| 54 | if l: blah |
|---|
| 55 | # instead of |
|---|
| 56 | if len(l): blah |
|---|
| 57 | |
|---|
| 58 | python looks for __nonzero__, then __len__. It's far faster |
|---|
| 59 | than if you try to be explicit there:: |
|---|
| 60 | |
|---|
| 61 | python -m timeit -s 'l=[]' 'if len(l) > 0: pass' |
|---|
| 62 | 1000000 loops, best of 3: 0.705 usec per loop |
|---|
| 63 | |
|---|
| 64 | python -m timeit -s 'l=[]' 'if len(l): pass' |
|---|
| 65 | 1000000 loops, best of 3: 0.689 usec per loop |
|---|
| 66 | |
|---|
| 67 | python -m timeit -s 'l=[]' 'if l: pass' |
|---|
| 68 | 1000000 loops, best of 3: 0.302 usec per loop |
|---|
| 69 | |
|---|
| 70 | Don't explicitly use has_key. Rely on the 'in' operator |
|---|
| 71 | ======================================================= |
|---|
| 72 | |
|---|
| 73 | :: |
|---|
| 74 | |
|---|
| 75 | python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' 'd.has_key(1999999)' |
|---|
| 76 | 1000000 loops, best of 3: 0.512 usec per loop |
|---|
| 77 | |
|---|
| 78 | python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' '1999999 in d' |
|---|
| 79 | 1000000 loops, best of 3: 0.279 usec per loop |
|---|
| 80 | |
|---|
| 81 | Python interprets the 'in' command by using __contains__ on the |
|---|
| 82 | instance. The interpreter is faster at doing getattr's than actual |
|---|
| 83 | python code is: for example, the code above uses d.__contains__ , if you do |
|---|
| 84 | d.has_key or d.__contains__, it's the same speed. Using 'in' is faster |
|---|
| 85 | because it has the interpreter do the lookup. |
|---|
| 86 | |
|---|
| 87 | So be aware of how the interpreter will execute that code. Python |
|---|
| 88 | code specified attribute access is slower then the interpreter doing |
|---|
| 89 | it on its own. |
|---|
| 90 | |
|---|
| 91 | If you're in doubt, python -m timeit is your friend. ;-) |
|---|
| 92 | |
|---|
| 93 | Do not use [] or {} as default args in function/method definitions |
|---|
| 94 | ================================================================== |
|---|
| 95 | |
|---|
| 96 | :: |
|---|
| 97 | |
|---|
| 98 | >>> def f(default=[]): |
|---|
| 99 | >>> default.append(1) |
|---|
| 100 | >>> return default |
|---|
| 101 | >>> print f() |
|---|
| 102 | [1] |
|---|
| 103 | >>> print f() |
|---|
| 104 | [1,1] |
|---|
| 105 | |
|---|
| 106 | When the function/class/method is defined, the default args are |
|---|
| 107 | instantiated _then_, not per call. The end result of this is that if it's a |
|---|
| 108 | mutable default arg, you should use None and test for it being None; this is |
|---|
| 109 | exempted if you _know_ the code doesn't mangle the default. |
|---|
| 110 | |
|---|
| 111 | Visible curried functions should have documentation |
|---|
| 112 | =================================================== |
|---|
| 113 | |
|---|
| 114 | When using the currying methods (pkgcore.util.currying) for function |
|---|
| 115 | mangling, preserve the documentation via pretty_docs. |
|---|
| 116 | |
|---|
| 117 | If this is exempted, pydoc output for objects isn't incredibly useful. |
|---|
| 118 | |
|---|
| 119 | Unit testing |
|---|
| 120 | ============ |
|---|
| 121 | |
|---|
| 122 | All code _should_ have test case functionality. We use twisted.trial - you |
|---|
| 123 | should be running >=2.2 (<2.2 results in false positives in the spawn tests). |
|---|
| 124 | Regressions should be test cased, exempting idiot mistakes (e.g, typos). |
|---|
| 125 | |
|---|
| 126 | We are more than willing to look at code that lacks tests, but |
|---|
| 127 | actually merging the code to integration requires that it has tests. |
|---|
| 128 | |
|---|
| 129 | One area that is (at the moment) exempted from this is the ebuild interaction; |
|---|
| 130 | testing that interface is extremely hard, although it _does_ need to |
|---|
| 131 | be implemented. |
|---|
| 132 | |
|---|
| 133 | If tests are missing from code (due to tests not being written initially), |
|---|
| 134 | new tests are always desired. |
|---|
| 135 | |
|---|
| 136 | |
|---|
| 137 | If it's FS related code, it's _usually_ cheaper to try then to ask then try |
|---|
| 138 | =========================================================================== |
|---|
| 139 | |
|---|
| 140 | ...but you should verify it ;) |
|---|
| 141 | |
|---|
| 142 | |
|---|
| 143 | existing file (but empty to avoid reading overhead):: |
|---|
| 144 | |
|---|
| 145 | echo > dar |
|---|
| 146 | |
|---|
| 147 | python -m 'timeit' -s 'import os' 'os.path.exists("dar") and open("dar").read()' |
|---|
| 148 | 10000 loops, best of 3: 36.4 usec per loop |
|---|
| 149 | |
|---|
| 150 | python -m 'timeit' -s 'import os' $'try:open("dar").read()\nexcept IOError: pass' |
|---|
| 151 | 10000 loops, best of 3: 22 usec per loop |
|---|
| 152 | |
|---|
| 153 | nonexistant file:: |
|---|
| 154 | |
|---|
| 155 | rm foo |
|---|
| 156 | |
|---|
| 157 | python -m 'timeit' -s 'import os' 'os.path.exists("foo") and open("foo").read()' |
|---|
| 158 | 10000 loops, best of 3: 29.8 usec per loop |
|---|
| 159 | |
|---|
| 160 | python -m 'timeit' -s 'import os' $'try:open("foo").read()\nexcept IOError: pass' |
|---|
| 161 | 10000 loops, best of 3: 27.7 usec per loop |
|---|
| 162 | |
|---|
| 163 | As you can see, there is a bit of a difference. :) |
|---|
| 164 | |
|---|
| 165 | Note that this was qualified with "If it's FS related code"; syscalls |
|---|
| 166 | are not cheap- if it's not triggering syscalls, the next section is |
|---|
| 167 | relevant. |
|---|
| 168 | |
|---|
| 169 | Catching Exceptions in python code (rather then cpython) isn't cheap |
|---|
| 170 | ==================================================================== |
|---|
| 171 | |
|---|
| 172 | stats from python-2.4.2 |
|---|
| 173 | |
|---|
| 174 | When an exception is caught:: |
|---|
| 175 | |
|---|
| 176 | python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'try: d[1999]\nexcept KeyError: pass' |
|---|
| 177 | 100000 loops, best of 3: 8.7 usec per loop |
|---|
| 178 | |
|---|
| 179 | python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'1999 in d and d[1999]' |
|---|
| 180 | 1000000 loops, best of 3: 0.492 usec per loop |
|---|
| 181 | |
|---|
| 182 | When no exception is caught, overhead of try/except setup:: |
|---|
| 183 | |
|---|
| 184 | python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'try: d[0]\nexcept KeyError: pass' |
|---|
| 185 | 1000000 loops, best of 3: 0.532 usec per loop |
|---|
| 186 | |
|---|
| 187 | python -m 'timeit' -s 'd=dict(zip(range(1000), range(1000)))' $'d[0]' |
|---|
| 188 | 1000000 loops, best of 3: 0.407 usec per loop |
|---|
| 189 | |
|---|
| 190 | |
|---|
| 191 | This doesn't advocate writing code that doesn't protect itself- just be aware |
|---|
| 192 | of what the code is actually doing, and be aware that exceptions in |
|---|
| 193 | python code are costly due to the machinery involved. |
|---|
| 194 | |
|---|
| 195 | Another example is when to use or not to use dict's setdefault or get methods: |
|---|
| 196 | |
|---|
| 197 | key exists:: |
|---|
| 198 | |
|---|
| 199 | # Through exception handling |
|---|
| 200 | python -m timeit -s 'd=dict.fromkeys(range(100))' 'try: x=d[1]' 'except KeyError: x=42' |
|---|
| 201 | 1000000 loops, best of 3: 0.548 usec per loop |
|---|
| 202 | |
|---|
| 203 | # d.get |
|---|
| 204 | python -m timeit -s 'd=dict.fromkeys(range(100))' 'x=d.get(1, 42)' |
|---|
| 205 | 1000000 loops, best of 3: 1.01 usec per loop |
|---|
| 206 | |
|---|
| 207 | |
|---|
| 208 | key doesn't exist:: |
|---|
| 209 | |
|---|
| 210 | # Through exception handling |
|---|
| 211 | python -m timeit -s 'd=dict.fromkeys(range(100))' 'try: x=d[101]' 'except KeyError: x=42' |
|---|
| 212 | 100000 loops, best of 3: 8.8 usec per loop |
|---|
| 213 | |
|---|
| 214 | # d.get |
|---|
| 215 | python -m timeit -s 'd=dict.fromkeys(range(100))' 'x=d.get(101, 42)' |
|---|
| 216 | 1000000 loops, best of 3: 1.05 usec per loop |
|---|
| 217 | |
|---|
| 218 | |
|---|
| 219 | The short version of this is: if you know the key is there, dict.get() |
|---|
| 220 | is slower. If you don't, get is your friend. In other words, use it |
|---|
| 221 | instead of doing a containment test and then accessing the key. |
|---|
| 222 | |
|---|
| 223 | Of course this only considers the case where the default value is |
|---|
| 224 | simple. If it's something more costly "except" will do relatively |
|---|
| 225 | better since it's not constructing the default value if it's not |
|---|
| 226 | needed. So if in doubt and in a performance-critical piece of code: |
|---|
| 227 | benchmark parts of it with timeit instead of assuming "exceptions are |
|---|
| 228 | slow" or "[] is fast". |
|---|
| 229 | |
|---|
| 230 | cpython 'leaks' vars into local namespace for certain constructs |
|---|
| 231 | ================================================================ |
|---|
| 232 | |
|---|
| 233 | :: |
|---|
| 234 | |
|---|
| 235 | def f(s): |
|---|
| 236 | while True: |
|---|
| 237 | try: |
|---|
| 238 | some_func_that_throws_exception() |
|---|
| 239 | except Exception, e: |
|---|
| 240 | # e exists in this namespace now. |
|---|
| 241 | pass |
|---|
| 242 | # some other code here... |
|---|
| 243 | |
|---|
| 244 | From the code above, e bled into the f namespace- that's referenced |
|---|
| 245 | memory that isn't used, and will linger until the while loop exits. |
|---|
| 246 | |
|---|
| 247 | Python _does_ bleed variables into the local namespace- be aware of |
|---|
| 248 | this, and explicitly delete references you don't need when dealing in |
|---|
| 249 | large objs, especially dealing with exceptions:: |
|---|
| 250 | |
|---|
| 251 | class c: |
|---|
| 252 | d = {} |
|---|
| 253 | for x in range(1000): |
|---|
| 254 | d[x] = x |
|---|
| 255 | |
|---|
| 256 | While the class above is contrived, the thing to note is that |
|---|
| 257 | c.x is now valid- the x from the for loop bleeds into the class |
|---|
| 258 | namespace and stays put. |
|---|
| 259 | |
|---|
| 260 | Don't leave uneeded vars lingering in class namespace. |
|---|
| 261 | |
|---|
| 262 | Variables that leak from for loops _normally_ aren't an issue, just be |
|---|
| 263 | aware it does occur- especially if the var is referencing a large object |
|---|
| 264 | (thus keeping it in memory). |
|---|
| 265 | |
|---|
| 266 | So... for loops leak, list comps leak, dependant on your except |
|---|
| 267 | clause they can also leak. |
|---|
| 268 | |
|---|
| 269 | Do not go overboard with this though. If your function will exit soon |
|---|
| 270 | do not bother cleaning up variables by hand. If the "leaking" things |
|---|
| 271 | are small do not bother either. |
|---|
| 272 | |
|---|
| 273 | The current code deletes exception instances explicitly much more |
|---|
| 274 | often than it should since this was believed to clean up the traceback |
|---|
| 275 | object. This does not work: the only thing "del e" frees up is the |
|---|
| 276 | exception instance and the arguments passed to its constructor. "del |
|---|
| 277 | e" also takes a small amount of time to run (clearing up all locals |
|---|
| 278 | when the function exits is faster). |
|---|
| 279 | |
|---|
| 280 | Unless you need to generate (and save) a range result, use xrange |
|---|
| 281 | ================================================================= |
|---|
| 282 | |
|---|
| 283 | :: |
|---|
| 284 | python -m timeit 'for x in range(10000): pass' |
|---|
| 285 | 100 loops, best of 3: 2.01 msec per loop |
|---|
| 286 | |
|---|
| 287 | $ python -m timeit 'for x in xrange(10000): pass' |
|---|
| 288 | 1000 loops, best of 3: 1.69 msec per loop |
|---|
| 289 | |
|---|
| 290 | Removals from a list aren't cheap, especially left most |
|---|
| 291 | ======================================================= |
|---|
| 292 | |
|---|
| 293 | If you _do_ need to do left most removals, the deque module is your friend. |
|---|
| 294 | |
|---|
| 295 | Rightmost removals aren't too cheap either, depending on what idiocy people |
|---|
| 296 | come up with to try and 'help' the interpreter:: |
|---|
| 297 | |
|---|
| 298 | python -m timeit $'l=range(1000);i=0;\nwhile i < len(l):\n\tif l[i]!="asdf":del l[i]\n\telse:i+=1' |
|---|
| 299 | 100 loops, best of 3: 4.12 msec per loop |
|---|
| 300 | |
|---|
| 301 | python -m timeit $'l=range(1000);\nfor i in xrange(len(l)-1,-1,-1):\n\tif l[i]!="asdf":del l[i]' |
|---|
| 302 | 100 loops, best of 3: 3 msec per loop |
|---|
| 303 | |
|---|
| 304 | python -m timeit 'l=range(1000);l=[x for x in l if x == "asdf"]' |
|---|
| 305 | 1000 loops, best of 3: 1 msec per loop |
|---|
| 306 | |
|---|
| 307 | Granted, that's worst case, but the worst case is usually where people |
|---|
| 308 | get bitten (note the best case still is faster for list comprehension). |
|---|
| 309 | |
|---|
| 310 | On a related note, don't pop() unless you have a reason to. |
|---|
| 311 | |
|---|
| 312 | If you're testing for None specifically, be aware of the 'is' operator |
|---|
| 313 | ====================================================================== |
|---|
| 314 | |
|---|
| 315 | Is avoids the equality protocol, and does a straight ptr comparison:: |
|---|
| 316 | |
|---|
| 317 | python -m timeit '10000000 != None' |
|---|
| 318 | 1000000 loops, best of 3: 0.721 usec per loop |
|---|
| 319 | |
|---|
| 320 | $ python -m timeit '10000000 is not None' |
|---|
| 321 | 1000000 loops, best of 3: 0.343 usec per loop |
|---|
| 322 | |
|---|
| 323 | |
|---|
| 324 | Note that we're specificially forcing a large int; using 1 under 2.5 is the |
|---|
| 325 | same runtime, the reason for this is that it defaults to an identity check, |
|---|
| 326 | then a comparison; for small ints, python uses singletons, thus identity kicks in. |
|---|
| 327 | |
|---|
| 328 | Deprecated/crappy modules |
|---|
| 329 | ========================= |
|---|
| 330 | |
|---|
| 331 | - Don't use types module. Use isinstance (this isn't a speed reason, |
|---|
| 332 | types sucks). |
|---|
| 333 | - Don't use strings module. There are exceptions, but use string |
|---|
| 334 | methods when available. |
|---|
| 335 | - Don't use stat module just to get a stat attribute- e.g.,:: |
|---|
| 336 | import stats |
|---|
| 337 | l=os.stat("asdf")[stat.ST_MODE] |
|---|
| 338 | |
|---|
| 339 | # can be done as (and a bit cleaner) |
|---|
| 340 | l=os.stat("asdf").st_mode |
|---|
| 341 | |
|---|
| 342 | |
|---|
| 343 | Know the exceptions that are thrown, and catch just those you're interested in |
|---|
| 344 | ============================================================================== |
|---|
| 345 | |
|---|
| 346 | :: |
|---|
| 347 | |
|---|
| 348 | try: |
|---|
| 349 | blah |
|---|
| 350 | except Exception: |
|---|
| 351 | blah2 |
|---|
| 352 | |
|---|
| 353 | There is a major issue here. It catches SystemExit exceptions (triggered by |
|---|
| 354 | keyboard interupts); meaning this code, which was just bad exception handling |
|---|
| 355 | now swallows Ctrl+c (meaning it now screws with UI code). |
|---|
| 356 | |
|---|
| 357 | Catch what you're interested in *only*. |
|---|
| 358 | |
|---|
| 359 | tuples versus lists. |
|---|
| 360 | ==================== |
|---|
| 361 | |
|---|
| 362 | The former is immutable, while the latter is mutable. |
|---|
| 363 | |
|---|
| 364 | Lists over-allocate (a cpython thing), meaning it takes up more memory |
|---|
| 365 | then is used (this is actually a good thing usually). |
|---|
| 366 | |
|---|
| 367 | If you're generating/storing a lot of sequences that shouldn't be |
|---|
| 368 | modified, use tuples. They're cheaper in memory, and people can reference |
|---|
| 369 | the tuple directly without being concerned about it being mutated elsewhere. |
|---|
| 370 | |
|---|
| 371 | However, using lists there would require each consumer to copy the list |
|---|
| 372 | to protect themselves from mutation. So... over-allocation + |
|---|
| 373 | allocating a new list for each consumer. |
|---|
| 374 | |
|---|
| 375 | Bad, mm'kay. |
|---|
| 376 | |
|---|
| 377 | Don't try to copy immutable instances (e.g. tuples/strings) |
|---|
| 378 | =========================================================== |
|---|
| 379 | |
|---|
| 380 | Example: copy.copy((1,2,3)) is dumb; nobody makes a mistake that obvious, |
|---|
| 381 | but in larger code people do (people even try using [:] to copy a |
|---|
| 382 | string; it returns the same string since it's immutable). |
|---|
| 383 | |
|---|
| 384 | You can't modify them, therefore there is no point in trying to make copies of them. |
|---|
| 385 | |
|---|
| 386 | |
|---|
| 387 | __del__ methods mess with garbage collection |
|---|
| 388 | ============================================ |
|---|
| 389 | |
|---|
| 390 | __del__ methods have the annoying side affect of blocking garbage |
|---|
| 391 | collection when that instance is involved in a cycle- basically, the |
|---|
| 392 | interpreter doesn't know what __del__ is going to reference, so it's |
|---|
| 393 | unknowable (general case) how to break the cycle. |
|---|
| 394 | |
|---|
| 395 | So... if you're using __del__ methods, make sure the instance doesn't |
|---|
| 396 | wind up in a cycle (whether careful data structs, or weakref usage). |
|---|
| 397 | |
|---|
| 398 | A general point: python isn't slow, your algorithm is |
|---|
| 399 | ===================================================== |
|---|
| 400 | |
|---|
| 401 | :: |
|---|
| 402 | |
|---|
| 403 | l = [] |
|---|
| 404 | for x in data_generator(): |
|---|
| 405 | if x not in l: |
|---|
| 406 | l.append(x) |
|---|
| 407 | |
|---|
| 408 | That code is _best_ case O(1) (e.g., yielding all 0's). The worst case is |
|---|
| 409 | O(N^2). |
|---|
| 410 | |
|---|
| 411 | :: |
|---|
| 412 | |
|---|
| 413 | l=set() |
|---|
| 414 | for x in data_generator(): |
|---|
| 415 | if x not in l: |
|---|
| 416 | l.add(x) |
|---|
| 417 | |
|---|
| 418 | Best/Worst are now constant (this isn't strictly true due to the potential |
|---|
| 419 | expansion of the set internally, but that's ignorable in this case). |
|---|
| 420 | |
|---|
| 421 | Furthermore, the first loop actually invokes the __eq__ protocol for x for |
|---|
| 422 | each element, which can potentially be *quite* slow if dealing in |
|---|
| 423 | complex objs. |
|---|
| 424 | |
|---|
| 425 | The second loop invokes __hash__ once on x instead. |
|---|
| 426 | |
|---|
| 427 | Technically, the second loop still is a bit innefficient:: |
|---|
| 428 | |
|---|
| 429 | l=set(data_generator()) |
|---|
| 430 | |
|---|
| 431 | is simpler and faster. |
|---|
| 432 | |
|---|
| 433 | An example data for people who don't see how _bad_ this can get:: |
|---|
| 434 | |
|---|
| 435 | python -m timeit $'l=[]\nfor x in xrange(1000):\n\tif x not in l:l.append(x)' |
|---|
| 436 | 10 loops, best of 3: 74.4 msec per loop |
|---|
| 437 | |
|---|
| 438 | python -m timeit $'l=set()\nfor x in xrange(1000):\n\tif x not in l:l.add(x)' |
|---|
| 439 | 1000 loops, best of 3: 1.24 msec per loop |
|---|
| 440 | |
|---|
| 441 | python -m timeit 'l=set(xrange(1000))' |
|---|
| 442 | 1000 loops, best of 3: 278 usec per loop |
|---|
| 443 | |
|---|
| 444 | The difference here is obvious. |
|---|
| 445 | |
|---|
| 446 | This does _not_ mean that sets are automatically better everywhere, |
|---|
| 447 | just be aware of what you're doing- for a single search of a range, |
|---|
| 448 | the overhead of is far slower then a linear search. While this may |
|---|
| 449 | seem obvious, people do do this sometimes:: |
|---|
| 450 | |
|---|
| 451 | python -m timeit -s 'l=range(50)' $'if 1001 in set(l): pass' |
|---|
| 452 | 100000 loops, best of 3: 12.2 usec per loop |
|---|
| 453 | |
|---|
| 454 | python -m timeit -s 'l=range(50)' $'if 1001 in l: pass' |
|---|
| 455 | 10000 loops, best of 3: 7.68 usec per loop |
|---|
| 456 | |
|---|
| 457 | What's up with __hash__ and dicts |
|---|
| 458 | ================================= |
|---|
| 459 | |
|---|
| 460 | A bunch of things (too many things most likely) in the codebase define |
|---|
| 461 | __hash__. The rule for __hash__ is (quoted from |
|---|
| 462 | http://docs.python.org/ref/customization.html): |
|---|
| 463 | |
|---|
| 464 | Should return a 32-bit integer usable as a hash value for dictionary |
|---|
| 465 | operations. The only required property is that objects which compare |
|---|
| 466 | equal have the same hash value. |
|---|
| 467 | |
|---|
| 468 | Here's a quick rough explanation for people who do not know how a "dict" works |
|---|
| 469 | internally: |
|---|
| 470 | |
|---|
| 471 | - Things added to it are dumped in a "bucket" depending on their hash |
|---|
| 472 | value. |
|---|
| 473 | - To check if something is in the dict it first determines the bucket |
|---|
| 474 | to check (based on hash value), then does equality checks (__cmp__ |
|---|
| 475 | or __eq__ if there is one, otherwise object identity comparison) for |
|---|
| 476 | everything in the bucket (if there is anything). |
|---|
| 477 | |
|---|
| 478 | So what does this mean? |
|---|
| 479 | |
|---|
| 480 | - There's no reason at all to define your own __hash__ unless you also |
|---|
| 481 | define __eq__ or __cmp__. The behaviour of your object in dicts/sets |
|---|
| 482 | will not change, it will just be slower (since your own __hash__ is |
|---|
| 483 | almost certainly slower than the default one). |
|---|
| 484 | - If you define __eq__ or __cmp__ and want your object to be usable in |
|---|
| 485 | a dict you have to define __hash__. If you don't the default |
|---|
| 486 | __hash__ is used which means your objects act in dicts like only |
|---|
| 487 | object identity matters *until* you hit a hash collision and your |
|---|
| 488 | own __eq__ or __cmp__ kicks in. |
|---|
| 489 | - If you do define your own __hash__ it has to produce the same value |
|---|
| 490 | for objects that compare equal, or you get *really* weird behaviour |
|---|
| 491 | in dicts/sets ("thing in dict" returning False because the hash |
|---|
| 492 | values differ while "thing in dict.keys()" returns True because that |
|---|
| 493 | does not use the hash value, only equality checks). |
|---|
| 494 | - If the hash value changes after the object was put in a dict you get |
|---|
| 495 | weird behaviour too ("s=set([thing]); thing.change_hash();thing in s" |
|---|
| 496 | is False, but "thing in list(s)" is True). So if your objects are |
|---|
| 497 | mutable they can usually provide __eq__/__cmp__ but not __hash__. |
|---|
| 498 | - Not having many hash "collisions" (same hash value for objects that |
|---|
| 499 | compare nonequal) is good, but collisions are not illegal. Too many |
|---|
| 500 | of them just slow down dict/set operations (in a worst case scenario |
|---|
| 501 | of the same hash for every object dict/set operations become linear |
|---|
| 502 | searches through the single hash bucket everything ends up in). |
|---|
| 503 | - If you use the hash value directly keep in mind that collisions are |
|---|
| 504 | legal. Do not use comparisons of hash values as a substitute for |
|---|
| 505 | comparing objects (implementing __eq__ / __cmp__). Probably the only |
|---|
| 506 | legitimate use of hash() is to determine an object's hash value |
|---|
| 507 | based on things used for comparison. |
|---|
| 508 | |
|---|
| 509 | |
|---|
| 510 | __eq__ and __ne__ |
|---|
| 511 | ================= |
|---|
| 512 | |
|---|
| 513 | From http://docs.python.org/ref/customization.html: |
|---|
| 514 | |
|---|
| 515 | There are no implied relationships among the comparison operators. |
|---|
| 516 | The truth of x==y does not imply that x!=y is false. Accordingly, |
|---|
| 517 | when defining __eq__(), one should also define __ne__() so that the |
|---|
| 518 | operators will behave as expected. |
|---|
| 519 | |
|---|
| 520 | They really mean that. If you define __eq__ but not __ne__ doing "!=" |
|---|
| 521 | on instances compares them by identity. This is surprisingly easy to |
|---|
| 522 | miss, especially since the natural way to write unit tests for classes |
|---|
| 523 | with custom comparisons goes like this:: |
|---|
| 524 | |
|---|
| 525 | self.assertEqual(YourClass(1), YourClass(1)) |
|---|
| 526 | # Repeat for more possible values. Uses == and therefore __eq__, |
|---|
| 527 | # behaves as expected. |
|---|
| 528 | self.assertNotEqual(YourClass(1), YourClass(2)) |
|---|
| 529 | # Repeat for more possible values. Uses != and therefore object |
|---|
| 530 | # identity, so they all pass (all different instances)! |
|---|
| 531 | |
|---|
| 532 | So you end up only testing __eq__ on equal values (it can say |
|---|
| 533 | "identical" for different values without you noticing). |
|---|
| 534 | |
|---|
| 535 | Adding a __ne__ that just does "return not self == other" fixes this. |
|---|
| 536 | |
|---|
| 537 | |
|---|
| 538 | __eq__/__hash__ and subclassing |
|---|
| 539 | =============================== |
|---|
| 540 | |
|---|
| 541 | If your class has a custom __eq__ and it might be subclassed you have |
|---|
| 542 | to be very careful about how you "compare" to instances of a subclass. |
|---|
| 543 | Usually you will want to be "different" from those unconditionally:: |
|---|
| 544 | |
|---|
| 545 | def __eq__(self, other): |
|---|
| 546 | if self.__class is not YourClass or other.__class__ is not YourClass: |
|---|
| 547 | return False |
|---|
| 548 | # Your actual code goes here |
|---|
| 549 | |
|---|
| 550 | This might seem like overkill, but it is necessary to avoid problems if |
|---|
| 551 | you are subclassed and the subclass does not have a new __eq__. If you |
|---|
| 552 | just do an "isinstance(other, self.__class__)" check you will compare |
|---|
| 553 | equal to instances of a subclass, which is usually not what you want. |
|---|
| 554 | If you just check for "self.__class__ is other.__class__" then |
|---|
| 555 | subclasses that add a new attribute without overriding __eq__ will |
|---|
| 556 | compare equal when they should not (because the new attribute |
|---|
| 557 | differs). |
|---|
| 558 | |
|---|
| 559 | If you subclass something that has an __eq__ you should most likely |
|---|
| 560 | override it (you might get away with not doing so if the class does |
|---|
| 561 | not do the type check demonstrated above). If you add a new attribute |
|---|
| 562 | don't forget to override __hash__ too (that is not critical, but you |
|---|
| 563 | will have unnecessary hash collisions if you forget it). |
|---|
| 564 | |
|---|
| 565 | This is especially important for pkgcore because of |
|---|
| 566 | pkgcore.util.caching. If an instance of a class with a broken __eq__ |
|---|
| 567 | is used as argument for the __init__ of a class that uses |
|---|
| 568 | caching.WeakInstMeta it will cause a cached instance to be used when |
|---|
| 569 | it should not. Notice the class with the broken __eq__ does not have |
|---|
| 570 | to be cached itself to trigger this! Getting this wrong can cause fun |
|---|
| 571 | behaviour like atoms showing up in the list of fetchables because the |
|---|
| 572 | restrictions they're in compare equal independent of their "payload". |
|---|
| 573 | |
|---|
| 574 | |
|---|
| 575 | Exception subclassing |
|---|
| 576 | ===================== |
|---|
| 577 | |
|---|
| 578 | It is pretty common for an Exception subclass to want to customize the |
|---|
| 579 | return value of str() on an instance. The easiest way to do that is:: |
|---|
| 580 | |
|---|
| 581 | class MyException(Exception): |
|---|
| 582 | |
|---|
| 583 | """Describe when it is raised here.""" |
|---|
| 584 | |
|---|
| 585 | def __init__(self, stuff): |
|---|
| 586 | Exception.__init__(self, 'MyException because of %s' % (stuff,)) |
|---|
| 587 | |
|---|
| 588 | This is usually easier than defining a custom __str__ (since you do |
|---|
| 589 | not have to store the value of "stuff" as an attribute) and you should |
|---|
| 590 | be calling the base class __init__ anyway. |
|---|
| 591 | |
|---|
| 592 | (This does not mean you should never store things like "stuff" as |
|---|
| 593 | attrs: it can be very useful for code catching the exception to have |
|---|
| 594 | access to it. Use common sense.) |
|---|