A presentation from the SPb Python Interest Group community meetup.
The presentation covers dictionaries in Python: the implementation of the dictionary in CPython 2.x and CPython 3.x, and the recent changes in CPython 3.6. Dictionaries in the alternative Python implementations PyPy, IronPython and Jython are reviewed as well.
3. >>> d = {} # the same as d = dict()
>>> d['a'] = 123
>>> d['b'] = 345
>>> d['c'] = 678
>>> d
{'a': 123, 'c': 678, 'b': 345}
>>> d['b']
345
>>> del d['c']
>>> d
{'a': 123, 'b': 345}
4. Dictionary keys must be hashable
An object is hashable if it has a hash value which never changes during its lifetime
>>> d[list()] = 1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> d[set()] = 2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'
>>> d[dict()] = 3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
All of Python’s immutable built-in objects are hashable
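A quick way to check the rule above is to call hash() and see whether it raises. The sketch below does that for a few built-in values; the helper name is_hashable is hypothetical and exists only for this illustration.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds (hypothetical helper for illustration)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


# Immutable built-ins are hashable; mutable containers are not.
samples = [1, 2.5, 'text', (1, 2), frozenset([1]), [1], {1}, {'a': 1}]
results = [(type(s).__name__, is_hashable(s)) for s in samples]
print(results)

# A tuple is hashable only if all of its elements are hashable.
print(is_hashable((1, [2])))
```

Note the last case: a tuple is immutable itself, but it cannot be used as a key when it contains an unhashable element.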
5. import random
class A(object):
    def __init__(self, index):
        self.index = index
    def __eq__(self, other):
        return True
    def __hash__(self):
        return random.randint(0, 3)
    def __repr__(self):
        return 'A%d' % self.index
d = {A(0): 0, A(1): 1, A(2): 2}
print('keys: %s' % d.keys())
print('values: %s' % d.values())
for k in d:
    print('%s = %s' % (k, d.get(k, 'not found')))
Random hash is a bad idea
Run 1
keys: [A1, A2, A0]
values: [1, 2, 0]
A1 = 1
A2 = not found
A0 = 0
Run 2
keys: [A1, A0]
values: [2, 0]
A1 = not found
A0 = not found
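For contrast, a minimal sketch of a well-behaved key class: the hash is derived only from the fields compared in __eq__, so equal objects always hash equally and lookups are repeatable. The class name Point and its fields are hypothetical.

```python
class Point(object):
    """Hypothetical key class: equal objects always have equal hashes."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Derived only from the fields used in __eq__, so it stays stable
        # as long as those fields are not mutated.
        return hash((self.x, self.y))

    def __repr__(self):
        return 'Point(%d, %d)' % (self.x, self.y)


d = {Point(0, 0): 'origin', Point(1, 2): 'p'}
lookups = [d.get(Point(0, 0), 'not found'), d.get(Point(1, 2), 'not found')]
print(lookups)
```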
7. Three kinds of slots in the table:
1) Unused
2) Active
3) Dummy
typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;
- Hash table
- Open addressing collision resolution strategy
- Initial size = 8
- Load factor = 2/3
- Growth rate = 2 or 4 (depending on the number of cells used)
- “/Include/dictobject.h”, “/Objects/dictobject.c”, “/Objects/dictnotes.txt”
Dictionary in CPython >2.1
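The interplay of initial size, load factor and growth rate can be sketched with a small simulation. It only counts slots, it does not store anything. The exact resize condition, the rounding up to a power of two and the 50000-item threshold for switching the factor from 4 to 2 are assumptions recalled from the CPython 2.7 sources, not taken from the slide; simulate_growth is a hypothetical name.

```python
MINSIZE = 8  # initial table size from the slide


def simulate_growth(n_items, big_dict_threshold=50000):
    """Return the table sizes seen while inserting n_items distinct keys.

    Assumptions (not stated on the slide): the table grows when
    fill * 3 >= size * 2, the requested size is 4 * used (2 * used above
    big_dict_threshold), and it is rounded up to the next power of two.
    """
    size, used, sizes = MINSIZE, 0, [MINSIZE]
    for _ in range(n_items):
        used += 1  # no deletions here, so fill == used
        if used * 3 >= size * 2:
            factor = 2 if used > big_dict_threshold else 4
            wanted = factor * used
            size = MINSIZE
            while size <= wanted:
                size <<= 1
            sizes.append(size)
    return sizes


print(simulate_growth(100))
```

Under these assumptions, 100 insertions take the table through the sizes 8, 32, 128 and 512.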
8. ma_fill – number of non-NULL keys (sum of Active and Dummy)
ma_used – number of Active items
ma_mask – table size minus 1 (initially PyDict_MINSIZE - 1)
ma_lookup – lookup function (lookdict_string by default)
#define PyDict_MINSIZE 8
typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key,
                              long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};
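The difference between ma_fill and ma_used is easiest to see on a toy table. The sketch below is a hypothetical, simplified model: it uses linear probing instead of CPython's probing, never resizes, and assumes the table never becomes full. It only shows how Unused, Active and Dummy slots affect the two counters.

```python
UNUSED = None        # slot never held a key
DUMMY = object()     # slot held a key that was deleted


class TinyTable(object):
    """Fixed-size open-addressing table (hypothetical, linear probing, no resize)."""

    def __init__(self, size=8):
        self.mask = size - 1
        self.slots = [UNUSED] * size   # each slot: UNUSED, DUMMY or (key, value)
        self.fill = 0                  # Active + Dummy
        self.used = 0                  # Active only

    def _probe(self, key):
        i = hash(key) & self.mask
        while True:
            yield i
            i = (i + 1) & self.mask

    def insert(self, key, value):
        first_dummy = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is UNUSED:
                break
            if slot is DUMMY:
                if first_dummy is None:
                    first_dummy = i
            elif slot[0] == key:
                self.slots[i] = (key, value)   # overwrite an Active slot
                return
        if first_dummy is not None:
            i = first_dummy                    # reuse a Dummy slot: fill unchanged
        else:
            self.fill += 1                     # consume an Unused slot
        self.slots[i] = (key, value)
        self.used += 1

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is UNUSED:
                raise KeyError(key)
            if slot is not DUMMY and slot[0] == key:
                self.slots[i] = DUMMY          # keep the probe chain intact
                self.used -= 1
                return


t = TinyTable()
for k in ('a', 'b', 'c'):
    t.insert(k, 1)
t.delete('b')
print('fill=%d used=%d' % (t.fill, t.used))
```

After the deletion the table reports fill=3 and used=2: the Dummy slot still counts towards fill, which is why deletions alone never lower the load of the table.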
9. Good hash functions are needed
>>> map(hash, [0, 1, 2, 3, 4])
[0, 1, 2, 3, 4]
>>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
[1540938117, 1540938118, 1540938119, 1540938112, 1540938113]
Modified FNV (Fowler–Noll–Vo) hash function for strings
“-R” option – turns on hash randomization, so that the __hash__() values of str,
bytes and datetime objects are “salted” with an unpredictable random value
>>> map(hash, ['abca', 'abcb', 'abcc', 'abcd', 'abce'])
[-218138032, -218138029, -218138030, -218138027, -218138028]
Hash functions
10. Collision resolution
A collision occurs when two distinct keys have the same hash value
or map to the same slot of the table.
Probing is a scheme for resolving collisions in a hash table: when the slot
for a key is already taken, further slots are examined in a fixed sequence
until the key or an unused slot is found.
CPython uses pseudo-random probing
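PERTURB_SHIFT = 5
perturb = hash(key)  # treated as unsigned in the C code
j = perturb          # the first probe is at hash(key) % 2**i
while True:
    j = (5 * j) + 1 + perturb
    perturb >>= PERTURB_SHIFT
    index = j % 2**i  # next slot to probe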
See “/Objects/dictobject.c”
CPython <2.2 used polynomial-based index computation
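A runnable sketch of the recurrence above, collecting the first few probed indexes instead of looping forever. The function name probe_sequence and the 64-bit masking of the hash (to emulate the unsigned value used in C) are assumptions made for this illustration.

```python
PERTURB_SHIFT = 5


def probe_sequence(hash_value, i, count):
    """Return the first `count` slot indexes probed in a table of 2**i slots."""
    mask = 2 ** i - 1
    # Emulate an unsigned 64-bit value (assumption), so that shifting
    # eventually reaches 0 even for negative hashes.
    perturb = hash_value & (2 ** 64 - 1)
    j = perturb
    indexes = [j & mask]
    while len(indexes) < count:
        j = (5 * j) + 1 + perturb
        perturb >>= PERTURB_SHIFT
        indexes.append(j & mask)
    return indexes


print(probe_sequence(123, 3, 10))
```

For hash 123 and 8 slots this gives 3, 3, 3, 0, 1, 6, 7, 4, 5, 2. While perturb is non-zero the same slot can be probed again; once perturb reaches 0 the recurrence j = 5*j + 1 visits every slot of a power-of-two table.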
11. >>> PyDict_MINSIZE = 8
>>> key = 123
>>> hash(key) % PyDict_MINSIZE
3
Index computing
>>> mask = PyDict_MINSIZE - 1
>>> hash(key) & mask
3
Use bitwise "AND" with the mask instead of the modulo operation
Get the least significant bits of the hash:
2 ** i = PyDict_MINSIZE, hence i = 3, i.e. the three least significant bits are enough
hash(123) = 123 = 0b1111011
mask = PyDict_MINSIZE - 1 = 8 - 1 = 7 = 0b111
index = hash(123) & mask = 0b1111011 & 0b111 = 0b011 = 3
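The equivalence of the mask and the modulo holds only because the table size is a power of two. A short check, with arbitrarily chosen sample hashes:

```python
PyDict_MINSIZE = 8
mask = PyDict_MINSIZE - 1

# For a power-of-two table size, `h & mask` equals `h % size`,
# including for negative hashes (Python's % result is non-negative here).
hashes = [123, 0, 7, 8, -1, -123, 2 ** 40 + 5]
pairs = [(h & mask, h % PyDict_MINSIZE) for h in hashes]
print(pairs)

# The identity does not hold when the size is not a power of two.
size = 10
mismatch = [h for h in hashes if (h & (size - 1)) != (h % size)]
print(mismatch)
```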
26. >>> d = {'a': 1}
>>> for i in d:
...     d['new item'] = 123
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: dictionary changed size during iteration
Adding item during iteration
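A common way to avoid this error is to iterate over a snapshot of the keys, so the dictionary itself is free to grow. A minimal sketch:

```python
d = {'a': 1, 'b': 2}

# list(d) copies the keys first, so the loop does not iterate over
# the dictionary while it is being resized.
for key in list(d):
    d[key + '_copy'] = d[key]

print(sorted(d))
```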
35. Dictionary in CPython 3.5
- PEP 412 - Key-Sharing Dictionary
- The DictObject can be in one of two forms: combined table or split table
- Initial size = 4 (split table) or 8 (combined table)
- Maximum dictionary load = (2*n+1)/3
- Growth rate = used*2 + capacity/2
- “/Objects/dict-common.h”, “/Include/dictobject.h”, “/Objects/dictobject.c”,
“/Objects/dictnotes.txt”
typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;
struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size;
    dict_lookup_func dk_lookup;
    Py_ssize_t dk_usable;
    PyDictKeyEntry dk_entries[1];
};
typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used;
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;
36. Combined table vs split table
Combined table
- For explicit dictionaries (dict() and {})
- ma_values = NULL, dk_refcnt = 1
- Never becomes a split-table dictionary
Split table
- For attribute dictionaries (the __dict__ attribute of an object)
- ma_values != NULL, dk_refcnt >= 1
- Only string (unicode) keys are allowed
- Values are stored in the ma_values array
- When a split dictionary is resized, it is converted to a combined table (but if
the resize is the result of storing an instance attribute, and there is only one
instance of the class, the dictionary is re-split immediately)
- Lookup function = lookdict_split
37. Dictionary in CPython 3.5
A new kind of slot:
1) Unused
2) Active
3) Dummy
4) Pending (me_key != NULL, me_key != dummy and me_value == NULL)
typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;
40. class A():
    def __init__(self):
        self.a = 1
        self.b = 2
        self.c = 3
a = A()
print(a.__dict__.__sizeof__()) # 72
b = A()
setattr(a, 'd', 4) # no re-split because of b
print(a.__dict__.__sizeof__()) # 456
Split table
Split table is converted to a combined table
41. Key differences between this implementation and CPython 2.x:
- The table can be split into two parts – the keys and the values
- A new kind of slot
- No more ma_smalltable embedded in the dict
- General dictionaries are slightly larger
- All object dictionaries of a single class can share a single key-table, saving
about 60% memory for such cases (according to
https://github.com/python/cpython/blob/3.5/Objects/dictnotes.txt)
Bugs still happen: Unbounded memory growth resizing split-table dicts
(https://bugs.python.org/issue28147)
Summary
42. Hash functions in CPython 3.5
SipHash for strings and bytes (>= CPython 3.4)
- Resistant against hash flooding DoS attacks
- Successfully used in many other languages
PEP 456 – Secure and interchangeable hash algorithm
Slightly modified hash function for float
hash(float("+inf")) == 314159,
hash(float("-inf")) == -314159, was -271828
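The values above can be inspected at runtime through sys.hash_info. A short sketch, assuming Python 3.4 or later (where the algorithm field is available); the reported algorithm name depends on the interpreter version and build.

```python
import sys

# Name of the hash algorithm used for str and bytes in this build.
print(sys.hash_info.algorithm)

# Hash values of the float infinities.
print(hash(float('inf')), hash(float('-inf')))

# The value used for positive infinity is also exposed directly.
print(sys.hash_info.inf)
```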
43. OrderedDict in CPython 3.5
- Doubly-linked-list
- od_fast_nodes hash table that mirrors the od_dict table
- “/Include/odictobject.h”, “/Objects/odictobject.c”
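The linked list is what makes reordering operations cheap. A short usage sketch of the public OrderedDict API that relies on it (Python 3.2 or later for move_to_end):

```python
from collections import OrderedDict

od = OrderedDict([('a', 1), ('b', 2), ('c', 3)])
od.move_to_end('a')              # relink 'a' at the tail of the order
first = od.popitem(last=False)   # remove and return the head item
print(first, list(od))
```

After these two calls the removed item is ('b', 2) and the remaining order is c, a.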
45. Dictionary in PyPy
- Starting from PyPy 2.5.0 – ordereddict is used by default
- Initial size = 16
- Load factor up to 2/3
- Growth rate = 4 (up to 30000 items) or 2
- If a lot of items are deleted the compaction is performed
- “/rpython/rtyper/lltypesystem/rordereddict.py”
struct dicttable {
    int num_live_items;
    int num_ever_used_items;
    int resize_counter;
    variable_int *indexes; // byte, short, int, long
    dictentry *entries;
    ...
}
struct dictentry {
    PyObject *key;
    PyObject *value;
    long hash;
    bool valid;
}
47. PyDictionary in Jython
- Based on ConcurrentHashMap
- Separate chaining collision resolution
- Initial size = 16, load factor = 0.75, growth rate = 2
- Segments and thread safety
48. PythonDictionary in IronPython
- Based on Dictionary (.NET)
- Separate chaining collision resolution
- Initial size = 0, load factor = 1.0
- Rehashing if the number of collisions >= 100
- Growth rate = 2 (the new size is equal to the next higher prime number) from a set of
primes = {3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107,… , 4999559, 5999471, 7199369}
51. Dictionary in CPython 3.6
typedef struct {
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;
typedef struct {
    PyObject_HEAD
    Py_ssize_t ma_used; /* number of items in the dictionary */
    uint64_t ma_version_tag; /* unique, changes when dict modified */
    PyDictKeysObject *ma_keys;
    PyObject **ma_values;
} PyDictObject;
- ma_version_tag is added (PEP 509 – Add a private version to dict)
- Initial size = 8 (for split table too)
- Maximum dictionary load = (2*n)/3
- Contributed by INADA Naoki in https://bugs.python.org/issue27350
Four kinds of slots in the table:
1) Unused (index == DKIX_EMPTY == -1)
2) Active (index >= 0 , me_key != NULL and me_value != NULL)
3) Dummy (index == DKIX_DUMMY == -2, only for combined table)
4) Pending (index >= 0 , me_key != NULL and me_value == NULL, only for split table)
52. Dictionary in CPython 3.6
- Added dk_nentries and dk_indices
struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    Py_ssize_t dk_size; /* Size of the hash table (dk_indices) */
    dict_lookup_func dk_lookup; /* Function to lookup in dk_indices */
    Py_ssize_t dk_usable; /* Number of usable entries in dk_entries */
    Py_ssize_t dk_nentries; /* Number of used entries in dk_entries */
    union {
        int8_t as_1[8];
        int16_t as_2[4];
        int32_t as_4[2];
#if SIZEOF_VOID_P > 4
        int64_t as_8[1];
#endif
    } dk_indices;
    PyDictKeyEntry dk_entries[dk_usable]; /* using DK_ENTRIES macro */
};
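The idea behind dk_indices and dk_entries can be modelled in a few lines of Python: a sparse table of small integers points into a dense, append-only list of entries. The class below is a hypothetical, simplified model (linear probing, no deletion, no resizing, table assumed never full), not the CPython algorithm.

```python
DKIX_EMPTY = -1


class CompactDict(object):
    """Hypothetical, simplified model: no deletion, no resizing, linear probing."""

    def __init__(self, size=8):
        self.indices = [DKIX_EMPTY] * size   # sparse hash table of small ints
        self.entries = []                    # dense list of (hash, key, value)

    def _slot(self, key, h):
        # Return the position in `indices` that holds the key or is empty.
        mask = len(self.indices) - 1
        i = h & mask
        while True:
            ix = self.indices[i]
            if ix == DKIX_EMPTY or self.entries[ix][1] == key:
                return i
            i = (i + 1) & mask

    def __setitem__(self, key, value):
        h = hash(key)
        i = self._slot(key, h)
        ix = self.indices[i]
        if ix == DKIX_EMPTY:
            self.indices[i] = len(self.entries)   # point at the new entry
            self.entries.append((h, key, value))
        else:
            self.entries[ix] = (h, key, value)    # update in place, order kept

    def __getitem__(self, key):
        ix = self.indices[self._slot(key, hash(key))]
        if ix == DKIX_EMPTY:
            raise KeyError(key)
        return self.entries[ix][2]

    def keys(self):
        return [key for _, key, _ in self.entries]


cd = CompactDict()
for k in ('z', 'a', 'm'):
    cd[k] = k.upper()
cd['a'] = 'updated'
print(cd.keys(), cd['a'])
```

Iterating over the dense entries list yields keys in insertion order, and the sparse part stores only small integers, which is where the memory saving comes from.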
54. Key differences between this implementation and CPython 3.5:
- Compact and ordered
- Added dk_indices with type, depending on the size of dictionary
- Added ma_version_tag (PEP 509)
- Initial size for split table is changed to 8
- Maximum dictionary load changed to (2*n)/3
- Deleting an item converts the dict to a combined table
- Preserving the order of **kwargs in a function (PEP 468) is implemented
- Preserving Class Attribute Definition Order (PEP 520) is implemented
- The memory usage of the new dict() is between 20% and 25% smaller compared
to Python 3.5 (https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes)
Summary
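The ordering effects listed above can be observed directly. A short sketch, assuming CPython 3.6 or later; the names collect and Config are hypothetical.

```python
def collect(**kwargs):
    return list(kwargs)          # PEP 468: keyword order is preserved


class Config:
    host = 'localhost'
    port = 8080
    debug = False                # PEP 520: definition order is preserved


d = {}
for key in ('z', 'a', 'm'):
    d[key] = None

print(list(d))
print(collect(z=1, a=2, m=3))
print([name for name in vars(Config) if not name.startswith('__')])
```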
55. References
1. The implementation of a dictionary in Python 2.7 https://habrahabr.ru/post/247843/
2. Python hash calculation algorithms http://delimitry.blogspot.com/2014/07/python-hash-calculation-algorithms.html
3. PEP 412 - Key-Sharing Dictionary https://www.python.org/dev/peps/pep-0412/
4. PEP 456 - Secure and interchangeable hash algorithm https://www.python.org/dev/peps/pep-0456/
5. Mirror of the CPython repository https://github.com/python/cpython/
6. Faster, more memory efficient and more ordered dictionaries on PyPy https://morepypy.blogspot.com/2015/01/faster-more-memory-efficient-and-more.html
7. PyDictionary (Jython API documentation) http://www.jython.org/javadoc/org/python/core/PyDictionary.html
8. Jython repository https://bitbucket.org/jython/jython
9. Java theory and practice: Building a better HashMap http://www.ibm.com/developerworks/library/j-jtp08223/
10. Back to basics: Dictionary part 2, .NET implementation https://blog.markvincze.com/back-to-basics-dictionary-part-2-net-implementation/
11. http://referencesource.microsoft.com/mscorlib/system/collections/generic/dictionary.cs.html
12. https://github.com/IronLanguages/main/blob/ipy-2.7-maint/Languages/IronPython/IronPython/
13. https://bitbucket.org/pypy/pypy/
14. https://twitter.com/raymondh
15. PEP 509 - Add a private version to dict https://www.python.org/dev/peps/pep-0509/
16. Compact and ordered dict http://bugs.python.org/issue27350
17. What’s New In Python 3.6 https://docs.python.org/3.6/whatsnew/3.6.html
18. PEP 468 - Preserving the order of **kwargs in a function https://www.python.org/dev/peps/pep-0468/
19. PEP 520 - Preserving Class Attribute Definition Order https://www.python.org/dev/peps/pep-0520/
20. https://en.wikipedia.org/
Images from:
http://www.rcreptiles.com/blog/index.php/2008/06/28/read_the_operating_manual_first
http://kiwigamer450.deviantart.com/art/Back-to-The-Past-Logo-567858767
http://beyondplm.com/wp-content/uploads/2014/04/time-paradox-past-future-present.jpg
http://itband.ru/wp-content/uploads/2014/10/Future.jpg
https://en.wikipedia.org/wiki/Hash_table