Dictionaries and Setsπ
dict
type is a fundamental part of pythonβs implementation.- Highly optimized using hash tables, other built-in types based on hash-tables are
set
andfrozenset
offering richer APIs and operators.
Modern dict Syntaxπ
dict Comprehensions (dictcomp
)π
dial_codes = [(91, 'India'), (1, 'United States')]
country_dial = {country: code for code, country in dial_codes}
country_dial["India"] # 91
country_codes = {code: country.upper() for country, code in sorted(country_dial.items()) if code < 70}
Unpacking Mappingsπ
- We can apply
**
to more than one argument in function calls. This works when keys are all strings and unique across all arguments (NOTE: Duplicate keyword arguments are forbidden)
def dump(**kwargs):
return kwargs
# dump(**{'x':1, y = 2, **{'z':3}})
# {'x':1, 'y':2, 'z':3}
# example 2
# {'a': 0, **{'x': 1}, 'y': 2, **{'z': 3, 'x': 4}}
# {'a': 0, 'x': 4, 'y': 2, 'z': 3}
NOTE: in case of duplicate args are overwritten by later occurances.
Merge Mappings with |π
d1 = {'a':1, 'b':3}
d2 = {'a':2, 'b':4, 'c':6}
# notice overwritten keys by later set
d1|d2 # {'a': 2, 'b': 4, 'c': 6}
Pattern Matching with Mappingsπ
- Similar to sequence based pattern matching
def get_creators(record: dict)->list:
match record:
case {'type':'book', 'api':2, 'authors':[*names]}:
return names
case {'type':'book', 'api':1, 'author': name}:
return [name]
case {'type':'book'}:
raise ValueError(f"Invalid 'book' record:{record!r}")
case {'type': 'movie', 'director':name}:
return [name]
case _:
raise ValueError(f"Invalid Record: {record!r}")
- We can handle semi-structured data like JSON records
from collection import OrderedDict
b1 = dict(api=1, author'Temp Book', type='book', title='unknown')
get_creators(b1) # ['unknown']
b2 = OrderedDict(api=2, type='book', title='Mystery of Python', authors:'smk hmk vib'.split())
get_creators(b2) # ['smk', 'hmk', 'vib']
# notice how order doesn't matter, only keys need to be present correctly
Standard API of Mapping Typesπ
- To implement a custom mapping, its easier to exten
collections.UserDict
, or to wrap adict
by composition, instead of subclassing these ABCs. - Note : only limitation to this is Keys must be Hashable
What is Hashableπ
- An object is hashable if it has a hash code which never changes during its lifetime and can be compared to other objects.
- Implements :
__hash__()
and__eq__()
Common Mapping Methodsπ
Method | Description | dict | defaultdict | OrderedDict |
---|---|---|---|---|
d.clear() |
Remove all items | β | β | β |
d.__contains__(k) |
k in d | β | β | β |
d.copy() |
Shallow copy | β | β | β |
d.__copy__() |
Support for copy.copy(d) |
β | ||
d.default_factory |
Callable invoked by __missing__ to set missing values |
|||
d.__delitem__(k) |
del d[k] βremove item with key k |
β | β | β |
d.fromkeys(it, [initial]) |
New mapping from keys in iterable, with optional initial value (defaults to None) | β | β | β |
d.get(k, [default]) |
Get item with key k, return default or None if missing | β | β | β |
d.__getitem__(k) |
d[k] βget item with key k |
β | β | β |
d.items() |
Get view over itemsβ(key, value) pairs | β | β | β |
d.__iter__() |
Get iterator over keys | β | β | β |
d.keys() |
Get view over keys | β | β | β |
d.__len__() |
len(d) βnumber of items |
β | β | β |
d.__missing__(k) |
Called when __getitem__ cannot find the key |
|||
d.move_to_end(k, [last]) |
Move k first or last position (last is True by default) | |||
d.__or__(other) |
Support for d1 | d2 to create new dict merging d1 and d2 (Python β₯ 3.9) |
β | β | β |
d.__ior__(other) |
Support for d1 |= d2 to update d1 with d2 (Python β₯ 3.9) |
β | β | β |
d.pop(k, [default]) |
Remove and return value at k, or default or None if missing | β | β | β |
d.popitem() |
Remove and return the last inserted item as (key, value) | β | β | β |
d.__reversed__() |
Support for reverse(d) βreturns iterator for keys from last to first inserted |
β | ||
d.__ror__(other) |
Support for other | dd βreversed union operator (Python β₯ 3.9) |
β | β | β |
d.setdefault(k, [default]) |
If k in d, return d[k]; else set d[k] = default and return it | β | β | β |
d.__setitem__(k, v) |
d[k] = v βput v at k |
β | β | β |
d.update(m, [**kwargs]) |
Update d with items from mapping or iterable of (key, value) pairs | β | β | β |
d.values() |
Get view over values | β | β | β |
Inserting or Updating Mutable Valuesπ
- following pythonβs fail-fast policy,
dict
access withd[k]
raises an error whenk
doesnβt exist. Many folks know to used.get(k, default)
instead of handlingKeyError
. However to retrieve mutable value and update it, there is a better way.
# example from zen of python
# we calculate occurances of a word and store it in dict (k,v) : (string, List[int])
# skipped irrelevant code
occ = index.get(word, []) # search dict if the word exists or not
occ.append(location) # update the list
index[word] = occ # reassigned the dict key
# A better approach to this is using one line
index.setdefault(word, []).append(location) # get the key or set the default empty then append
# basically we improved the code
if key not in my_dict:
my_dict[key] = []
my_dict[key].append(new_value)
# to
my_dict.setdefault(key, []).append(new_value)
Automatic Handling of Missing Keysπ
- If we want our
dict
to handle missing keys and return some specific value then we can either usedefaultdict
or subclassdict
or any other mapping and populate__missing__
method.
# above code can be fixed now as
index = collections.defaultdict(list) # calls list() function for deafult value
# now code doesn't need to check for missing keys
- Another example :
StrKeyDict0
class that converts nonstring keys tostr
on lookup
class StrKeyDict0(dict):
def __missing__(self, key):
if isinstance(key, str): # if key is already a string, return KeyError (imp as prevents recursion)
raise KeyError(key)
return self[str(key)] # build key into string and search (prevent recursion here)
def get(self, key, default=None):
try:
return self[key] # --> calls `__getitem__` -> calls __missing__
except KeyError:
return default
def __contains__(self, key):
return key in self.keys() or str(key) in self.keys() # search unmodified key or its string representation, requires because vanilla contains doesn't falls back to __missing__
NOTE: its much better to subclass UserDict
than dict
. Above example is bad :)
Variation of Dictπ
collections.OrderedDictπ
- NOTE: built-in
dict
also keeps keys ordered, mostly its used due to backward compatibility. Some reasons for using it from python documentation - The equality operation for
OrderedDict
check for matching order popitem()
method forOrderedDict
has a different signature and accepts optional argument to specify which item is popped.move_to_end()
method to efficiently reposition element to endpointOrderedDict
can handle frequent reordering operation better thandict
. Suitable for tracking recent accesses (LRU Cache)
NOTE: dict was designed primarily for fast access, maintaing order is secondary while Ordered dict was designed for reordering operations.
collections.ChainMapπ
ChainMap
instance holds a list of mappings that can be searched as one. The lookup is performed on each input mapping in the order it appears in constructor call. returns as soon as key is found in one of those mappings.
d1 = dict(a=1, b= 3)
d2 = dict(a=2, b= 4, c=6)
from collection import ChainMap
chain = ChainMap(d1, d2)
chain['a'] # output 1
chain['a'] # output 6
- ChainMap instance doesnβt copy input mapping but holds references to them. Updates or insertions affect first input mapping only.
NOTE: Its mostly used for implementing interpreters for languages with nested scopes, where each mapping represents a scope context.
collections.Counterπ
A mapping that holds an integer for each key. Updating an existing key adds to its count. This is used for count instances of hashable objects or as a multiset/ Counter
implements +
and -
operators to combine tallies and provides useful methods such as most_common[n]
return most commonly updated items and their count.
ct = collections.Counter('abracadabra')
ct # Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
ct.update('aaaaazzz')
ct # Counter({'a': 10, 'z': 3, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
ct.most_common(3) # [('a', 10), ('z', 3), ('b', 2)]
shelve.Shelfπ
shelve
module in the standard library provides persistent storage for a mapping of string keys to Python Objects serialized in the pickle binary format.
The shelve.open
module-level fucntion returns a sheve.Shelf
instance - a simple key-value DBM Database backed by dbm
module, with following properties
shelve.Shelf
subclassesabc.MutableMapping
, so it provides the essential methods we expect of a mapping type.- In addition,
shelve.Shelf
provides a few other I/O management methods, likesync
andclose
. - A
Shelf
instance is a context manager, so you can use awith
block to make sure it is closed after use. - Keys and values are saved whenever a new value is assigned to a key.
- The keys must be strings.
- The values must be objects that the
pickle
module can serialize.
Subclassing UserDict instead of dictπ
- built-in has implementation shortcuts that end up forcing us to override methods that we can just inherit from
UserDict
with no problems. - NOTE:
UserDict
doesnβt inherit fromdict
but uses composition : internaldict
instance calleddata
which holds the actual items. This avoids undesired recursion when coding special methods like__setitem__
and simplifies the coding of__contains__
import collections
class StrKeyDict(collections.UserDict):
def __missing__(self, key):
if isinstance(key, str):
raise KeyError(key)
return self[str(key)]
def __contains__(self, key):
return str(key) in self.data
def __setitem__(self, key, item):
self.data[str(key)] = item
Immutable Mappingsπ
- All standard library mappings are mutable, but we may need to prevent users from changing mapping by accident.
- Use cases in hardware programming library (Pingo) the board.pins maps represent physical GPIO ings on device, requires that user donβt change them
types
module provides a wrapper class calledMappingProxyType
which, given a mapping returns amappingproxy
instance that is read only butdynamic proxy
for the original mapping. Meaning we can update dynamicproxy and see changes in mapping proxy but we canβt update mapping proxy.
from types import MappingProxyType
d = {1: 'A'}
d_proxy = MappingProxyType(d) # mappingproxy({1: 'A'})
d_proxy[1] # output : 'A'
d_proxy[2] = 'x' # doesn't support assignment error
d[2] = 'B'
d_proxy # mappingproxy({1: 'A', 2: 'B'})
d_proxy[2] # output : 'B'
Dictionary Viewsπ
The dict
instance methods .keys()
, .values()
and .items()
return instances of classes called dict_keys
, dict_values
and dict_items
. These dictionary views are read-only projection of internal data structures used in dict
implementation.
- Avoids memory overhead or equivalent python2 methods that returned lists duplicating data already in the target
dict
and they also replace old methods that returned iterators.
d = dict(a=10, b=20, c=30)
values = d.values()
values # dict_values([10, 20, 30]) 1
len(values) # 3
list(values) # convert into list, as views are iterable [10, 20, 30]
reversed(values) # <dict_reversevalueiterator object at 0x10e9e7310>
values[0] # we cannot update these views
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# TypeError: 'dict_values' object is not subscriptable
- A view object is a dynamic proxy. If source
dict
is updated you can immidediately see changes through an existing views.
Practical Consequences of How dict Worksπ
The hash table implementation of Pythonβs dict
is very efficient, but itβs important to understand the practical effects of this design:
- Keys must be hashable objects. They must implement proper
__hash__
and__eq__
methods. - Item access by key is very fast. A
dict
may have millions of keys, but Python can locate a key directly by computing the hash code of the key and deriving an index offset into the hash table, with the possible overhead of a small number of tries to find a matching entry. - Key ordering is preserved as a side effect of a more compact memory layout for
dict
in CPython 3.6, which became an official language feature in 3.7. - Despite its new compact layout, dicts inevitably have a significant memory overhead. The most compact internal data structure for a container would be an array of pointers to the items.8 Compared to that, a hash table needs to store more data per entry, and Python needs to keep at least one-third of the hash table rows empty to remain efficient.
- To save memory, avoid creating instance attributes outside of the
__init__
method.
Set Theoryπ
NOTE: set
is mutable while its sibling frozenset
is immutable
- A set is a collection of unique objects
- set elements must be hashable. The
set
type is not hashable, so you canβt build aset
with nestedset
. Butfrozenset
is hashable so we can have it insideset
- set supports multiple infix operators like
a | b
uniona & b
intersectiona - b
differencea^b
symmetric difference
l = ['spam', 'spam', 'bacon', 'eggs', 'eggs']
set(l) # {'spam', 'bacon', 'eggs'}
list(set(l)) # remove duplicates from a list
# NOTE : unrelated
# remove duplicates but also preserve order of first occurence of each items, we can use plain dict
dict.fromkeys(l).keys()
# dict_keys(['spam','eggs', 'bacon'])
list(dict.fromkeys(l).keys())
Set Literalsπ
- NOTE: there is no representation for empty set : use
set()
because{}
already declares an empty dict - NOTE:
st = {1, 2, 3}
is much faster thanst = set([1, 2, 3])
. - There is no syntax to represent
frozenset
so we have to use constructor like thisfrozenset(range(10))
Set Comprehensionsπ
from unicodedata import name
# set of Latin-1 characters that have sign in their unicode names
{chr(i) for i in range(32, 256) if 'SIGN' in name(chr(i), '')}
Practical Consequences of How set Worksπ
set
and frozenset
are both implemented with a hash table with following effects.
- set elements must be hashable. They must implement proper
__hash__
and__eq__
methods. - Membership testing is very efficient
- sets have significant memory overhead compared to low-level array pointers to its elements
- element ordering depends on insertion order but not in a useful or reliable way.
- adding elements to a set ma change order of existing elements. Because algorithm becomes less efficient if the hash table is more than two-thirds full, so python may need to move and resize table as it grows.
Set Operations on dict Viewsπ
Method | frozenset | dict_keys | dict_items | Description |
---|---|---|---|---|
s.__and__(z) | β | β | β | s & z (intersection of s and z) |
s.__rand__(z) | β | β | β | Reversed & operator |
s.__contains__() | β | β | β | e in s |
s.copy() | β | Shallow copy of s | ||
s.difference(it, β¦) | β | Difference between s and iterables it, etc. | ||
s.intersection(it, β¦) | β | Intersection of s and iterables it, etc. | ||
s.isdisjoint(z) | β | β | β | s and z are disjoint (no elements in common) |
s.issubset(it) | β | s is a subset of iterable it | ||
s.issuperset(it) | β | s is a superset of iterable it | ||
s.__iter__() | β | β | β | Get iterator over s |
s.__len__() | β | β | β | len(s) |
s.__or__(z) | β | β | β | s |
s.__ror__() | β | β | β | Reversed |
s.__reversed__() | β | β | Get iterator over s in reverse order | |
s.__rsub__(z) | β | β | β | Reversed - operator |
s.__sub__(z) | β | β | β | s - z (difference between s and z) |
s.symmetric_difference(it) | β | Complement of s & set(it) | ||
s.union(it, β¦) | β | Union of s and iterables it, etc. | ||
s.__xor__() | β | β | β | s ^ z (symmetric difference of s and z) |
s.__rxor__() | β | β | β | Reversed ^ operator |