Skip to content

Dictionaries and SetsπŸ”—

  • dict type is a fundamental part of python’s implementation.
  • Highly optimized using hash tables, other built-in types based on hash-tables are set and frozenset offering richer APIs and operators.

Modern dict SyntaxπŸ”—

dict Comprehensions (dictcomp)πŸ”—

dial_codes = [(91, 'India'), (1, 'United States')]
country_dial = {country: code for code, country in dial_codes}
country_dial["India"] # 91
country_codes = {code: country.upper() for country, code in sorted(country_dial.items()) if code < 70}

Unpacking MappingsπŸ”—

  • We can apply ** to more than one argument in function calls. This works when keys are all strings and unique across all arguments (NOTE: Duplicate keyword arguments are forbidden)
def dump(**kwargs):
  return kwargs

# dump(**{'x':1, y = 2, **{'z':3}})
# {'x':1, 'y':2, 'z':3}

# example 2
# {'a': 0, **{'x': 1}, 'y': 2, **{'z': 3, 'x': 4}}
# {'a': 0, 'x': 4, 'y': 2, 'z': 3}

NOTE: in case of duplicate args are overwritten by later occurances.

Merge Mappings with |πŸ”—

d1 = {'a':1, 'b':3}
d2 = {'a':2, 'b':4, 'c':6}
# notice overwritten keys by later set
d1|d2   # {'a': 2, 'b': 4, 'c': 6}

Pattern Matching with MappingsπŸ”—

  • Similar to sequence based pattern matching
def get_creators(record: dict)->list:
  match record:
    case {'type':'book', 'api':2, 'authors':[*names]}:
      return names
    case {'type':'book', 'api':1, 'author': name}:
      return [name]
    case {'type':'book'}:
      raise ValueError(f"Invalid 'book' record:{record!r}")
    case {'type': 'movie', 'director':name}:
      return [name]
    case _:
      raise ValueError(f"Invalid Record: {record!r}")  
  • We can handle semi-structured data like JSON records
from collection import OrderedDict
b1 = dict(api=1, author'Temp Book', type='book', title='unknown')
get_creators(b1)    # ['unknown']
b2 = OrderedDict(api=2, type='book', title='Mystery of Python', authors:'smk hmk vib'.split())  
get_creators(b2) # ['smk', 'hmk', 'vib']

# notice how order doesn't matter, only keys need to be present correctly

Standard API of Mapping TypesπŸ”—

UML class diagram for Mapping and MutableMapping

  • To implement a custom mapping, its easier to exten collections.UserDict, or to wrap a dict by composition, instead of subclassing these ABCs.
  • Note : only limitation to this is Keys must be Hashable

What is HashableπŸ”—

  • An object is hashable if it has a hash code which never changes during its lifetime and can be compared to other objects.
  • Implements : __hash__() and __eq__()

Common Mapping MethodsπŸ”—

Method Description dict defaultdict OrderedDict
d.clear() Remove all items ● ● ●
d.__contains__(k) k in d ● ● ●
d.copy() Shallow copy ● ● ●
d.__copy__() Support for copy.copy(d) ●
d.default_factory Callable invoked by __missing__ to set missing values
d.__delitem__(k) del d[k]β€”remove item with key k ● ● ●
d.fromkeys(it, [initial]) New mapping from keys in iterable, with optional initial value (defaults to None) ● ● ●
d.get(k, [default]) Get item with key k, return default or None if missing ● ● ●
d.__getitem__(k) d[k]β€”get item with key k ● ● ●
d.items() Get view over itemsβ€”(key, value) pairs ● ● ●
d.__iter__() Get iterator over keys ● ● ●
d.keys() Get view over keys ● ● ●
d.__len__() len(d)β€”number of items ● ● ●
d.__missing__(k) Called when __getitem__ cannot find the key
d.move_to_end(k, [last]) Move k first or last position (last is True by default)
d.__or__(other) Support for d1 | d2 to create new dict merging d1 and d2 (Python β‰₯ 3.9) ● ● ●
d.__ior__(other) Support for d1 |= d2 to update d1 with d2 (Python β‰₯ 3.9) ● ● ●
d.pop(k, [default]) Remove and return value at k, or default or None if missing ● ● ●
d.popitem() Remove and return the last inserted item as (key, value) ● ● ●
d.__reversed__() Support for reverse(d)β€”returns iterator for keys from last to first inserted ●
d.__ror__(other) Support for other | ddβ€”reversed union operator (Python β‰₯ 3.9) ● ● ●
d.setdefault(k, [default]) If k in d, return d[k]; else set d[k] = default and return it ● ● ●
d.__setitem__(k, v) d[k] = vβ€”put v at k ● ● ●
d.update(m, [**kwargs]) Update d with items from mapping or iterable of (key, value) pairs ● ● ●
d.values() Get view over values ● ● ●

Inserting or Updating Mutable ValuesπŸ”—

  • following python’s fail-fast policy, dict access with d[k] raises an error when k doesn’t exist. Many folks know to use d.get(k, default) instead of handling KeyError. However to retrieve mutable value and update it, there is a better way.
# example from zen of python
# we calculate occurances of a word and store it in dict (k,v) : (string, List[int])
# skipped irrelevant code
occ = index.get(word, [])   # search dict if the word exists or not
occ.append(location)    # update the list
index[word] = occ       # reassigned the dict key

# A better approach to this is using one line
index.setdefault(word, []).append(location) # get the key or set the default empty then append

# basically we improved the code
if key not in my_dict:
  my_dict[key] = []
my_dict[key].append(new_value)
# to
my_dict.setdefault(key, []).append(new_value)

Automatic Handling of Missing KeysπŸ”—

  • If we want our dict to handle missing keys and return some specific value then we can either use defaultdict or subclass dict or any other mapping and populate __missing__ method.
# above code can be fixed now as
index = collections.defaultdict(list)   # calls list() function for deafult value
# now code doesn't need to check for missing keys
  • Another example : StrKeyDict0 class that converts nonstring keys to str on lookup
class StrKeyDict0(dict):
  def __missing__(self, key):
    if isinstance(key, str):    # if key is already a string, return KeyError (imp as prevents recursion)
      raise KeyError(key)
    return self[str(key)]           # build key into string and search (prevent recursion here)
  def get(self, key, default=None):
    try:
      return self[key]  # --> calls `__getitem__` -> calls __missing__
    except KeyError:
      return default
  def __contains__(self, key):
    return key in self.keys() or str(key) in self.keys()    # search unmodified key or its string representation, requires because vanilla contains doesn't falls back to __missing__

NOTE: its much better to subclass UserDict than dict. Above example is bad :)

Variation of DictπŸ”—

collections.OrderedDictπŸ”—

  • NOTE: built-in dict also keeps keys ordered, mostly its used due to backward compatibility. Some reasons for using it from python documentation
  • The equality operation for OrderedDict check for matching order
  • popitem() method for OrderedDict has a different signature and accepts optional argument to specify which item is popped.
  • move_to_end() method to efficiently reposition element to endpoint
  • OrderedDict can handle frequent reordering operation better than dict. Suitable for tracking recent accesses (LRU Cache)

NOTE: dict was designed primarily for fast access, maintaing order is secondary while Ordered dict was designed for reordering operations.

collections.ChainMapπŸ”—

ChainMap instance holds a list of mappings that can be searched as one. The lookup is performed on each input mapping in the order it appears in constructor call. returns as soon as key is found in one of those mappings.

d1 = dict(a=1, b= 3)
d2 = dict(a=2, b= 4, c=6)
from collection import ChainMap
chain = ChainMap(d1, d2)
chain['a']  # output 1
chain['a']  # output 6
  • ChainMap instance doesn’t copy input mapping but holds references to them. Updates or insertions affect first input mapping only.
chain['a'] = -1     # changes on d1

NOTE: Its mostly used for implementing interpreters for languages with nested scopes, where each mapping represents a scope context.

import builtins
pylookup = ChainMap(locals(), globals(), vars(builtins))

collections.CounterπŸ”—

A mapping that holds an integer for each key. Updating an existing key adds to its count. This is used for count instances of hashable objects or as a multiset/ Counter implements + and - operators to combine tallies and provides useful methods such as most_common[n] return most commonly updated items and their count.

ct = collections.Counter('abracadabra')
ct      # Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
ct.update('aaaaazzz')
ct      # Counter({'a': 10, 'z': 3, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
ct.most_common(3)   # [('a', 10), ('z', 3), ('b', 2)]

shelve.ShelfπŸ”—

shelve module in the standard library provides persistent storage for a mapping of string keys to Python Objects serialized in the pickle binary format.

The shelve.open module-level fucntion returns a sheve.Shelf instance - a simple key-value DBM Database backed by dbm module, with following properties

  • shelve.Shelf subclasses abc.MutableMapping, so it provides the essential methods we expect of a mapping type.
  • In addition, shelve.Shelf provides a few other I/O management methods, like sync and close.
  • A Shelf instance is a context manager, so you can use a with block to make sure it is closed after use.
  • Keys and values are saved whenever a new value is assigned to a key.
  • The keys must be strings.
  • The values must be objects that the pickle module can serialize.

Subclassing UserDict instead of dictπŸ”—

  • built-in has implementation shortcuts that end up forcing us to override methods that we can just inherit from UserDict with no problems.
  • NOTE: UserDict doesn’t inherit from dict but uses composition : internal dict instance called data which holds the actual items. This avoids undesired recursion when coding special methods like __setitem__ and simplifies the coding of __contains__
import collections
class StrKeyDict(collections.UserDict):
    def __missing__(self, key):
        if isinstance(key, str):
            raise KeyError(key)
        return self[str(key)]
    def __contains__(self, key):
        return str(key) in self.data

    def __setitem__(self, key, item):
        self.data[str(key)] = item

Immutable MappingsπŸ”—

  • All standard library mappings are mutable, but we may need to prevent users from changing mapping by accident.
  • Use cases in hardware programming library (Pingo) the board.pins maps represent physical GPIO ings on device, requires that user don’t change them
  • types module provides a wrapper class called MappingProxyType which, given a mapping returns a mappingproxy instance that is read only but dynamic proxy for the original mapping. Meaning we can update dynamicproxy and see changes in mapping proxy but we can’t update mapping proxy.
from types import MappingProxyType
d = {1: 'A'}
d_proxy = MappingProxyType(d)   # mappingproxy({1: 'A'})
d_proxy[1]  # output : 'A'
d_proxy[2] = 'x'    # doesn't support assignment error
d[2] = 'B'
d_proxy # mappingproxy({1: 'A', 2: 'B'})
d_proxy[2]  # output : 'B'

Dictionary ViewsπŸ”—

The dict instance methods .keys(), .values() and .items() return instances of classes called dict_keys, dict_values and dict_items. These dictionary views are read-only projection of internal data structures used in dict implementation.

  • Avoids memory overhead or equivalent python2 methods that returned lists duplicating data already in the target dict and they also replace old methods that returned iterators.
d = dict(a=10, b=20, c=30)
values = d.values()
values # dict_values([10, 20, 30])  1
len(values)  # 3
list(values) # convert into list, as views are iterable [10, 20, 30]
reversed(values) # <dict_reversevalueiterator object at 0x10e9e7310>
values[0]   # we cannot update these views
# Traceback (most recent call last):
  # File "<stdin>", line 1, in <module>
# TypeError: 'dict_values' object is not subscriptable
  • A view object is a dynamic proxy. If source dict is updated you can immidediately see changes through an existing views.

Practical Consequences of How dict WorksπŸ”—

The hash table implementation of Python’s dict is very efficient, but it’s important to understand the practical effects of this design:

  • Keys must be hashable objects. They must implement proper __hash__ and __eq__ methods.
  • Item access by key is very fast. A dict may have millions of keys, but Python can locate a key directly by computing the hash code of the key and deriving an index offset into the hash table, with the possible overhead of a small number of tries to find a matching entry.
  • Key ordering is preserved as a side effect of a more compact memory layout for dict in CPython 3.6, which became an official language feature in 3.7.
  • Despite its new compact layout, dicts inevitably have a significant memory overhead. The most compact internal data structure for a container would be an array of pointers to the items.8 Compared to that, a hash table needs to store more data per entry, and Python needs to keep at least one-third of the hash table rows empty to remain efficient.
  • To save memory, avoid creating instance attributes outside of the __init__ method.

Set TheoryπŸ”—

NOTE: set is mutable while its sibling frozenset is immutable

  • A set is a collection of unique objects
  • set elements must be hashable. The set type is not hashable, so you can’t build a set with nested set. But frozenset is hashable so we can have it inside set
  • set supports multiple infix operators like
    • a | b union
    • a & b intersection
    • a - b difference
    • a^b symmetric difference
l = ['spam', 'spam', 'bacon', 'eggs', 'eggs']
set(l) # {'spam', 'bacon', 'eggs'}
list(set(l))    # remove duplicates from a list
# NOTE : unrelated
# remove duplicates but also preserve order of first occurence of each items, we can use plain dict
dict.fromkeys(l).keys()
# dict_keys(['spam','eggs', 'bacon'])
list(dict.fromkeys(l).keys())
# count occurrences of needles in a haystack, both of type set
found = len(needles & haystack)

Set LiteralsπŸ”—

  • NOTE: there is no representation for empty set : use set() because {} already declares an empty dict
  • NOTE: st = {1, 2, 3} is much faster than st = set([1, 2, 3]).
  • There is no syntax to represent frozenset so we have to use constructor like this frozenset(range(10))

Set ComprehensionsπŸ”—

from unicodedata import name
# set of Latin-1 characters that have sign in their unicode names
{chr(i) for i in range(32, 256) if 'SIGN' in name(chr(i), '')}

Practical Consequences of How set WorksπŸ”—

set and frozenset are both implemented with a hash table with following effects.

  • set elements must be hashable. They must implement proper __hash__ and __eq__ methods.
  • Membership testing is very efficient
  • sets have significant memory overhead compared to low-level array pointers to its elements
  • element ordering depends on insertion order but not in a useful or reliable way.
  • adding elements to a set ma change order of existing elements. Because algorithm becomes less efficient if the hash table is more than two-thirds full, so python may need to move and resize table as it grows.

Set Operations on dict ViewsπŸ”—

Method frozenset dict_keys dict_items Description
s.__and__(z) ● ● ● s & z (intersection of s and z)
s.__rand__(z) ● ● ● Reversed & operator
s.__contains__() ● ● ● e in s
s.copy() ● Shallow copy of s
s.difference(it, …) ● Difference between s and iterables it, etc.
s.intersection(it, …) ● Intersection of s and iterables it, etc.
s.isdisjoint(z) ● ● ● s and z are disjoint (no elements in common)
s.issubset(it) ● s is a subset of iterable it
s.issuperset(it) ● s is a superset of iterable it
s.__iter__() ● ● ● Get iterator over s
s.__len__() ● ● ● len(s)
s.__or__(z) ● ● ● s
s.__ror__() ● ● ● Reversed
s.__reversed__() ● ● Get iterator over s in reverse order
s.__rsub__(z) ● ● ● Reversed - operator
s.__sub__(z) ● ● ● s - z (difference between s and z)
s.symmetric_difference(it) ● Complement of s & set(it)
s.union(it, …) ● Union of s and iterables it, etc.
s.__xor__() ● ● ● s ^ z (symmetric difference of s and z)
s.__rxor__() ● ● ● Reversed ^ operator