A multiformat is a method of embedding metadata describing a value into the
value itself. Such values are called self-describing values because they
contain information about the value in-band. This contrasts with out-of-band
descriptions of data, such as pre-shared schema files or implicit agreements
about shared data within contexts. File formats with magic values can be
considered a type of self-describing format.
One place where a multiformat is useful is when sharing data hashes. Suppose I present you with the following hash of some data:
9ceb0f9889b786a37fd5e754af12f721aa3d8cb18276dd2616d031d2de53a4f2
Many things about this hash are ambiguous, especially its encoding and what type
of hash it is. Given that no single character exceeds f, it is likely this is a
hex-encoded string, but this is just a guess. Assuming this, then the 64
characters describe a 256-bit hash. But what 256-bit hash exactly? There are
many hashes which can produce a 256-bit output, such as SHA-256, SHA3-256,
RIPEMD-256 and more. Furthermore, you only think this string is a hash because I
told you; what if I was not online to tell you?
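To make the ambiguity concrete, here is a minimal Python sketch. It reads the string above both as hex and as base64 (both decode without complaint), then shows that several common algorithms produce digests of exactly the same length. The sample input and the choice of candidate algorithms are my own assumptions for illustration; nothing here identifies which algorithm actually produced the hash.

```python
import base64
import hashlib

mystery = "9ceb0f9889b786a37fd5e754af12f721aa3d8cb18276dd2616d031d2de53a4f2"

# Reading 1: hexadecimal, which yields 32 bytes (256 bits).
as_hex = bytes.fromhex(mystery)

# Reading 2: base64. Every character in the string is also a valid base64
# character, so this decodes cleanly too, just to a different length.
as_base64 = base64.b64decode(mystery, validate=True)

print(len(as_hex) * 8, "bits if hex;", len(as_base64) * 8, "bits if base64")

# Even if hex is the right guess, several algorithms emit 32-byte digests.
sample = b"hypothetical input"  # placeholder data, not the real preimage
candidates = {
    "sha2-256": hashlib.sha256(sample).hexdigest(),
    "sha3-256": hashlib.sha3_256(sample).hexdigest(),
    "blake2s-256": hashlib.blake2s(sample).hexdigest(),
}
for name, digest in candidates.items():
    print(f"{name:12} {len(digest)} hex characters")
```

All three candidates print 64 hex characters, so length alone cannot tell them apart.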
For long-running systems there are significant downsides to assuming encodings or algorithms will never change. Imagine that a protocol identifies data using MD5 hashes before it is discovered that MD5 is insecure. Such a protocol would face a crisis when needing to switch to a more secure algorithm. Given a more secure algorithm which also produces 128-bit hashes [1], its output would be impossible to distinguish from an MD5 hash without some hack.
Multiformat aims to eliminate this ambiguity. Furthermore, multiformat aims to make the encoded data forward-compatible, addressing the problem given in the above example. There are a few multiformat specifications out there; these are the most mature ones:
All of these specifications were created to address needs which arose during the
implementation of IPFS. Multihash was the first to be created and is present
in CIDv0. It describes the hashing algorithm which was used to create the CID.
When adding a new file to IPFS with CIDv0, the ipfs binary defaults to using
SHA2-256. When encoded, this visually becomes a Qm prefix, and explains why
many IPFS CIDs begin with Qm.
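The Qm prefix can be reproduced with a short Python sketch. A SHA2-256 multihash is the byte 0x12 (the code for sha2-256), the byte 0x20 (a digest length of 32), and then the digest; CIDv0 is that byte string in base58-btc. The helper names below are my own, and the payload is hashed directly for simplicity. Real CIDv0s for files hash a dag-pb node wrapping the file rather than the raw bytes, so this will not reproduce the CID that the ipfs binary prints for the same content.

```python
import hashlib

# The base58-btc alphabet (no 0, O, I or l).
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58btc_encode(data: bytes) -> str:
    """Encode bytes as base58-btc, preserving leading zero bytes as '1'."""
    number = int.from_bytes(data, "big")
    encoded = ""
    while number:
        number, remainder = divmod(number, 58)
        encoded = B58_ALPHABET[remainder] + encoded
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


def sha2_256_multihash(payload: bytes) -> bytes:
    """Prefix a SHA2-256 digest with its multihash code and length."""
    digest = hashlib.sha256(payload).digest()
    return bytes([0x12, len(digest)]) + digest


# Hypothetical payloads; any input gives the same prefix.
for payload in (b"hello", b"world", b""):
    print(base58btc_encode(sha2_256_multihash(payload)))
```

Every line printed is 46 characters long and begins with Qm, because the first two bytes are always 0x12 0x20 regardless of the digest that follows.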
CIDv1 added multibase and multicodec to CIDs. Where previously the base was
assumed to always be base58-btc and the codec to always be dag-pb, the base and
codec are now explicitly encoded into the CID.
These are not just relevant to IPFS but can be used on any platform which indexes data by its hash. Bluesky, for instance, uses CIDv1. I use the same CID format to index images on my HooYa platform.
Example
Consider this picture from my article on distributed hash tables.
This image can be represented by the identifier
bafkreie45mhzrcnxq2rx7vphksxrf5zbvi6yzmmco3osmfwqghjn4u5e6i
(CID explorer).
This CID indicates that it is base32 encoded ('b'), that the data was hashed
using sha2-256, and that the data at this hash should be treated as raw data.
The same data can be identified by the string
bafkr2qcpiobgk7f3jrowospr6rdeyhasc3lfyeuosms22pi35dhsp5af4pjusnz7l67xyd7vogirmr2dnnf75zbc2xjmkkyne72azxdctlgie
(CID explorer),
an identifier which is also base32 encoded ('b') and which also identifies raw
data, but which was made by hashing the data with the keccak-512 hashing
algorithm.
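Both identifiers can be taken apart by hand. The sketch below is a minimal decoder, assuming only the base32 multibase prefix 'b' and a tiny lookup table of the codes relevant to this example (0x12 for sha2-256, 0x1d for keccak-512, 0x55 for raw, 0x70 for dag-pb). The function names are my own; a real application would more likely use an existing CID library.

```python
import base64

# A small subset of the multicodec table, just enough for this example.
NAMES = {0x12: "sha2-256", 0x1D: "keccak-512", 0x55: "raw", 0x70: "dag-pb"}


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Read an unsigned LEB128 varint, returning (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def decode_cidv1(cid: str) -> dict:
    """Split a base32 CIDv1 into version, codec, hash type and digest."""
    if not cid.startswith("b"):
        raise ValueError("this sketch only handles the base32 prefix 'b'")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # b32decode wants explicit padding
    raw = base64.b32decode(body)

    version, pos = read_varint(raw, 0)
    codec, pos = read_varint(raw, pos)
    hash_code, pos = read_varint(raw, pos)
    digest_len, pos = read_varint(raw, pos)
    digest = raw[pos:]
    if version != 1 or len(digest) != digest_len:
        raise ValueError("not a well-formed CIDv1")
    return {
        "version": version,
        "codec": NAMES.get(codec, hex(codec)),
        "hash": NAMES.get(hash_code, hex(hash_code)),
        "digest": digest,
    }


sha2_cid = "bafkreie45mhzrcnxq2rx7vphksxrf5zbvi6yzmmco3osmfwqghjn4u5e6i"
keccak_cid = (
    "bafkr2qcpiobgk7f3jrowospr6rdeyhasc3lfyeuosms22pi35dhsp5af4pjusnz7l67xy"
    "d7vogirmr2dnnf75zbc2xjmkkyne72azxdctlgie"
)
for cid in (sha2_cid, keccak_cid):
    info = decode_cidv1(cid)
    print(info["codec"], info["hash"], info["digest"].hex())
```

The digest recovered from the first CID is the same hex string I presented at the start of this article, only now it arrives with its encoding, hash algorithm and content type attached.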
[1] It is unlikely a protocol would switch from MD5 to another 128-bit hash function, because 128 bits is far too few in the current day. The hash type could be deduced from its encoded length. But still!