Content-addressable storage

From Wikipedia, the free encyclopedia

Content-addressable storage, also referred to as associative storage or abbreviated CAS, is a mechanism for storing information that can be retrieved based on its content, not its storage location. It is typically used for high-speed storage and retrieval of static content, such as documents stored for compliance with government regulations. Roughly speaking, content-addressable storage is the permanent-storage analogue to content-addressable memory.

1 Content-addressed vs. Location-addressable
2 Pros and Cons
3 Typical Implementation
4 Open Source Implementation
5 References
6 See also
7 External links

Content-addressed vs. Location-addressable

When being contrasted with content-addressed storage, a typical local or networked storage device is referred to as location-addressable. In a location-addressable storage device, each element of data is stored onto the physical medium, and its location recorded for later use. The storage device often keeps a list, or directory, of these locations. When a future request is made for a particular item, the request includes only the location (for example, path and file names) of the data. The storage device can then use this information to locate the data on the physical medium, and retrieve it. When new information is written into a location-addressed device, it is simply stored in some available free space, without regard to its content. The information at a given location can usually be altered or completely overwritten without any special action on the part of the storage device.

In contrast, when information is stored into a CAS system, the system will record a content address, which is an identifier based solely on the information itself. A request to retrieve information from a CAS system must provide the content identifier, from which the system can determine the physical location of the data and retrieve it. Because the identifiers are based on content, any change to a data element will necessarily change its content address. In some cases, a CAS device will not permit editing or deleting of information once it has been stored.
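A minimal sketch of how a content address might be derived from the data alone. It assumes SHA-256 as the hash function, which the text does not specify, and `content_address` is an illustrative name rather than the API of any particular CAS product.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return an identifier computed only from the bytes being stored."""
    # SHA-256 is an assumption; a real CAS device may use a different function.
    return hashlib.sha256(data).hexdigest()


original = b"Quarterly report, final version."
edited = b"Quarterly report, final version!"  # a one-character change

addr_original = content_address(original)
addr_edited = content_address(edited)

# The same bytes always yield the same address; changed bytes yield a new one.
print(addr_original == content_address(original))
print(addr_original == addr_edited)
```

Because the identifier depends on nothing but the content, it carries no path, file name, or device location, which is the key difference from a location-addressable request.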

While the idea of content-addressed storage is not new, production-quality systems were not readily available until roughly 2003^[1]. In mid-2004, the industry group SNIA began working with a number of CAS providers to create standard behavior and interoperability guidelines for CAS systems^[2].

Pros and Cons

CAS works most efficiently on data that does not change often. It is of particular interest to large organizations that must comply with document-retention laws, such as Sarbanes-Oxley. In these corporations a large volume of documents may be stored for as long as a decade, with no changes and infrequent access. CAS is designed to make searching for a given document's content very fast, and it provides an assurance that the retrieved document is identical to the one originally stored. (If the documents were different, their content addresses would differ.) In addition, since data is stored in a CAS system according to what it contains, more than one copy of an identical document never exists in storage. By definition, two identical documents have the same content address, and so point to the same storage location.
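The single-copy property can be sketched with a toy in-memory store. The class name `DedupStore`, its methods, and the use of SHA-256 are assumptions made for illustration only.

```python
import hashlib


class DedupStore:
    """Toy in-memory CAS that keeps one stored copy per distinct content."""

    def __init__(self):
        self._blobs = {}  # content address -> bytes

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()  # hash choice is an assumption
        # Identical content maps to the same key, so no second copy is stored.
        self._blobs.setdefault(address, data)
        return address

    def blob_count(self) -> int:
        return len(self._blobs)


store = DedupStore()
first = store.put(b"retention policy v1")
second = store.put(b"retention policy v1")  # the same document submitted again
third = store.put(b"retention policy v2")   # a different document

print(first == second, store.blob_count())
```

Three submissions leave two stored blobs, because the first two resolve to the same address.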

For data that changes frequently, CAS is not as efficient as location-based addressing. In these cases, the CAS device would need to continually recompute the address of data as it was changed, and the client systems would be forced to continually refresh their idea of where a given document exists. For random access systems, a CAS would also need to handle the possibility of two initially identical documents diverging, requiring a copy of one document to be created on demand.
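A rough sketch of the divergence case, assuming a hypothetical client-side table `references` that maps a user to the content address they hold. Neither the table nor the `put` helper comes from a real CAS product.

```python
import hashlib

blobs = {}       # content address -> bytes
references = {}  # hypothetical client-side table: user -> content address


def put(data: bytes) -> str:
    address = hashlib.sha256(data).hexdigest()  # hash choice is an assumption
    blobs.setdefault(address, data)
    return address


# Two users store identical documents, so they share one blob and one address.
references["alice"] = put(b"shared draft")
references["bob"] = put(b"shared draft")
shared_before_edit = references["alice"] == references["bob"]

# Bob edits his document. The edit cannot happen in place: a new blob is
# written under a new address, and Bob's reference has to be refreshed.
references["bob"] = put(blobs[references["bob"]] + b", revised by Bob")

print(shared_before_edit, references["alice"] == references["bob"], len(blobs))
```

Every edit costs a fresh hash computation, a new blob, and an updated reference on the client, which is why frequently changing data suits location-based addressing better.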

Typical Implementation

The first commercially available CAS system, EMC's Centera platform^[3], is typical of a CAS implementation. The system consists of a series of networked nodes, divided between storage nodes and access nodes. The access nodes maintain a synchronized directory of content addresses and the corresponding storage node where each address can be found. When a new data element, or blob (binary large object), is added, the device calculates a hash of the content and returns this hash as the blob's content address.^[4] As mentioned above, the hash is looked up in the directory to verify that identical content is not already present. If the content already exists, the device does not need to perform any additional steps; the content address already points to the proper content. Otherwise, the data is passed off to a storage node and written to the physical media.
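The write path described above can be sketched as follows. This is a toy model, not Centera's actual design: the class `ToyCluster`, the SHA-256 hash, and the "least-filled node" placement policy are all assumptions.

```python
import hashlib


class ToyCluster:
    """Toy model of access-node directory plus storage nodes (illustrative only)."""

    def __init__(self, node_count: int = 3):
        # Each storage node is modelled as a dict: content address -> bytes.
        self.storage_nodes = [{} for _ in range(node_count)]
        # Directory kept by the access nodes: content address -> node index.
        self.directory = {}

    def write(self, blob: bytes) -> str:
        address = hashlib.sha256(blob).hexdigest()
        if address in self.directory:
            # Identical content is already stored; nothing more to do.
            return address
        # Placement policy is an assumption: pick the node holding the fewest blobs.
        node_id = min(range(len(self.storage_nodes)),
                      key=lambda i: len(self.storage_nodes[i]))
        self.storage_nodes[node_id][address] = blob
        self.directory[address] = node_id
        return address


cluster = ToyCluster()
a1 = cluster.write(b"invoice 0001")
a2 = cluster.write(b"invoice 0002")
a3 = cluster.write(b"invoice 0001")  # duplicate: only a directory lookup happens

print(a1 == a3, len(cluster.directory))
```

The duplicate write returns the existing address without touching any storage node.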

When a content address is provided to the device, it first queries the directory for the physical location of the specified content address. The information is then retrieved from a storage node, and the actual hash of the data recomputed and verified. Once this is complete, the device can supply the requested data to the client. Within the Centera system, each content address actually represents a number of distinct data blobs, as well as optional metadata. Whenever a client adds an additional blob to an existing content block, the system recomputes the content address.
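The read path, including the verification step, might look like the sketch below. The function `retrieve`, the exception `IntegrityError`, and the dict-based directory and storage nodes are hypothetical stand-ins, and SHA-256 is again an assumed hash.

```python
import hashlib


class IntegrityError(Exception):
    """Raised when stored bytes no longer match their content address."""


def retrieve(address, directory, storage_nodes):
    node_id = directory[address]            # directory lookup: where is it stored?
    data = storage_nodes[node_id][address]  # fetch the blob from that storage node
    # Recompute the hash and compare it with the requested address.
    if hashlib.sha256(data).hexdigest() != address:
        raise IntegrityError("stored data does not match address " + address)
    return data


payload = b"audit log 2004-06"
address = hashlib.sha256(payload).hexdigest()
storage_nodes = [{}, {address: payload}]  # the blob lives on node 1
directory = {address: 1}

intact = retrieve(address, directory, storage_nodes)

# Simulate silent corruption on the storage node, then read again.
storage_nodes[1][address] = b"audit log 2004-07"
try:
    retrieve(address, directory, storage_nodes)
    corruption_detected = False
except IntegrityError:
    corruption_detected = True

print(intact == payload, corruption_detected)
```

Recomputing the hash on every read is what lets the device detect altered or damaged data before handing it to the client.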

Another typical implementation is from iTernity. The iTernity concept is based on containers, each of which is addressed by its hash value. A container holds multiple fixed-content documents, so a container cannot be changed and its hash value is fixed once it has been written.
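One way to address a group of documents by a single hash is sketched below. It is a hypothetical scheme for illustration and does not describe the actual container format used by iTernity or Centera. The function `container_address` and the choice of hashing the ordered per-document SHA-256 digests are assumptions.

```python
import hashlib


def container_address(documents) -> str:
    """Hash the ordered per-document digests into one container address."""
    outer = hashlib.sha256()
    for doc in documents:
        outer.update(hashlib.sha256(doc).digest())
    return outer.hexdigest()


# A tuple models a sealed container: its members cannot be changed in place.
sealed = (b"contract.pdf bytes", b"invoice.pdf bytes")
sealed_address = container_address(sealed)

# Adding a document produces a different container with a different address.
extended_address = container_address(sealed + (b"memo.txt bytes",))

print(sealed_address == container_address(sealed))
print(sealed_address == extended_address)
```

Under this scheme the address stays fixed for as long as the container's contents do, and any added or altered document yields a new address.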