This is the first draft of the structural design behind the new backup solution called b4f.
Currently the main goal is to have a system for encrypted off-site backup allowing for incremental backups.
This will be a slightly technical post, but remember that anything, even the goals, can change.
The goals in more detail:
- Encrypted – the backup might be stored in less trusted locations
- Effective storage – incremental backup
- Redundancy – have the backup stored in multiple locations
- Traffic efficiency – allow storage servers to share data among themselves to minimize the traffic to the local storage
- Local storage servers as well as remote ones, friends' computers, removable disks or purchased networked storage
- Share data among users – once the backup is already on a friend's computer, sharing it will be much faster
- Data instead of connection focused – the “protocol” should work offline without direct connection between server and client.
Server – Client design overview
The design will be separated into two parts. The local client and the storage server.
The client will handle the filesystem, determine what to back up, prepare the data to be backed up, keep the cryptographic keys, and decide how the backups should be distributed.
The server will not work with files as the client does. Instead it will work with smaller units of data, called chunks, that the client prepares. The main task for the server will be to receive and store these units and to distribute them when requested by the client.
The communication between the server and client is therefore only about chunks, not files, so the system could also be used to store kinds of data other than files.
Local file-system
This is partly specific to the client. Other implementations might function differently and still be able to work with the same server.
One or more folders are selected to be backed up. These will be indexed into a traditional file tree. Each file will be stored in a specific file structure. This tree, with all its file metadata, will be combined into a snapshot. The snapshot will itself be stored as a chunk, which is described later.
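As a rough illustration of the snapshot idea, the sketch below walks a folder and serializes the file tree with its metadata into one byte string, which could then be stored as a chunk like any other data. The JSON layout, the field names and the function name `build_snapshot` are assumptions made for this example; the draft does not define a snapshot format.

```python
import json
import os


def build_snapshot(root: str) -> bytes:
    """Index a folder into a flat, sorted list of file entries (hypothetical format)."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            entries.append({
                "path": os.path.relpath(path, root).replace(os.sep, "/"),
                "size": info.st_size,
                "mtime": info.st_mtime,
                "chunk_ids": [],  # to be filled in once the file has been chunked
            })
    # The serialized snapshot is plain bytes, so it can be chunked like file data.
    return json.dumps({"version": 1, "files": entries}, sort_keys=True).encode()
```

Sorting the directory and file names keeps the output stable between runs, so an unchanged tree yields an identical snapshot.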
The client can further have other features such as automatic monitoring and backup.
File storage
A file will have the most basic attributes such as name and change date.
The more specific part here is that the file will be split into several chunks. These will be the fundamental storage unit in the final backup system.
Thus the file structure only contains the IDs needed to find these chunks, not their data.
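A minimal sketch of this file structure, under some loud assumptions: chunks are cut at a fixed size, and a plain SHA-256 of the chunk stands in for the real chunk ID (which in the design is an encrypted hash). The names `FileRecord`, `CHUNK_SIZE` and `build_file_record` are made up for the example.

```python
import hashlib
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # bytes; unrealistically small, chosen only to keep the example readable


@dataclass
class FileRecord:
    name: str
    mtime: float
    chunk_ids: list = field(default_factory=list)  # references only, no data


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Cut the data into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def build_file_record(name: str, mtime: float, data: bytes):
    """Return the file record plus a chunk store keyed by (stand-in) chunk ID."""
    record = FileRecord(name, mtime)
    store = {}
    for piece in split_into_chunks(data):
        chunk_id = hashlib.sha256(piece).hexdigest()  # assumption: placeholder ID
        store[chunk_id] = piece  # identical pieces collapse into one entry
        record.chunk_ids.append(chunk_id)
    return record, store
```

The file can be rebuilt by looking up each ID in order, for example `b"".join(store[i] for i in record.chunk_ids)`. Identical pieces are stored once, which is what makes incremental backups cheap.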
Chunk
Each chunk is a piece of data from the file that is encrypted before being sent to the remote backup.
There are two parts identifying the data. The first is the ID, which is an encrypted hash of the cleartext data. The second is a hash of the encrypted data. Each chunk is also accompanied by an encrypted key that is used to decrypt the data.
The ID and decryption key come in pairs for each chunk. Every new pair represents a new local key change or a new user who is given access to the data.
The shared ID principle makes it possible for a single server to identify identical chunks of data that do not have to be transferred. One problem with the current design is that a malicious user M might upload false data under a specific chunk ID that corresponds to data another user A might later upload. When A tries to upload the correct data, the server will believe it already has the data and stop the upload. Thus the true data will not be backed up. So far there is no way for the server to verify the ID hash, since it does not have access to the cleartext data. Comments on this are very much appreciated.
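The chunk layout described above could look roughly like the sketch below. Everything cryptographic here is a stand-in: `toy_xor` is a placeholder cipher that is NOT secure, an HMAC keyed with the user's key replaces the "encrypted hash", and the user key is a symmetric secret rather than an asymmetric pair. The class and function names are hypothetical.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass, field


def toy_xor(key: bytes, data: bytes) -> bytes:
    """Placeholder cipher, NOT secure: XOR with a SHA-256 counter keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


@dataclass
class Chunk:
    payload: bytes       # the encrypted data
    payload_hash: str    # hash of the encrypted data, checkable by the server
    access: list = field(default_factory=list)  # (chunk ID, encrypted key) pairs


def add_access(chunk: Chunk, cleartext: bytes, data_key: bytes, user_key: bytes) -> None:
    """Append one ID/key pair, e.g. after a key change or for a new user."""
    chunk_id = hmac.new(user_key, cleartext, hashlib.sha256).hexdigest()
    chunk.access.append((chunk_id, toy_xor(user_key, data_key)))


def make_chunk(cleartext: bytes, user_key: bytes):
    data_key = secrets.token_bytes(32)  # random per-chunk key
    payload = toy_xor(data_key, cleartext)
    chunk = Chunk(payload, hashlib.sha256(payload).hexdigest())
    add_access(chunk, cleartext, data_key, user_key)
    return chunk, data_key


def open_chunk(chunk: Chunk, user_key: bytes, index: int = 0) -> bytes:
    """Decrypt the payload using the key from the pair at `index`."""
    _, wrapped_key = chunk.access[index]
    return toy_xor(toy_xor(user_key, wrapped_key), chunk.payload)
```

Giving a second user access only appends another pair. The encrypted payload and its hash stay the same, so nothing has to be uploaded again.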
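The problem can be reproduced with a few lines of code. The sketch below models a server that deduplicates purely on chunk ID. `NaiveServer` and its methods are hypothetical names, and the example only demonstrates the weakness; it does not propose a fix.

```python
class NaiveServer:
    """Stores chunks by ID and skips uploads for IDs it already holds."""

    def __init__(self):
        self.chunks = {}

    def has(self, chunk_id: str) -> bool:
        return chunk_id in self.chunks

    def upload(self, chunk_id: str, payload: bytes) -> bool:
        """Return True if the payload was stored, False if the upload was skipped."""
        if self.has(chunk_id):
            return False  # the server believes it already has this data
        self.chunks[chunk_id] = payload
        return True


server = NaiveServer()
# M claims the ID first, with bogus data.
stored_for_m = server.upload("id-of-some-data", b"garbage")
# A later uploads the real data under the same ID and is turned away.
stored_for_a = server.upload("id-of-some-data", b"the real data")
```

After these two calls the server still holds M's bogus payload under that ID, and A's data was never stored. The server can check the hash of the encrypted payload, but it cannot check that the ID matches the cleartext.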
Order
Since the design will work with both online and offline transports we introduce the order.
An order is a document signed by one user. The document states what chunks a specific server will store or release.
The figure is currently missing the target server to which the order applies.
Once a server receives an order it will start filling up with the chunks. They can come from the client computer or another server. Since the order verifies the data, there is no need to further verify where the data comes from. Therefore an order with its chunk payload may be transferred on a removable disk and later uploaded to the server from an insecure machine.
The main operation can be one of the following:
- Put – store data on the server
- Get – retrieve data from the server
- Update – add new keys and IDs to existing chunks
Some additional conditions, such as an expiry date, could be included in the order.
Finally, the order contains a list of chunk IDs that indicate which chunks the order affects.
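To make the order more concrete, here is one way it might be represented and signed. The field names, the JSON encoding and the helper names are assumptions for the example. An HMAC with a shared secret also stands in for the user's signature, because Python's standard library has no asymmetric signatures; a real order would presumably be signed with the user's private key.

```python
import hashlib
import hmac
import json


def _canonical(order: dict) -> bytes:
    """Serialize the order deterministically so both sides sign the same bytes."""
    return json.dumps(order, sort_keys=True, separators=(",", ":")).encode()


def sign_order(order: dict, key: bytes) -> dict:
    signature = hmac.new(key, _canonical(order), hashlib.sha256).hexdigest()
    return {"order": order, "signature": signature}


def verify_order(signed: dict, key: bytes) -> bool:
    expected = hmac.new(key, _canonical(signed["order"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])


# Hypothetical order layout: operation, issuing user, target server,
# optional expiry date, and the list of affected chunk IDs.
order = {
    "operation": "put",
    "user": "alice",
    "server": "server-1",
    "expires": None,
    "chunk_ids": ["id-1", "id-2"],
}
signed = sign_order(order, b"alice-secret")
```

Because the signature covers the whole document, a signed order can travel over any transport, including a removable disk, and still be checked by the server on arrival.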
Put
A put order authorizes the server to store the listed chunks. The server may get these from the user's local machine but could also retrieve them from other servers.
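A sketch of how a server might handle a put order, assuming the order's signature has already been verified. The class and method names are invented for the example. The server accepts a chunk only if a put order has listed its ID, and it does not care which source delivers it.

```python
class PutServer:
    def __init__(self):
        self.wanted = set()  # chunk IDs authorized by put orders but not yet received
        self.chunks = {}     # chunk ID -> encrypted payload

    def accept_put_order(self, chunk_ids) -> None:
        """Register the listed IDs; chunks the server already holds are skipped."""
        self.wanted.update(cid for cid in chunk_ids if cid not in self.chunks)

    def receive(self, chunk_id: str, payload: bytes, source: str) -> bool:
        """Store a chunk from any source, but only if an order asked for it."""
        if chunk_id not in self.wanted:
            return False
        self.chunks[chunk_id] = payload
        self.wanted.discard(chunk_id)
        return True


server = PutServer()
server.accept_put_order(["id-1", "id-2"])
from_client = server.receive("id-1", b"...", source="client")
from_peer = server.receive("id-2", b"...", source="other-server")
unrequested = server.receive("id-9", b"...", source="client")
```

The `source` argument is deliberately ignored when deciding whether to store a chunk, which mirrors the idea that the order, not the connection, carries the authorization.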
Get
A get order authorizes retrieval of data. It indicates that one user or server is allowed to retrieve the chunks from that server.
Although the private key is required to decrypt the data, it is still of interest to limit who can access it.
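The access check behind a get order could be sketched as follows, again assuming the order was verified beforehand. The names, the per-requester grant table and the optional expiry handling are assumptions for the example.

```python
class GetGate:
    def __init__(self, chunks: dict):
        self.chunks = chunks  # chunk ID -> encrypted payload
        self.grants = {}      # (requester, chunk ID) -> expiry timestamp or None

    def accept_get_order(self, requester: str, chunk_ids, expires=None) -> None:
        """Record that `requester` may retrieve the listed chunks."""
        for chunk_id in chunk_ids:
            self.grants[(requester, chunk_id)] = expires

    def fetch(self, requester: str, chunk_id: str, now: float):
        """Return the payload if a valid grant exists, otherwise None."""
        grant = (requester, chunk_id)
        if grant not in self.grants:
            return None
        expires = self.grants[grant]
        if expires is not None and now > expires:
            return None
        return self.chunks.get(chunk_id)


gate = GetGate({"id-1": b"encrypted bytes"})
gate.accept_get_order("bob", ["id-1"], expires=100.0)
```

With this setup, `gate.fetch("bob", "id-1", now=50.0)` returns the payload, while a request from another user, or one made after the expiry time, returns `None`.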
Update
This order adds more key and ID pairs, from new users, to existing chunks.
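One possible server-side model for the update operation is sketched below. It assumes the server files each chunk under the hash of its encrypted payload and keeps an index from every chunk ID to that record. It also assumes an update order names an already known ID to locate the chunk. None of this is specified in the draft, and all names are hypothetical.

```python
import hashlib


class ChunkStore:
    def __init__(self):
        self.records = {}   # payload hash -> {"payload": bytes, "access": [(id, key)]}
        self.id_index = {}  # chunk ID -> payload hash

    def put(self, payload: bytes, chunk_id: str, wrapped_key: bytes) -> None:
        payload_hash = hashlib.sha256(payload).hexdigest()
        record = self.records.setdefault(
            payload_hash, {"payload": payload, "access": []}
        )
        record["access"].append((chunk_id, wrapped_key))
        self.id_index[chunk_id] = payload_hash

    def update(self, known_id: str, new_id: str, new_wrapped_key: bytes) -> bool:
        """Attach another ID/key pair to the chunk that `known_id` points to."""
        payload_hash = self.id_index.get(known_id)
        if payload_hash is None:
            return False  # nothing to update
        self.records[payload_hash]["access"].append((new_id, new_wrapped_key))
        self.id_index[new_id] = payload_hash
        return True

    def get(self, chunk_id: str):
        """Return (payload, encrypted key) for the pair matching `chunk_id`."""
        payload_hash = self.id_index.get(chunk_id)
        if payload_hash is None:
            return None
        record = self.records[payload_hash]
        key = next(k for i, k in record["access"] if i == chunk_id)
        return record["payload"], key
```

An update never touches the stored payload. It only extends the list of pairs, so a new user can find and decrypt a chunk that was uploaded once.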
Key handling
We have not been specific in this first draft about what keys there are, but we still have a few thoughts.
The system will be set up so that a secret key for decryption can be stored elsewhere. This key will be the only one necessary to restore the backup from a new computer.
There could be a single pair of asymmetric keys for both encryption and hash signing, or there could be two separate pairs.
Similar protocols
The previous post, Inspiration for swarm storage, was the initial inspiration for this design. Some of the goals from that post have been met.
So far the server has been described as an active one, running specific software that verifies signatures and minimizes traffic. Another option is still to use an existing service such as an FTP account. This would then require FTP support in the client.
Furthermore, the snapshot might have similarities with the BitTorrent protocol and could thus be modified to use existing trackers to initiate communication between client and servers.






