Legal and technical notes on telling web archives of domain ownership change

Web archives, libraries, and caches may have preserved content from your site, but they aren't tracking that it's your content. When your domain name passes to its next owner, current practices around robots.txt mean that owner can retroactively restrict access to your preserved content. 😕

Here's my personal work towards ameliorating this: a contract with a two-year crawlers clause, and attempts at a .well-known URI serving as a marker.

(This is not legal advice, I am not a lawyer, and I am not your lawyer. Contracts are legal devices; you should have counsel experienced in domain name transfers, IP rights, and the law of both your jurisdiction and the other party's.)

I was asked if I'd consider selling a domain name. Specifics matter, so:

Based on these, I felt it was safe to consider selling the domain, as long as any archived personal and professional content wouldn't be hidden if and when the new owners change their robots.txt to something more restrictive.

In asking around, no-one I knew had ever seen a "robots.txt clause" or "archival/preservation terms" in a domain name transfer agreement. Neither my attorney nor the buyer's counsel had, either, and both were unclear on why one was necessary. Isn't my content, my content? Isn't any archival between me and the archive? I explained the need like this:

From a practical perspective, there are technical steps [the buyer] could take that would inadvertently interfere with the content that used to be on the domain, now or well after the purchase.

Third parties, like Google, the Internet Archive, national libraries, etc., crawl the domain and cache or archive its content, and have since before I acquired the domain in 2008. They do this automatically, without an explicit relationship with me. It's relatively safe for me to sell the domain and lose my original URLs, knowing that someone with an outdated link can look it up in those historical references.

The issue is that web crawlers don't track domain or content ownership, they just look at the domain itself. If [the buyer] blocks all caches, archives, or search engines from the domain, it will also retroactively make my past content unavailable in those same sources.
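To make the mechanism concrete, here is a minimal sketch using Python's standard urllib.robotparser. It assumes an archive that applies whatever robots.txt it most recently fetched to every capture it holds for the domain, which is the practice described above. The user agent name, the URL, and both robots.txt bodies are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name; real archives each use their own user agent.
ARCHIVE_AGENT = "example-archive-bot"


def allowed(robots_txt: str, url: str, agent: str = ARCHIVE_AGENT) -> bool:
    """Return True if this robots.txt body lets the agent access the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Illustrative robots.txt bodies: the old owner allows everything,
# the new owner blocks everything.
old_owner_robots = "User-agent: *\nDisallow:\n"
new_owner_robots = "User-agent: *\nDisallow: /\n"

# A page that only ever existed under the old owner (hypothetical URL).
old_url = "https://example.com/2012/an-old-post"

print(allowed(old_owner_robots, old_url))  # True
print(allowed(new_owner_robots, old_url))  # False
```

Nothing in the second robots.txt says who wrote the page or when it was captured, so an archive consulting only the current file has no basis for treating the old owner's captures differently.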

My attorney's original proposed clause looked like this:

Transferee shall not block caches, archives, crawlers or search engine access of any kind from the domain to allow for proper archiving of the new and previous websites. In the event that Transferor notifies Transferee of a violation of this subsection, Transferor agrees to correct the violation within seventy-two (72) hours of receiving notice of the violation for a period of 2 years after the execution of this Agreement.

This stipulated that they'd set up proper robots.txt permissions that would support not just my old content, but also their current content. Because of the limitations of how robots.txt files are structured, it may not be possible to preserve old URLs without also allowing for the preservation of current URLs. The 72-hour turnaround for repairs was only for the first two years, mostly because ICANN and/or courts may not like it when you have indefinite restrictions.

The final agreed-upon clause looked like this:

The Transferee shall not block caches, archives, crawlers or search engine access of any kind from the Domain Name to allow for proper archiving of content on the Domain Name. The sole remedy for the Transferor shall be that the Transferee will remedy any breach of this Clause 8.5 promptly on notice from the Transferor, generally within 72 hours, for the first two years after the date of this Agreement. For the avoidance of doubt, the Transferor may seek to enforce this remedy by means of injunction where appropriate.

From a technical perspective, the robots.txt-equivalent portion might actually be a little bit more liberal. It's made clear that I'm not looking to levy fines or damages in case of a blocked crawler, just that it should be fixed. This is also constrained to two years in total.

Technical notes

At Jonah Edwards' suggestion, I considered explicitly archiving a notice about the transfer of ownership. I settled on a well-known location that could serve as a line in the sand, signaling a change immediately before or after the presence of the URI. I wrote up a draft /.well-known/archival-transfer-marker proposal as a GitHub gist, solicited some feedback, and tried it out.

The actual notice preserved was:

This domain is transferring ownership in the near future. The current owners have held this domain since Fri Nov 07 05:52:26 2008, and recognize caches, archives, and libraries may have captured content from this domain during that time. The owners retain any and all rights in that preserved content, but grant caches, archives, and libraries a nonexclusive right to present said content as appropriate in their capacity as caches, archives, and/or libraries.

The draft proposal set out the following steps:

In practice, this didn't quite work out consistently. I tried this process with five archival tools:

Some archives don't archive 404s. Some archives don't archive robots.txt files. Some archives only maintain a "latest" crawl, so a recrawl of the deleted marker would destroy the contents. Some archives are private, and note the capture date and time, but won't display the archive itself to the public. Some archives had issues presenting the plain text, extensionless marker after capture.

As such, I can't recommend this exact process for others in the future, and have marked the draft as deprecated.

It seems like, for greatest compatibility, and only considering the domain scope, one instead wants to assert a minimally valid date range of ownership, with the key data in the filename itself.

Archival ownership markers

A /.well-known/archival-ownership-markers/ resource prefix is a well-known location meant to define the start, end, or range of dates of a unique owner of a domain name.

The capture of a /.well-known/archival-ownership-markers/-prefixed resource SHOULD NOT confer any new or different rights or permissions to the archiving party for existing captures. It also SHOULD NOT confer any intellectual property rights. Its capture only marks a potential change in archival policy for the domain's content, whether due to domain name ownership or other reasons.

A /.well-known/archival-ownership-markers/-prefixed resource MAY be empty; its existence is enough to indicate the boundary. The file MAY also include human-readable context for the boundary change, suitable to support the decision-making of web archives as to whether to continue to support the availability of content after e.g. a robots.txt policy change.

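A /.well-known/archival-ownership-markers/-prefixed resource is named with a unique string of one or more filesystem-safe characters representing the owner, a double hyphen as delimiter, a start date, a second double hyphen, an end date, and a .txt extension, and is served with a text/plain MIME type. The two dates form an inclusive range of valid RFC 3339 dates, and either one may be omitted, as in example--20081107T055226Z--20200622.txt or example----2020-06-22.txt.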

A /.well-known/archival-ownership-markers/-prefixed resource with only delimiters and no dates (e.g. example----.txt) is not a valid resource name; a valid resource name must have at least one date.

The interpretation of /.well-known/archival-ownership-markers/-prefixed resources with overlapping dates, and which do not have clarifying information in their content bodies, is undefined.

The use of ISO 8601 EDTF uncertain/approximate qualifiers is not supported, as the date format here is the RFC 3339 profile, and the intent is to provide unambiguous dates for rigorous, legalistic decision-making: someone involved knows a valid date, either the transferor or the transferee.
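The naming rules above can be sketched as a small parser. This is one possible reading, not a reference implementation. It assumes the owner string never contains a double hyphen, accepts only the three date spellings that appear in the examples in this document, and treats every date as UTC. The names Marker and parse_marker are my own.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

PREFIX = "/.well-known/archival-ownership-markers/"

# Assumption: only the date spellings used in this document's examples.
DATE_FORMATS = ("%Y-%m-%d", "%Y%m%d", "%Y%m%dT%H%M%SZ")


@dataclass(frozen=True)
class Marker:
    owner: str
    start: Optional[datetime]  # None when the start date is omitted
    end: Optional[datetime]    # None when the end date is omitted


def _parse_date(text: str) -> Optional[datetime]:
    """Parse one date field; an empty field means 'not given'."""
    if not text:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def parse_marker(path: str) -> Marker:
    """Split a marker path into owner, start, and end."""
    if not path.startswith(PREFIX) or not path.endswith(".txt"):
        raise ValueError("not an archival ownership marker")
    stem = path[len(PREFIX):-len(".txt")]
    # Split from the right so single hyphens inside dates are left alone.
    parts = stem.rsplit("--", 2)
    if len(parts) != 3 or not parts[0]:
        raise ValueError("expected owner--start--end.txt")
    owner, start_text, end_text = parts
    start, end = _parse_date(start_text), _parse_date(end_text)
    if start is None and end is None:
        raise ValueError("a marker needs at least one date")
    return Marker(owner, start, end)


marker = parse_marker(PREFIX + "example--20081107T055226Z--20200622.txt")
print(marker.owner, marker.start, marker.end)
```

A name such as example----.txt raises ValueError here, matching the rule that a valid name carries at least one date. Overlapping markers are left to the caller, since their interpretation is undefined.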

Use cases

Without start date, with end date

To support web archives which use robots.txt as a policy for presenting archived content, a domain:

SHOULD:

  1. create /.well-known/archival-ownership-markers/example----2020-06-22.txt, and
  2. capture /.well-known/archival-ownership-markers/example----2020-06-22.txt in web archives.

This indicates that a change in robots.txt AFTER June 22, 2020 is not reflective of the policies of captures AT OR PRIOR TO June 22, 2020. As such, an archive MAY choose to use a previously captured robots.txt as policy for older captures instead.
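Step 1 can be sketched as a short script that writes the marker file into a site's document root. This assumes a statically served site where files under the web root map directly to URLs. The helper names and the web root path are hypothetical. Serving the file as text/plain and submitting it to archives (step 2) depend on your web server and on each archive, and are not shown.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import Optional


def marker_name(owner: str, start: Optional[date], end: Optional[date]) -> str:
    """Build owner--start--end.txt, leaving an omitted date empty."""
    if start is None and end is None:
        raise ValueError("a marker needs at least one date")
    start_text = start.isoformat() if start else ""
    end_text = end.isoformat() if end else ""
    return f"{owner}--{start_text}--{end_text}.txt"


def write_marker(web_root: Path, owner: str, start: Optional[date],
                 end: Optional[date], note: str = "") -> Path:
    """Create the marker file under web_root; an empty note is allowed."""
    directory = web_root / ".well-known" / "archival-ownership-markers"
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / marker_name(owner, start, end)
    path.write_text(note, encoding="utf-8")
    return path


print(marker_name("example", None, date(2020, 6, 22)))  # example----2020-06-22.txt

# A temporary directory stands in for the real document root here.
with tempfile.TemporaryDirectory() as tmp:
    created = write_marker(Path(tmp), "example", None, date(2020, 6, 22))
    print(created.relative_to(tmp).as_posix())
```

The file body is left empty in this example because the marker's existence alone indicates the boundary.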

With start date, with end date

To support web archives which can have human-designated exceptions for what are otherwise robots.txt-based policies, a domain:

MAY:

  1. create /.well-known/archival-ownership-markers/example--20081107T055226Z--20200622.txt with an explicit attestation such as:

    This domain is transferring ownership in the near future. The current owners have held this domain since Fri Nov 07 05:52:26 2008, and recognize caches, archives, and libraries may have captured content from this domain during that time. The owners retain any and all rights in that preserved content, but grant caches, archives, and libraries a nonexclusive right to present said content as appropriate in their capacity as caches, archives, and/or libraries.

  2. capture /.well-known/archival-ownership-markers/example--20081107T055226Z--20200622.txt in web archives.

This indicates that a change in robots.txt AFTER June 22, 2020 is not reflective of the policies of captures BETWEEN November 7, 2008 and June 22, 2020, inclusive. In addition, it formally notes the transfer of ownership. As such, an archive MAY choose to use a previously captured robots.txt as policy for older captures instead.

With start date, without end date

To support web archives which can have human-designated exceptions for what are otherwise robots.txt-based policies, an entity which has taken control of a domain due to expiration of hosting, registration, or other services, and which automatically "parks" the domain (placing generic, unrelated, or advertising content on it), MAY:

  1. create /.well-known/archival-ownership-markers/example--20200516T220338Z--.txt with an explicit attestation such as:

    This domain was automatically parked after a domain registration expiration. The expiration date was 2020-05-16T22:03:38+00:00.

  2. capture /.well-known/archival-ownership-markers/example--20200516T220338Z--.txt in web archives.

This indicates that a change in robots.txt AT OR AFTER May 16, 2020 is not reflective of the policies of captures PRIOR TO then. In addition, it formally notes the context of ownership change. As such, an archive MAY choose to use a previously captured robots.txt as policy for older captures instead.
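The three use cases share one rule: a robots.txt only speaks for a capture when no marker boundary falls between the two. Here is a minimal sketch of how an archive might check that, working at whole-day granularity. The function names are hypothetical, and no archive is known to implement this.

```python
from datetime import date
from typing import Optional


def position(day: date, start: Optional[date], end: Optional[date]) -> int:
    """Place a day relative to a marker: -1 before, 0 inside, 1 after."""
    if start is not None and day < start:
        return -1
    if end is not None and day > end:
        return 1
    return 0  # the range is inclusive at both ends


def robots_reflects_capture(capture_day: date, robots_day: date,
                            start: Optional[date], end: Optional[date]) -> bool:
    """True if no marker boundary separates the capture from the robots.txt."""
    return position(capture_day, start, end) == position(robots_day, start, end)


# Without start date, with end date: example----2020-06-22.txt
print(robots_reflects_capture(date(2015, 3, 1), date(2020, 7, 1),
                              None, date(2020, 6, 22)))  # False

# With start date, without end date: example--20200516T220338Z--.txt
print(robots_reflects_capture(date(2019, 1, 1), date(2020, 5, 16),
                              date(2020, 5, 16), None))  # False

# Capture and robots.txt both inside the marked range.
print(robots_reflects_capture(date(2015, 3, 1), date(2016, 1, 1),
                              date(2008, 11, 7), date(2020, 6, 22)))  # True
```

When the result is False, the archive MAY fall back to a robots.txt captured on the same side of the boundary as the capture itself. That decision remains a matter of each archive's policy.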

Left as an exercise for the reader

With current robots.txt-based web archival practices, new owners will always be able to censor older content.

From a legal perspective, a contract would have to stipulate an indefinite and perpetual term for maintaining access by crawlers, as well as pass those terms along to the next buyer. This only works so long as there is a buyer; a domain expiration breaks the chain of obligation.

In addition, these contractual terms are only truly proven out in a court, upheld by a judge. My particular example is an international transfer, from a US individual to an Irish company. Our contract specified an Irish jurisdiction, but what states and countries would be particularly useful for terms like these? And what of arbitration?

A long-term lease could solve the legal problem, but it puts the onus of ownership and management on the original owner, instead of where it probably should be: on ICANN and its terms for domain registrations, and with archives and crawlers and how they choose to understand (or not understand) the content they're archiving.

Sources considered

Via Nancy Sims, Digitization of Special Collections and Archives: Legal and Contractual Issues, which has model deeds the example attestation was modeled after.

Via Jonah Edwards, The Oakland Archive Policy, for which this document helps address some of the cases in the "re-insertions of web sites based on change of ownership" category; Robots.txt Files and Archiving .gov and .mil Websites, for which this document helps address the example of "a domain name changes hands;" and Robots.txt meant for search engines don't work well for web archives, for which this document helps address the example of "parked domains."


I'm Vitorio, it's June 22, 2020, thanks for your time. Updated March 16, 2021 with ISO 8601:2-2019 notes.