Legal and technical notes on telling web archives of domain ownership change
Web archives, libraries, and caches may have preserved content from your site, but they aren't tracking that it's your content. Once your domain name passes to its next owner, current practices around robots.txt mean that owner can retroactively restrict access to your preserved content.
Here's my personal work towards ameliorating this: a contract with a two-year crawlers clause, and attempts at a .well-known URI serving as a marker.
Legal notes
(This is not legal advice, I am not a lawyer, and I am not your lawyer. Contracts are legal devices, so you should have counsel experienced in domain name transfers, IP rights, and both your jurisdiction and that of the other party.)
I was asked if I'd consider selling a domain name. Specifics matter, so:
- I had used the domain name in a personal and professional, but not commercial, capacity.
- I didn't need to retain ownership of the domain as a commercial asset.
- I wasn't actively using the domain for new content.
- Older URLs had mostly already been migrated or archived.
- There were redirects to externally-hosted content on the domain.
- Most redirects were already permanent redirects.
- I did not have email addresses attached to the domain.
- I didn't need to retain or constrain ownership of email addresses for security or privacy reasons.
- I had 100% of the content hosted on the domain backed up or already hosted elsewhere.
- There wasn't a risk of losing content, just URIs.
Based on these, I felt it was safe to consider selling the domain, as long as any archived personal and professional content wouldn't be hidden if and when the new owners changed their robots.txt to something more restrictive.
In asking around, no-one I knew had ever seen a "robots.txt clause" or "archival/preservation terms" in a domain name transfer agreement. Neither my attorney nor the buyer's counsel had, either, and both were unclear on why one was necessary. Isn't my content, my content? Isn't any archival between me and the archive? I explained the need like this:
From a practical perspective, there are technical steps [the buyer] could take that would inadvertently interfere with the content that used to be on the domain, now or well after the purchase.
Third parties, like Google, the Internet Archive, national libraries, etc., crawl the domain and cache or archive its content, and have since before I acquired the domain in 2008. They do this automatically, without an explicit relationship with me. It's relatively safe for me to sell the domain and lose my original URLs, knowing that someone with an outdated link can look it up in those historical references.
The issue is that web crawlers don't track domain or content ownership, they just look at the domain itself. If [the buyer] blocks all caches, archives, or search engines from the domain, it will also retroactively make my past content unavailable in those same sources.
My attorney's original proposed clause looked like this:
Transferee shall not block caches, archives, crawlers or search engine access of any kind from the domain to allow for proper archiving of the new and previous websites. In the event that Transferor notifies Transferee of a violation of this subsection, Transferor agrees to correct the violation within seventy-two (72) hours of receiving notice of the violation for a period of 2 years after the execution of this Agreement.
This stipulated that they'd set up proper robots.txt permissions that would support not just my old content, but also their current content. Because of the limitations of how robots.txt files are structured, it may not be possible to preserve old URLs without also allowing for the preservation of current URLs. The 72-hour turnaround for repairs was only for the first two years, mostly because ICANN and/or courts may not like it when you have indefinite restrictions.
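To make that limitation concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt bodies, the ExampleArchiveBot user agent, and the example.com URLs are all made up for illustration. Rules match on user agent and URL path only, so nothing in the file can say "content from before the sale."

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical policies a new owner might publish.
BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"
BLOCK_ONE_PATH = "User-agent: *\nDisallow: /products/\n"

# Hypothetical URLs: one from the previous owner's era, one from the new site.
OLD_URL = "https://example.com/2009/old-post"
NEW_URL = "https://example.com/products/new"

if __name__ == "__main__":
    for label, policy in (("block all", BLOCK_EVERYTHING), ("block /products/", BLOCK_ONE_PATH)):
        for url in (OLD_URL, NEW_URL):
            print(label, url, allowed(policy, "ExampleArchiveBot", url))
```

A blanket Disallow catches old and new URLs alike. A path-scoped rule separates them only when the old and new content happen to live under different paths, which neither party can guarantee in advance.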
The final agreed-upon clause looked like this:
The Transferee shall not block caches, archives, crawlers or search engine access of any kind from the Domain Name to allow for proper archiving of content on the Domain Name. The sole remedy for the Transferor shall be that the Transferee will remedy any breach of this Clause 8.5 promptly on notice from the Transferor, generally within 72 hours, for the first two years after the date of this Agreement. For the avoidance of doubt, the Transferor may seek to enforce this remedy by means of injunction where appropriate.
From a technical perspective, the robots.txt-equivalent portion might actually be a little bit more liberal. It's made clear that I'm not looking to levy fines or damages in case of a blocked crawler, just that it should be fixed. This is also constrained to two years in total.
Technical notes
At Jonah Edwards' suggestion, I considered explicitly archiving a notice about the transfer of ownership. I settled on a well-known location that could serve as a line in the sand, signaling a change immediately before or after the presence of the URI. I wrote up a draft /.well-known/archival-transfer-marker proposal as a GitHub gist, solicited some feedback, and tried it out.
The actual notice preserved was:
This domain is transferring ownership in the near future. The current owners have held this domain since Fri Nov 07 05:52:26 2008, and recognize caches, archives, and libraries may have captured content from this domain during that time. The owners retain any and all rights in that preserved content, but grant caches, archives, and libraries a nonexclusive right to present said content as appropriate in their capacity as caches, archives, and/or libraries.
The draft proposal set out the following steps:
- Archive your robots.txt file
- Create your .well-known/archival-transfer-marker file with or without contextual content inside it
- Archive your .well-known/archival-transfer-marker file
- Delete your .well-known/archival-transfer-marker file
- Archive the deletion of the .well-known/archival-transfer-marker, because if the file persists, it doesn't serve its purpose as a temporal marker
In practice, this didn't quite work out consistently. I tried this process with five archival tools:
- Internet Archive Wayback Machine
- Perma.cc individual account
- Pinboard archiving account
- Archive.today
- Webrecorder Desktop
Some archives don't archive 404s. Some archives don't archive robots.txt files. Some archives only maintain a "latest" crawl, so a recrawl of the deleted marker would destroy the contents. Some archives are private, and note the capture date and time, but won't display the archive itself to the public. Some archives had issues presenting the plain text, extensionless marker after capture.
As such, I can't recommend this exact process for others in the future, and have marked the draft as deprecated.
It seems like, for greatest compatibility, and only considering the domain scope, one instead wants to assert a minimally valid date range of ownership, with the key data in the filename itself.
Archival ownership markers
A /.well-known/archival-ownership-markers/ resource prefix is a well-known location meant to define the start, end, or range of dates of a unique owner of a domain name.
The capture of a /.well-known/archival-ownership-markers/-prefixed resource SHOULD NOT confer any new or different rights or permissions to the archiving party for existing captures. It also SHOULD NOT confer any intellectual property rights. Its capture only marks a potential change in archival policy for the domain's content, whether due to domain name ownership or other reasons.
A /.well-known/archival-ownership-markers/-prefixed resource MAY be empty; its existence is enough to indicate the boundary. The file MAY also include human-readable context for the boundary change, suitable to support the decision-making of web archives as to whether to continue to support the availability of content after e.g. a robots.txt policy change.
A /.well-known/archival-ownership-markers/-prefixed resource is named with a unique string of one or more filesystem-safe characters representing the owner, a double hyphen as delimiter, a range of valid RFC 3339 dates (inclusive), and a .txt extension, served with a text/plain MIME type:
- /.well-known/archival-ownership-markers/example----2020-06-22.txt
  - /.well-known/archival-ownership-markers/ : prefix
  - example : arbitrary owner name
  - -- : filesystem-safe ISO 8601 delimiter
  - (no start date)
  - -- : filesystem-safe ISO 8601 delimiter
  - 2020-06-22 : RFC 3339 date representing a reliable end date of control by the current owner (inclusive)
  - .txt : file extension for archives that don't like extensionless resources
- /.well-known/archival-ownership-markers/example--20081107T055226Z--20200622.txt
  - /.well-known/archival-ownership-markers/ : prefix
  - example : arbitrary owner name
  - -- : filesystem-safe ISO 8601 delimiter
  - 20081107T055226Z : filesystem-safe (basic format) RFC 3339 date representing a reliable start date of control by the current owner (inclusive)
  - -- : filesystem-safe ISO 8601 delimiter
  - 20200622 : basic format RFC 3339 date representing a reliable end date of control by the current owner (inclusive)
  - .txt : file extension for archives that don't like extensionless resources
- /.well-known/archival-ownership-markers/example--20200516T220338Z--.txt
  - /.well-known/archival-ownership-markers/ : prefix
  - example : arbitrary owner name
  - -- : filesystem-safe ISO 8601 delimiter
  - 20200516T220338Z : filesystem-safe (basic format) RFC 3339 date representing a reliable start date of control by the current owner (inclusive)
  - -- : filesystem-safe ISO 8601 delimiter
  - (no end date)
  - .txt : file extension for archives that don't like extensionless resources
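As a rough illustration of the naming scheme, a marker path could be assembled like this. The function name, the owner-name pattern, and the choice to always emit basic-format dates are my assumptions for the sketch; the text above only requires filesystem-safe characters and valid dates.

```python
import re
from datetime import date, datetime, timezone

PREFIX = "/.well-known/archival-ownership-markers/"

# Assumed owner pattern: letters, digits, dots, underscores, and single
# hyphens, so the owner can never contain the "--" delimiter itself.
OWNER_PATTERN = re.compile(r"[A-Za-z0-9._]+(?:-[A-Za-z0-9._]+)*")


def _format_when(value):
    """Format a date or an aware datetime in basic format; None becomes ''."""
    if value is None:
        return ""
    # datetime is a subclass of date, so it has to be checked first.
    if isinstance(value, datetime):
        if value.tzinfo is None:
            raise ValueError("datetimes must be timezone-aware")
        return value.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return value.strftime("%Y%m%d")


def marker_path(owner, start=None, end=None):
    """Build a marker path from an owner name and at least one date."""
    if not OWNER_PATTERN.fullmatch(owner):
        raise ValueError("owner name is not filesystem-safe")
    if start is None and end is None:
        raise ValueError("a marker needs at least one date")
    return f"{PREFIX}{owner}--{_format_when(start)}--{_format_when(end)}.txt"


if __name__ == "__main__":
    held_since = datetime(2008, 11, 7, 5, 52, 26, tzinfo=timezone.utc)
    print(marker_path("example", held_since, date(2020, 6, 22)))
```

Emitting only the basic format avoids single hyphens inside the dates, which keeps the double-hyphen delimiter easy to spot by eye.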
A /.well-known/archival-ownership-markers/-prefixed resource with only delimiters and no dates (e.g. example----.txt) is not a valid resource name; a valid resource name must have at least one date.
The interpretation of /.well-known/archival-ownership-markers/-prefixed resources with overlapping dates, and which do not have clarifying information in their content bodies, is undefined.
The use of ISO 8601 EDTF uncertain/approximate qualifiers is not supported, as the date format here is the RFC 3339 profile, and the intent is to provide unambiguous dates for rigorous, legalistic decision-making: someone involved knows a valid date, either the transferor or the transferee.
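On the reading side, an archive might parse and validate marker names along these lines. This is a sketch under assumptions: the function names are hypothetical, the owner is assumed to contain no double hyphen, and only the three date spellings used in the examples above are accepted.

```python
from datetime import datetime, timezone

# Date-only spellings used in the examples: extended and basic format.
_DATE_FORMATS = ("%Y-%m-%d", "%Y%m%d")
# Basic-format UTC timestamp, e.g. 20081107T055226Z.
_DATETIME_FORMAT = "%Y%m%dT%H%M%SZ"


def _parse_when(text):
    """Return None for '', an aware datetime for a timestamp, else a date."""
    if not text:
        return None
    try:
        return datetime.strptime(text, _DATETIME_FORMAT).replace(tzinfo=timezone.utc)
    except ValueError:
        pass
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def parse_marker_name(name):
    """Split a marker file name or path into (owner, start, end)."""
    filename = name.rsplit("/", 1)[-1]
    if not filename.endswith(".txt"):
        raise ValueError("marker names end in .txt")
    parts = filename[: -len(".txt")].split("--")
    if len(parts) != 3 or not parts[0]:
        raise ValueError("expected owner--start--end.txt")
    owner, start_text, end_text = parts
    start, end = _parse_when(start_text), _parse_when(end_text)
    if start is None and end is None:
        raise ValueError("a valid marker name has at least one date")
    return owner, start, end


if __name__ == "__main__":
    print(parse_marker_name("example--20081107T055226Z--20200622.txt"))
```

Splitting on the double hyphen works even for the extended-format example, because a date like 2020-06-22 contains only single hyphens.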
Use cases
Without start date, with end date
To support web archives which use robots.txt as a policy for presenting archived content, a domain:
- with a robots.txt which accurately reflects its desired archiving policy, and
- for which a near-future ownership change may result in a change in robots.txt,
SHOULD:
- create /.well-known/archival-ownership-markers/example----2020-06-22.txt, and
- capture /.well-known/archival-ownership-markers/example----2020-06-22.txt in web archives.
This indicates that a change in robots.txt AFTER June 22, 2020 is not reflective of the policies of captures AT OR PRIOR TO June 22, 2020. As such, an archive MAY choose to use a previously captured robots.txt as policy for older captures instead.
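How an archive acts on that MAY is up to the archive. As one possible reading, here is a sketch that picks which captured robots.txt to apply to a given page capture. The function name and the list-of-tuples data shape are hypothetical, and treating the inclusive end date as running to the end of that day in UTC is my assumption.

```python
from datetime import date, datetime, time, timezone


def governing_robots(capture_time, robots_captures, marker_end):
    """Choose the robots.txt capture to apply to a page captured at capture_time.

    robots_captures is a list of (captured_at, robots_txt_body) tuples.
    Returns one of those tuples, or None if the list is empty.
    """
    if not robots_captures:
        return None
    ordered = sorted(robots_captures, key=lambda capture: capture[0])
    # The end date is inclusive, so the boundary is the last instant of that day.
    boundary = datetime.combine(marker_end, time.max, tzinfo=timezone.utc)
    if capture_time <= boundary:
        previous_owner_era = [c for c in ordered if c[0] <= boundary]
        if previous_owner_era:
            # Latest robots.txt seen while the previous owner held the domain.
            return previous_owner_era[-1]
    # Otherwise fall back to current practice: the most recent robots.txt.
    return ordered[-1]


if __name__ == "__main__":
    utc = timezone.utc
    captures = [
        (datetime(2019, 1, 1, tzinfo=utc), "User-agent: *\nAllow: /"),
        (datetime(2020, 7, 1, tzinfo=utc), "User-agent: *\nDisallow: /"),
    ]
    chosen = governing_robots(datetime(2015, 3, 1, tzinfo=utc), captures, date(2020, 6, 22))
    print(chosen[1])
```

Under this reading, a restrictive robots.txt published after the marker date still governs captures of the new owner's content, but no longer reaches back across the boundary.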
With start date, with end date
To support web archives which can have human-designated exceptions for what are otherwise robots.txt-based policies, a domain:
- with a robots.txt which accurately reflects its desired archiving policy, and
- for which a near-future ownership change may result in a change in robots.txt,
MAY:
- create /.well-known/archival-ownership-markers/example--20081107T055226Z--20200622.txt with an explicit attestation such as: This domain is transferring ownership in the near future. The current owners have held this domain since Fri Nov 07 05:52:26 2008, and recognize caches, archives, and libraries may have captured content from this domain during that time. The owners retain any and all rights in that preserved content, but grant caches, archives, and libraries a nonexclusive right to present said content as appropriate in their capacity as caches, archives, and/or libraries.
- capture /.well-known/archival-ownership-markers/example--20081107T055226Z--20200622.txt in web archives.
This indicates that a change in robots.txt AFTER June 22, 2020 is not reflective of the policies of captures BETWEEN November 7, 2008 and June 22, 2020, inclusive. In addition, it formally notes the transfer of ownership. As such, an archive MAY choose to use a previously captured robots.txt as policy for older captures instead.
With start date, without end date
To support web archives which can have human-designated exceptions for what are otherwise robots.txt-based policies, an entity which has taken control of a domain due to expiration of hosting, registration, or other services, and which automatically "parks" the domain (placing generic, unrelated, or advertising content on it), MAY:
- create /.well-known/archival-ownership-markers/example--20200516T220338Z--.txt with an explicit attestation such as: This domain was automatically parked after a domain registration expiration. The expiration date was 2020-05-16T22:03:38+00:00.
- capture /.well-known/archival-ownership-markers/example--20200516T220338Z--.txt in web archives.
This indicates that a change in robots.txt AT OR AFTER May 16, 2020 is not reflective of the policies of captures PRIOR TO then. In addition, it formally notes the context of ownership change. As such, an archive MAY choose to use a previously captured robots.txt as policy for older captures instead.
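All three use cases reduce to asking whether a capture falls inside a marker's inclusive date window. A minimal sketch of that test follows. The function names are hypothetical, and expanding date-only boundaries to the start or end of the day in UTC is my assumption rather than something the draft specifies.

```python
from datetime import date, datetime, time, timezone


def _as_instant(value, end_of_day):
    """Turn a date-only boundary into an aware datetime; pass datetimes through."""
    # datetime is a subclass of date, so it has to be checked first.
    if isinstance(value, datetime):
        return value
    clock = time.max if end_of_day else time.min
    return datetime.combine(value, clock, tzinfo=timezone.utc)


def in_ownership_window(capture_time, start=None, end=None):
    """Return True if capture_time falls within the marker's inclusive window."""
    if start is None and end is None:
        raise ValueError("a marker needs at least one date")
    if start is not None and capture_time < _as_instant(start, end_of_day=False):
        return False
    if end is not None and capture_time > _as_instant(end, end_of_day=True):
        return False
    return True


if __name__ == "__main__":
    utc = timezone.utc
    held_since = datetime(2008, 11, 7, 5, 52, 26, tzinfo=utc)
    sold_on = date(2020, 6, 22)
    print(in_ownership_window(datetime(2015, 3, 1, tzinfo=utc), held_since, sold_on))
```

For the parked-domain case, the window has a start and no end, so a capture from before the expiration falls outside the parking entity's window and a later robots.txt need not apply to it.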
Left as an exercise for the reader
With current robots.txt-based web archival practices, new owners will always be able to censor older content.
From a legal perspective, a contract would have to stipulate an indefinite and perpetual term for maintaining access by crawlers, as well as pass those terms along to the next buyer. This only works so long as there is a buyer; a domain expiration breaks the chain of obligation.
In addition, these contractual terms are only truly proven out in a court, upheld by a judge. My particular example is an international transfer, from a US individual to an Irish company. Our contract specified an Irish jurisdiction, but what states and countries would be particularly useful for terms like these? And what of arbitration?
A long-term lease could solve the legal problem, but it puts the onus of ownership and management on the original owner, instead of where it probably should be: on ICANN and its terms for domain registrations, and with archives and crawlers and how they choose to understand (or not understand) the content they're archiving.
Sources considered
Via Nancy Sims, Digitization of Special Collections and Archives: Legal and Contractual Issues, which has model deeds the example attestation was modeled after.
Via Jonah Edwards, The Oakland Archive Policy, for which this document helps address some of the cases in the "re-insertions of web sites based on change of ownership" category; Robots.txt Files and Archiving .gov and .mil Websites, for which this document helps address the example of "a domain name changes hands;" and Robots.txt meant for search engines don't work well for web archives, for which this document helps address the example of "parked domains."
I'm Vitorio, it's June 22, 2020, thanks for your time. Updated March 16, 2021 with ISO 8601:2-2019 notes.