Keeping the Web Up Under the Weight of AI Crawlers

2 months 1 week ago

If you run a site on the open web, chances are you've noticed a big increase in traffic over the past few months, whether or not your site has been getting more viewers, and you're not alone. Operators everywhere have observed a drastic increase in automated traffic—bots—and in most cases attribute much or all of this new traffic to AI companies.

Background

AI, in particular Large Language Models (LLMs) and generative AI (genAI), relies on compiling as much information from relevant sources (e.g., "texts written in English" or "photographs") as possible in order to build a functional and persuasive model that users will later interact with. While AI companies in part distinguish themselves by what data their models are trained on, possibly the greatest source of information, and one freely available to all of us, is the open web.

To gather up all that data, companies and researchers use automated programs called scrapers (sometimes referred to by the more general term "bots") to "crawl" over the links available between various webpages and save the types of information they're tasked with as they go. Scrapers are tools with a long, and often beneficial, history: services like search engines, the Internet Archive, and all kinds of scientific research rely on them.

When scrapers are not deployed thoughtfully, however, they can contribute to higher hosting costs, lower performance, and even site outages, particularly when site operators see so many of them in operation at the same time. In the long run all this may lead to some sites shutting down rather than bearing the brunt of it.

For-profit AI companies must ensure they do not poison the well of the open web they rely on in a short-sighted rush for training data.

Bots: Read the Room

There are existing best practices that those who use scrapers should follow. When bots and their operators ignore these guideposts, it signals to site operators, sometimes explicitly, that they can or should cut off the bots' access or impede their performance. In the worst case, ignoring them may take a site down for all users. Some companies appear to follow these practices most of the time, but we see increasing reports and evidence of new bots that don't.

First, where possible, scrapers should follow instructions given in a site's robots.txt file, whether those are to back off to a certain crawling rate, exclude certain paths, or not to crawl the site at all.

Second, bots should send their requests with a clearly labeled User Agent string which indicates their operator, their purpose, and a means of contact.

Third, those running scrapers should provide a process for site operators to request back-offs, rate caps, and exclusions, and to report problematic behavior, using the contact information or response forms linked in the User Agent string.
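As a rough illustration of the first two practices, here is a minimal Python sketch of a crawler that checks robots.txt rules before fetching and labels its requests. The bot name, info URL, contact address, and sample robots.txt are all hypothetical placeholders, and the fallback delay of one second is an arbitrary assumption. Note that Crawl-delay is a widely used but non-standard robots.txt extension, so not every site will set it.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity: product token, info page, and contact address
# are placeholders, not a real crawler.
BOT_TOKEN = "ExampleResearchBot"
USER_AGENT = f"{BOT_TOKEN}/1.0 (+https://example.org/bot-info; bot-admin@example.org)"

# Sample robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleResearchBot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_plan(parser: RobotFileParser, url: str):
    """Return (allowed, delay_seconds) for this bot and URL."""
    allowed = parser.can_fetch(BOT_TOKEN, url)
    delay = parser.crawl_delay(BOT_TOKEN)
    # Assumed fallback: wait one second between requests if no delay is given.
    return allowed, (delay if delay is not None else 1.0)


def build_request(url: str) -> Request:
    """Build (but do not send) a request carrying the labeled User Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})
```

A real crawler would fetch each site's own robots.txt, sleep for the returned delay between requests, and skip any URL for which `crawl_plan` reports `allowed` as false.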

Mitigations for Site Operators

Of course, if you're running a website dealing with a flood of crawling traffic, waiting for those bots to change their behavior for the better might not be realistic. Here are a few suggested, if imperfect, mitigations based in part on our own sometimes frustrating experiences.

First, use a caching layer. In most cases a Content Delivery Network (CDN) or an "edge platform" (essentially a newer iteration of a CDN) can provide this for you, and some services offer a free tier for non-commercial users. There are also a number of great projects if you prefer to self-host. Some of the tools we've used for caching include Varnish, Memcached, and Redis.
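To make the idea concrete, here is a minimal in-process sketch of what a caching layer does: it serves a stored copy of a page until that copy is older than a time-to-live, and only then calls the expensive renderer again. The class and parameter names are hypothetical, and this is not how any of the tools named above are implemented. In practice a CDN or a dedicated cache would do this job.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny time-to-live cache keyed by URL path (illustrative only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the behavior can be tested
        self._store: Dict[str, Tuple[float, str]] = {}

    def get_or_render(self, path: str, render: Callable[[str], str]) -> str:
        """Return a cached body if still fresh, otherwise render and store it."""
        now = self.clock()
        hit = self._store.get(path)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        body = render(path)  # the expensive step, e.g. database reads
        self._store[path] = (now, body)
        return body
```

With a TTL of even a minute, a burst of crawler requests for the same page triggers one render instead of one per request.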

Second, convert to static content to prevent resource-intensive database reads. In some cases this may reduce the need for caching.
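One way to picture this is a small export step that renders each page once and writes the result to disk, so a web server can serve plain files. The sketch below assumes a hypothetical `render` function supplied by your application and a trusted list of URL paths. It does no path sanitization and is not a full static site generator.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def export_static(paths: Iterable[str],
                  render: Callable[[str], str],
                  out_dir: Path) -> List[Path]:
    """Render each URL path once and write it as <path>/index.html."""
    written = []
    for url_path in paths:
        relative = url_path.strip("/")
        # "/" maps to index.html; "/about/" maps to about/index.html.
        target = out_dir / relative / "index.html" if relative else out_dir / "index.html"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(render(url_path), encoding="utf-8")
        written.append(target)
    return written
```

Rerunning the export whenever content changes keeps the files current without touching the database on every request.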

Third, use targeted rate limiting to slow down bots without taking your whole site down. But know this can get difficult when scrapers try to disguise themselves with misleading User Agent strings or by spreading a fleet of crawlers out across many IP addresses.
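A common way to implement targeted rate limiting is a token bucket per client key. The sketch below is one such approach under stated assumptions, not a description of any particular product. The key could be an IP address, a network range, or a User Agent string, and choosing it well is the hard part when crawlers spread across many addresses. All names and numbers are illustrative.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Bucket:
    tokens: float   # requests this client may still make right now
    updated: float  # clock reading when tokens were last refilled


class RateLimiter:
    """Per-key token bucket: `rate_per_sec` sustained, `burst` at most at once."""

    def __init__(self, rate_per_sec: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock  # injectable so the behavior can be tested
        self._buckets: Dict[str, Bucket] = {}

    def allow(self, key: str) -> bool:
        """Return True if the request for `key` should be served now."""
        now = self.clock()
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = Bucket(tokens=self.burst, updated=now)
            self._buckets[key] = bucket
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - bucket.updated
        bucket.tokens = min(self.burst, bucket.tokens + elapsed * self.rate)
        bucket.updated = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False
```

A request that is refused would typically get an HTTP 429 response, while other clients, each with their own bucket, are unaffected.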

Other mitigations such as client-side validation (e.g. CAPTCHAs or proof-of-work) and fingerprinting carry privacy and usability trade-offs, and we warn against deploying them without careful forethought.

Where Do We Go From Here?

To reiterate, whatever one's opinion of these particular AI tools, scraping itself is not the problem. Automated access is a fundamental technique of archivists, computer scientists, and everyday users that we hope is here to stay—as long as it can be done non-destructively. However, we realize that not all implementers will follow our suggestions for bots above, and that our mitigations are both technically advanced and incomplete.

Because we see so many bots operating for the same purpose at the same time, it seems there's an opportunity here to provide these automated data consumers with tailored data providers, removing the need for every AI company to scrape every website, seemingly, every day.

And on the operators' end, we hope to see more web-hosting and framework technology that is built with an awareness of these issues from day one, perhaps building in responses like just-in-time static content generation or dedicated endpoints for crawlers.

Starchy Grant

EFF to the FTC: DMCA Section 1201 Creates Anti-Competitive Regulatory Barriers

2 months 1 week ago

As part of a multi-pronged effort towards deregulation, the Federal Trade Commission has asked the public to identify any and all “anti-competitive” regulations. Working with our friends at Authors Alliance, EFF answered, calling attention to a set of anti-competitive regulations that many don’t recognize as such: the triennial exemptions to Section 1201 of the Digital Millennium Copyright Act, and the cumbersome process on which they depend.

Copyright grants exclusive rights to creators, but only as a means to serve the broader public interest. Fair use and other limitations play a critical role in that service by ensuring that the public can engage in commentary, research, education, innovation, and repair without unjustified restriction. Section 1201 effectively forbids fair uses where those uses require circumventing a software lock (a.k.a. technological protection measures) on a copyrighted work.

Congress realized that Section 1201 had this effect, so it adopted a safety valve—a triennial process by which the Library of Congress could grant exemptions. Under the current rulemaking framework, however, this intended safety valve functions more like a chokepoint. Individuals and organizations seeking an exemption to engage in lawful fair use must navigate a burdensome, time-consuming administrative maze. The existing procedural and regulatory barriers ensure that the rulemaking process—and Section 1201 itself—thwarts, rather than serves, the public interest.

The FTC does not, of course, control Congress or the Library of Congress. But we hope its investigation and any resulting report on anti-competitive regulations will recognize the negative effects of Section 1201 and that the triennial rulemaking process has failed to be the check Congress intended. Our comments urge the FTC to recommend that Congress repeal or reform Section 1201. At a minimum, the FTC should advocate for fundamental revisions to the Library of Congress’s next triennial rulemaking process, set for 2026, so that copyright law can once again fulfill its purpose: to support—rather than thwart—competitive and independent innovation.

You can find the full comments here.

Christopher Vines

Stepping outside the algorithm

2 months 1 week ago
Recent changes in how the big tech platforms X and Meta operate – specifically, stepping back from the responsibility of moderating disinformation online – have already negatively impacted the online…
Alan Finlay and Maja Romano

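[Mirror of Film] Spotlight on the Grassroots Strength of Yokohama's Citizens: "The Spirit of Yokohama" Shows How Community Building Should Work in a Mayoral Election Year = Katsuhiko Suzuki (鈴木賀津彦)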

2 months 1 week ago
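At the Yokohama International Film Festival, held in early May, the documentary "The Spirit of Yokohama," which captures Yokohama's diverse civic activities and the connections among them, was unveiled. Among the many films screened there that draw film fans and commercial attention, I took note of it as an "ultimate local film" that stood out in a different way. It spotlights Wasaburo Sugishima (杉島和三郎), 97 this year, who was born and raised in Motomachi, Yokohama, and has been involved in the city's community building for many years. As a kind of "connector" for civic activities, he explains, among other things, how the power of citizens was brought to bear in Yokohama's postwar reconstruction..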
JCJ

The Dangers of Consolidating All Government Information

2 months 1 week ago

The Trump administration has been heavily invested in consolidating all of the government’s information into a single searchable, or perhaps AI-queryable, super database. The compiling of all of this information is being done with the dubious justification of efficiency and modernization. However, in many cases, this information was originally siloed for important reasons: to protect your privacy, to prevent different branches of government from using sensitive data to punish or harass you, and to preserve the trust in and legitimacy of important civic institutions.

Attempts to Centralize All the Government’s Information About You

This process of consolidation has taken several forms. The purported Department of Government Efficiency (DOGE) has been seeking access to the data and computer systems of dozens of government agencies. According to one report, access to the data of these agencies has given DOGE, as of April 2025, hundreds of pieces of personal information about people living in the United States, ranging from financial and tax information to health and healthcare information and even computer IP addresses. EFF is currently engaged in a lawsuit against the U.S. Office of Personnel Management (OPM) and DOGE for disclosing personal information about government employees to people who don’t need it, in violation of the Privacy Act of 1974.

Another key maneuver in centralizing government information has been to steamroll the protections that kept this information away from agencies that don’t need it or could abuse it. This has been done by ignoring the law, as the Trump administration did when it ordered the IRS to make tax information available for the purposes of immigration enforcement. It has also been done through the creation of new (and questionable) executive mandates that all executive branch information be made available to the White House or any other agency. Specifically, this has been attempted with the March 20, 2025 Executive Order, “Stopping Waste, Fraud, and Abuse by Eliminating Information Silos,” which mandates that the federal government, as well as all 50 state governments, allow other agencies “full and prompt access to all unclassified agency records, data, software systems, and information technology systems.” But executive orders can’t override privacy laws passed by Congress.

Not only is the Trump administration trying to consolidate all of this data institutionally and statutorily, they are also trying to do it technologically. A new report revealed that the administration has contracted Palantir—the open-source surveillance and security data-analytics firm—to fuse data from multiple agencies, including the Department of Homeland Security and Health and Human Services.

Why it Matters and What Can Go Wrong 

The consolidation of government records equals more government power that can be abused. Different government agencies necessarily collect information to provide essential services or collect taxes. The danger comes when the government begins pooling that data and using it for reasons unrelated to the purpose for which it was collected.

Imagine, for instance, a scenario where a government employee could be denied health-related public services or support because of the information gathered about them by an agency that handles HR records. Or imagine a person’s research topic, as recorded in federal grant records, being used to weigh whether or not that person should be allowed to renew a passport.

Marginalized groups are most vulnerable to this kind of abuse, such as the use of tax records to locate individuals for immigration enforcement. Government records could also be weaponized against people who receive food subsidies, apply for student loans, or take government jobs.

Congress recognized these dangers 50 years ago when it passed the Privacy Act to put strict limits on the government’s use of large databases. At that time, trust in the government had eroded after revelations about White House enemies’ lists, misuse of existing government personality profiles, and surveillance of opposition political groups.

There’s another important issue at stake: the future of federal and state governments that actually have the information and capacity to help people. The more people learn to distrust the government because they worry the information they give certain government agencies may be used to hurt them in the future, the less likely people will be to participate or seek the help they need. The less people engage with these agencies, the less likely those agencies will be to survive. Trust is a key part of any relationship between the governed and government, and when that trust is abused or jettisoned, the long-term harms are irreparable.

EFF, like dozens of other organizations, will continue to fight to ensure personal records held by the government are only used and disclosed as needed and only for the purpose for which they were collected, as federal law demands.

Related Cases: American Federation of Government Employees v. U.S. Office of Personnel Management
Matthew Guariglia

Judges Stand With Law Firms (and EFF) Against Trump’s Executive Orders

2 months 1 week ago

“Pernicious.”

“Unprecedented... cringe-worthy.”

“Egregious.”

“Shocking.”

These are just some of the words that federal judges used in recent weeks to describe President Trump’s politically motivated and vindictive executive orders targeting law firms that have employed people or represented clients or causes he doesn’t like. 

But our favorite word by far is “unconstitutional.” 

EFF was one of the very first legal organizations to publicly come out in support of Perkins Coie when it became the first law firm to challenge the legality of President Trump’s executive order targeting it. Since then, EFF has joined four amicus briefs in support of targeted law firms, and in all four cases, judges from the U.S. District Court for the District of Columbia have indicated they’re having none of it. Three have issued permanent injunctions deeming the executive orders null and void, and the fourth seems to be headed in that same direction. 

Trump issued his EO against Perkins Coie on March 6. In a May 2 opinion finding the order unconstitutional and issuing a permanent injunction, Senior Judge Beryl A. Howell wrote:  

“By its terms, this Order stigmatizes and penalizes a particular law firm and its employees—from its partners to its associate attorneys, secretaries, and mailroom attendants—due to the Firm’s representation, both in the past and currently, of clients pursuing claims and taking positions with which the current President disagrees, as well as the Firm’s own speech,” Howell wrote. “In a cringe-worthy twist on the theatrical phrase ‘Let’s kill all the lawyers,’ EO 14230 takes the approach of ‘Let’s kill the lawyers I don’t like,’ sending the clear message: lawyers must stick to the party line, or else.” 

“Using the powers of the federal government to target lawyers for their representation of clients and avowed progressive employment policies in an overt attempt to suppress and punish certain viewpoints, … is contrary to the Constitution, which requires that the government respond to dissenting or unpopular speech or ideas with ‘tolerance, not coercion.’” 

Trump issued a similar EO against Jenner & Block on March 25. In a May 23 opinion also finding the order unconstitutional and issuing a permanent injunction, Senior Judge John D. Bates wrote:

“This order—which takes aim at the global law firm Jenner & Block—makes no bones about why it chose its target: it picked Jenner because of the causes Jenner champions, the clients Jenner represents, and a lawyer Jenner once employed. Going after law firms in this way is doubly violative of the Constitution. Most obviously, retaliating against firms for the views embodied in their legal work—and thereby seeking to muzzle them going forward—violates the First Amendment’s central command that government may not ‘use the power of the State to punish or suppress disfavored expression.’ Nat’l Rifle Ass’n of Am. v. Vullo, 602 U.S. 175, 188 (2024). More subtle but perhaps more pernicious is the message the order sends to the lawyers whose unalloyed advocacy protects against governmental viewpoint becoming government-imposed orthodoxy. This order, like the others, seeks to chill legal representation the administration doesn’t like, thereby insulating the Executive Branch from the judicial check fundamental to the separation of powers. It thus violates the Constitution and the Court will enjoin its operation in full.” 

Trump issued his EO targeting WilmerHale on March 27. In a May 27 opinion finding that order unconstitutional, Senior Judge Richard J. Leon wrote:

“The cornerstone of the American system of justice is an independent judiciary and an independent bar willing to tackle unpopular cases, however daunting. The Founding Fathers knew this! Accordingly, they took pains to enshrine in the Constitution certain rights that would serve as the foundation for that independence. Little wonder that in the nearly 250 years since the Constitution was adopted no Executive Order has been issued challenging these fundamental rights. Now, however, several Executive Orders have been issued directly challenging these rights and that independence. One of these Orders is the subject of this case. For the reasons set forth below, I have concluded that this Order must be struck down in its entirety as unconstitutional. Indeed, to rule otherwise would be unfaithful to the judgment and vision of the Founding Fathers!” 

“Taken together, the provisions constitute a staggering punishment for the firm’s protected speech! The Order is intended to, and does in fact, impede the firm’s ability to effectively represent its clients!” 

“Even if the Court found that each section could be grounded in Executive power, the directives set out in each section clearly exceed that power! The President, by issuing the Order, is wielding his authority to punish a law firm for engaging in litigation conduct the President personally disfavors. Thus, to the extent the President does have the power to limit access to federal buildings, suspend and revoke security clearances, dictate federal hiring, and manage federal contracts, the Order surpasses that authority and in fact usurps the Judiciary’s authority to resolve cases and sanction parties that come before the courts!” 

The fourth case in which EFF filed a brief involved Trump’s April 9 EO against Susman Godfrey. In that case, Judge Loren L. AliKhan is still considering whether to issue a permanent injunction, but on April 15 gave a fiery ruling from the bench in granting a temporary restraining order against the EO’s enforcement. 

“The executive order is based on a personal vendetta against a particular firm, and frankly, I think the framers of our Constitution would see this as a shocking abuse of power,” AliKhan said, as quoted by Courthouse News Service. “The government cannot hold lawyers hostage to force them to agree with it, allowing the government to coerce private business, law firms and lawyers solely on the basis of their view is antithetical to our constitutional republic and hampers this court, and every court’s, ability to adjudicate these cases.”

And, as quoted by the New York Times: “Law firms across the country are entering into agreements with the government out of fear that they will be targeted next and that coercion is plain and simple. And while I wish other firms were not capitulating as readily, I admire firms like Susman for standing up and challenging it when it does threaten the very existence of their business. … The government has sought to use its immense power to dictate the positions that law firms may and may not take. The executive order seeks to control who law firms are allowed to represent. This immensely oppressive power threatens the very foundations of legal representation in our country.” 

As we wrote when we began filing amicus briefs in these cases, an independent legal profession is a cornerstone of democracy and the rule of law. As a nonprofit legal organization that frequently sues the federal government, EFF understands the value of this bedrock principle and how it–and First Amendment rights more broadly–are threatened by President Trump’s executive orders. It is especially important that the whole legal profession speak out against these actions, particularly in light of the silence or capitulation of a few large law firms. 

We’re glad the courts agree.

Josh Richman