ENGINEERING · 2026.06.11 · 6 min read

Scraping Is an Arms Race You Win With Discipline

Clever bypasses win a week. Disciplined extraction wins years. On rate ceilings, backoff, observability, and the boring engineering that survives the arms race.

Hamad Pervaiz
Founder & CEO, BearPlex

Every scraper dies the same way. Not with an error. With a page that returns 200, looks perfectly healthy, and contains nothing you wanted.

The block usually landed weeks earlier. The team just was not watching.

Extraction at scale has a reputation as a dark art: rotating proxies, headless browsers, cat-and-mouse tricks traded in forum threads. The reputation is earned, but it points at the wrong skill. The hard part of scraping has never been getting one page. The hard part is getting page ten million, on schedule, six months from now, after the target has redesigned twice and swapped anti-bot vendors once. That is an operations problem. And operations problems are won with discipline.

The Lucky Run

Most extraction projects are engineered for the lucky run. Someone finds a bypass, the crawl works, the demo lands, and the team scales it 50x overnight, because nothing inflates faster than confidence in a working demo. Two weeks later the data stops, and nobody can say exactly when, because nobody was measuring.

I think of it as a simple split. The lucky run is when your scraper works because nobody has noticed it yet. The long run is when your scraper works because it was built to be tolerable. Those are different engineering goals, and they produce different systems.

The arms-race framing matters here. On the other side of your scraper sits an anti-bot vendor with a funded roadmap, a detection team, and telemetry from thousands of sites. Your one clever trick is competing against their quarterly release cycle. You will not out-clever a roadmap. You can out-discipline one, because discipline does not expire when detection improves. A system built on respect for rate ceilings, graceful degradation, and honest measurement keeps working through the detection upgrades that wipe out the trick-based crawlers around it.

The Ceiling

Every site has a tolerance: a request rate at which you are background noise, and a request rate at which you are a problem worth solving. Disciplined extraction starts by finding that ceiling and then staying politely under it, permanently, even when you could go faster.

This sounds slow. It is the opposite.
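
To make the ceiling concrete, here is a minimal sketch of a per-host pacer in Python. It assumes you have already settled on a tolerable requests-per-second figure for each host; the class name DomainPacer, the hosts, and the numbers are all hypothetical, not taken from any particular system.

```python
import time
from urllib.parse import urlsplit


class DomainPacer:
    """Enforces a minimum gap between requests to the same host."""

    def __init__(self, max_rps_by_host, default_max_rps=0.5, clock=time.monotonic):
        # max_rps_by_host: assumed, hand-tuned ceilings, e.g. {"example.com": 2.0}
        self.max_rps_by_host = max_rps_by_host
        self.default_max_rps = default_max_rps
        self.clock = clock
        self._next_allowed = {}  # host -> earliest time the next request may start

    def wait_time(self, url):
        """Reserve the next slot for this URL's host; return seconds to sleep first."""
        host = urlsplit(url).hostname or ""
        rps = self.max_rps_by_host.get(host, self.default_max_rps)
        gap = 1.0 / rps
        now = self.clock()
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + gap
        return start - now
```

A worker would call time.sleep(pacer.wait_time(url)) before each fetch. The point of the design is that the ceiling is a number in configuration, so going faster is a deliberate decision rather than something that happens when someone adds workers.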

Backoff is the clearest example. Most teams implement backoff as an apology, something the scraper does when it gets caught. We treat it as a feature with a spec. Exponential delays. Jitter, so a thousand workers do not retry in lockstep. Circuit breakers that pull an entire domain out of rotation the moment error rates twitch. A scraper that knows how to slow down is a scraper that gets to keep going. A scraper that only knows how to push gets remembered, fingerprinted, and banned at the network level, taking your future options with it.
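
As a rough illustration of backoff with a spec, the sketch below pairs exponential delays with full jitter and a per-domain circuit breaker. The thresholds (window size, 20% error rate, 15-minute cooldown) and the names backoff_delay and DomainBreaker are assumptions chosen for the example, not recommended values.

```python
import random
import time
from collections import deque


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random.random):
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt))."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


class DomainBreaker:
    """Pulls a domain out of rotation when its recent error rate crosses a threshold."""

    def __init__(self, window=50, min_samples=10, max_error_rate=0.2,
                 cooldown=900.0, clock=time.monotonic):
        self.results = deque(maxlen=window)  # True = failure, False = success
        self.min_samples = min_samples
        self.max_error_rate = max_error_rate
        self.cooldown = cooldown
        self.clock = clock
        self.open_until = None  # None means the breaker is closed

    def record(self, failed):
        """Log one request outcome and open the breaker if the error rate is too high."""
        self.results.append(bool(failed))
        if len(self.results) >= self.min_samples:
            rate = sum(self.results) / len(self.results)
            if rate > self.max_error_rate:
                self.open_until = self.clock() + self.cooldown
                self.results.clear()  # start fresh after the cooldown

    def allow(self):
        """False while the breaker is open; True again once the cooldown has passed."""
        if self.open_until is None:
            return True
        if self.clock() >= self.open_until:
            self.open_until = None
            return True
        return False
```

With one breaker per domain, a scheduler checks allow() before dequeuing work for that domain and calls record() after every response. Randomizing over the whole interval, rather than adding a small jitter on top of a fixed delay, is what keeps a large worker pool from retrying in lockstep.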

Fingerprint realism is the same idea from another angle. Amateurs fake a user agent string and call it stealth. Real detection looks at everything at once: TLS handshake, header order, timing rhythm, session depth, whether your supposed browser ever loads a stylesheet. You do not beat that by lying harder. You beat it by actually behaving like the thing you claim to be, which mostly means behaving like someone patient. No human reads 400 product pages a second. Not even my engineers, and I hired 65 of the most caffeinated ones I could find.

And when a block comes anyway, the discipline is to read it instead of fighting it. A 429 is the site telling you exactly where its line is. A new challenge page is advance notice that the defense posture changed. Teams that treat blocks as obstacles rotate IPs and stomp harder, burning trust they cannot rebuild. Teams that treat blocks as signals adjust the same day and are still collecting a year later. The block is free intelligence. Most people throw it away.
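
One way to read a 429 instead of fighting it is to honor the Retry-After header when the server sends one. The sketch below handles both forms the header can take (a number of seconds or an HTTP date). The function names are hypothetical, and the headers argument is assumed to be a plain dict keyed exactly as "Retry-After".

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Return the wait the server asked for, in seconds, or None if absent/unparseable."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def delay_for_response(status, headers, fallback_delay):
    """On 429 or 503, never retry sooner than the server's own hint."""
    if status in (429, 503):
        hinted = retry_after_seconds(headers.get("Retry-After"))
        if hinted is not None:
            return max(hinted, fallback_delay)
    return fallback_delay
```

Taking the larger of the server's hint and your own backoff means the scraper never retries earlier than the site asked, and never earlier than it would have on its own.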

The Instruments

The most dangerous scraper failure makes no noise at all. The site redesigns, your selector matches an empty div, and your pipeline keeps writing rows. Every status code says 200. Every dashboard stays green. Three weeks later an analyst asks why prices stopped changing, and you discover you have been collecting beautifully structured nothing since the 14th.

This is why I insist on observability on every fetch, with no exceptions for the boring ones. Each request should leave a record: status, latency, response size, parse yield. Each extracted record should pass through quality gates before it is allowed to exist:

  • Field fill rates, tracked over time and alarmed on dips
  • Value distributions, so a price column that flatlines to zero screams
  • Schema checks that catch a redesign within minutes, not weeks
  • Canary pages with known content, fetched on a schedule and compared against truth
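
The first two gates in the list above can be sketched in a few lines. This is an illustrative minimum, assuming records arrive as dicts and that a baseline fill rate per field has been measured from earlier healthy batches; the field names, the 0.10 tolerance, and the function names are hypothetical.

```python
def fill_rates(records, fields):
    """Fraction of records in which each field is present and non-empty."""
    total = len(records)
    rates = {}
    for field in fields:
        filled = sum(1 for r in records if r.get(field) not in (None, "", [], {}))
        rates[field] = filled / total if total else 0.0
    return rates


def gate_batch(records, baseline, max_drop=0.10, numeric_field="price"):
    """Return a list of alarm messages; an empty list means the batch passes."""
    if not records:
        return ["batch is empty"]
    alarms = []
    rates = fill_rates(records, baseline.keys())
    for field, expected in baseline.items():
        if rates[field] < expected - max_drop:
            alarms.append(
                f"fill rate for {field!r} fell to {rates[field]:.2f} "
                f"(baseline {expected:.2f})"
            )
    # Flatline check: every numeric value in the batch is identical.
    values = [r[numeric_field] for r in records
              if isinstance(r.get(numeric_field), (int, float))]
    if len(values) >= 2 and len(set(values)) == 1:
        alarms.append(f"{numeric_field!r} flatlined at {values[0]}")
    return alarms
```

A batch that returns any alarms would be held back from the main tables and routed to whoever is on call, the same way a failed health check would be.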

The principle underneath: treat data quality like uptime. An HTTP 500 is the good kind of failure, loud and pageable. A 200 full of garbage is the kind that quietly poisons every decision built downstream. If your monitoring only watches transport and never watches meaning, you are watching the half of the system that was never the risk.
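
A canary check is one way to watch meaning rather than transport. The sketch below assumes you maintain a small set of pages whose correct field values are known in advance. The Canary type, the fetch and parse callables, and their return shapes are hypothetical stand-ins for whatever your pipeline already uses.

```python
from dataclasses import dataclass


@dataclass
class Canary:
    url: str
    expected: dict  # field name -> value known to be correct for this page


def check_canary(canary, fetch, parse):
    """Fetch a known page and list the fields whose parsed value no longer matches.

    fetch(url) is assumed to return (status_code, body);
    parse(body) is assumed to return a dict of extracted fields.
    """
    status, body = fetch(canary.url)
    if status != 200:
        return [f"transport: HTTP {status}"]
    parsed = parse(body)
    return [
        f"{field}: expected {want!r}, got {parsed.get(field)!r}"
        for field, want in canary.expected.items()
        if parsed.get(field) != want
    ]
```

The case worth paging on is the one where the status is 200 and the list is still non-empty: transport looks healthy and the content is wrong.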

The Line

Discipline also means knowing when not to collect, and this is where I will plant a flag: the strongest scraping teams are defined by what they refuse.

Public data only. If it sits behind a login, a paywall, or a personal privacy setting, the answer is no. Read the terms of the sites you touch and treat them as an input to system design, not an obstacle to route around. Strip personal data you do not need at the point of ingestion, because every field you collect is a field you now have to secure, monitor, and answer for. Data minimization usually gets framed as compliance. I frame it as engineering hygiene: smaller surface, fewer liabilities, cheaper pipeline.
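
Stripping at the point of ingestion is easiest to enforce with an allowlist rather than a blocklist, so that a field nobody anticipated is dropped by default. A minimal sketch, with an invented schema and field names:

```python
# Hypothetical schema: only the fields the downstream product actually needs.
ALLOWED_FIELDS = frozenset({"sku", "title", "price", "currency", "fetched_at"})


def minimize(record, allowed=ALLOWED_FIELDS):
    """Keep only allowlisted fields; return the kept record and the names dropped."""
    kept = {k: v for k, v in record.items() if k in allowed}
    dropped = sorted(k for k in record if k not in allowed)
    return kept, dropped
```

Logging the dropped field names (never their values) gives you a record of what the parser saw without storing it.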

These lines are not decoration on top of the engineering. They are part of why the engineering survives. A collector that respects the explicit and implicit rules of its targets attracts less hostility, draws fewer escalations, and gives the lawyers on both sides nothing to do. The teams that ignore the lines win sprints and lose the race. When a project requires crossing them, the right answer is to walk away. We say no often, and I am proud of it. Refusal is a capability. It compounds.

The Long Run

None of this is theoretical for us. The Letti AI engagement was large-scale data extraction, the kind of work where the lucky run is worthless because the system has to keep producing long after launch week. Work at that scale is why I hold these convictions as hard as I do. Not exotic bypasses. The boring parts, treated as the product: rate discipline, degradation paths, instruments on every fetch, quality gates on every record.

And extraction is only the front door. The data that comes through it still has to be validated, deduplicated, versioned, and delivered somewhere it can earn its keep, which is why we treat scraping as the first stage of a data pipeline rather than a standalone stunt. A perfect crawl feeding a sloppy pipeline produces the same outcome as a blocked crawl, just with a bigger storage bill.
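
To illustrate the deduplicate-and-version step, here is a small in-memory sketch that writes a new version of a record only when its content has actually changed. It assumes each record carries a stable key (a hypothetical "sku" field) and a volatile "fetched_at" timestamp that should not count as a change. A real pipeline would back this with a database rather than a dict.

```python
import hashlib
import json


def fingerprint(record, ignore=("fetched_at",)):
    """Stable hash of a record's content, ignoring volatile fields."""
    stable = {k: v for k, v in record.items() if k not in ignore}
    payload = json.dumps(stable, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class VersionedStore:
    """In-memory stand-in for a store that appends a version only on change."""

    def __init__(self, key_field="sku"):
        self.key_field = key_field
        self.versions = {}  # key -> list of (fingerprint, record)

    def upsert(self, record):
        """Return True if a new version was written, False if it was a duplicate."""
        key = record[self.key_field]
        history = self.versions.setdefault(key, [])
        fp = fingerprint(record)
        if history and history[-1][0] == fp:
            return False
        history.append((fp, record))
        return True
```

Re-crawling an unchanged page then costs one hash comparison rather than a new row, which is where much of the storage bill tends to go.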

Here is the test I give any extraction system, ours included. Imagine the target hires its dream anti-bot team tomorrow. Does your system degrade gracefully, alert honestly, and keep delivering what it still can while you adapt? Or does it go dark and take your roadmap with it?

The arms race is real, and it never ends. That is exactly why it rewards the patient. Cleverness peaks on day one. Discipline compounds every day after.

And the page that returns 200 with nothing inside? With the right instruments, it pages you in four minutes instead of poisoning you for three weeks. That difference is the whole game.

Filed under engineering · 2026.06.11