Glossary
Plain-English definitions for the concepts that show up when you build scrapers in production. Explore how they connect in the concept map, or search and browse the full A–Z list below.
151 terms · 407 connections
The standard success status — the request was handled and the body contains the expected content.
A permanent redirect to a new URL; scrapers should follow it and update stored links.
A temporary redirect; the resource lives elsewhere for now, often used in login or anti-bot interstitials.
A cache validation response telling the client its stored copy is still fresh, so no body is sent.
The server rejected the request as malformed — bad parameters, headers or body.
Authentication is missing or invalid; commonly a wrong or absent API key.
The server understood the request but refuses it — frequently an anti-bot block on a flagged IP or fingerprint.
The requested resource does not exist at that URL; a signal to drop or fix a crawl target.
The server gave up waiting for the client to finish the request.
The client exceeded a rate limit; responses often include a Retry-After header indicating when to resume.
A generic server-side failure; usually transient and worth a retry.
An upstream server returned an invalid response to a gateway or proxy in the chain.
The server is overloaded or down for maintenance; also returned by anti-bot pages while they evaluate a client.
A gateway timed out waiting for an upstream server, common behind slow proxies or renders.
An async HTTP client/server framework for high-concurrency Python scraping over asyncio.
Fetching data in the background via JS without a full page reload, often the real source of a page's content.
An enterprise anti-bot platform using sensor JS and behavioral scoring to detect automated traffic.
A specific URL that accepts requests for one operation, e.g. a scrape endpoint that takes a target URL and returns the rendered page.
A secret token sent with each request to authenticate the caller, meter usage and bill the account.
A request submitted as a job that completes later, retrieved by polling or a webhook — useful for slow renders at scale.
A Python library for parsing HTML/XML and querying it with simple, forgiving selectors.
Scoring how a client moves, scrolls, types and times actions to tell humans from scripted automation.
Techniques that distinguish automated clients from humans using fingerprints, behavior and network signals.
A risk rating anti-bot systems assign each request; low scores pass, high scores get challenged or blocked.
A strategy that crawls all links at one depth before going deeper, good for broad coverage.
The interactive checks and challenge pages that modern sites present to browsers, and how a genuine browser environment encounters them during normal page loading.
Driving a real browser programmatically to navigate, click and extract from pages that resist plain HTTP scraping.
An isolated browser profile (cookies, storage, cache) so parallel sessions do not leak state into each other.
The composite of JS-exposed properties — screen, fonts, plugins, timezone, languages — that identifies a browser instance.
A probe that verifies the browser exposes the full, consistent set of real APIs expected of a genuine user agent.
Running a target page inside a real (often headless) browser so JavaScript executes and the final DOM is captured — essential for SPAs.
A fingerprint from rendering hidden graphics to a <canvas>; subtle GPU/driver differences make the output device-specific.
A challenge designed to be easy for humans and hard for bots, gating access when a request looks automated.
An interstitial demanding a human challenge before the real page loads, often returned with 403 or 503.
Automatically clearing a CAPTCHA via solver services or token reuse so a flow can continue unattended.
The cookie Cloudflare issues after a passed challenge; reusing it (with matching IP/fingerprint) skips re-challenging.
The low-level interface (CDP) Puppeteer and Playwright use to instrument and control Chromium.
Simulating user clicks to expand sections, accept dialogs or trigger AJAX loads during a scrape.
Serving different content to bots than to humans, used both by sites to mislead scrapers and by spammers to fool crawlers.
A major CDN and security provider whose Bot Management issues JS, managed and Turnstile challenges to filter automation.
“Access denied” raised by a Cloudflare firewall (WAF) rule that the request violated — a common hard block.
Cloudflare's privacy-first challenge that validates users with passive signals instead of image puzzles.
The number of requests in flight at once; higher concurrency speeds throughput but risks rate limits and bans.
A gate that sets a clearance cookie via JS, which must be returned on the next request to access content.
Small key/value pairs the server sets and the client returns on later requests to maintain state such as login or anti-bot clearance.
A traversal run that fetches pages, extracts links and enqueues new URLs until a frontier or budget is exhausted.
The limited number of requests allotted to a crawl, balanced against rate limits, politeness and resources.
The unit consumed per successful scrape; pay-per-success billing charges credits only when a request returns usable data.
The styling language for web pages; its selector syntax is reused to target elements when scraping.
A pattern (e.g. `.price > span`) matching DOM elements, the primary way scrapers locate data.
Turning a fetched page into clean, structured records by selecting the relevant elements and discarding boilerplate.
A fast, cheap proxy hosted in a data center; easily fingerprinted by ASN and more readily blocked than residential IPs.
A real-time bot protection service that scores requests on fingerprint and behavior, serving captchas to suspects.
Restoring obfuscated code to a readable form to recover the algorithm behind a challenge or hidden API.
The Document Object Model — the in-memory tree of a page that JS mutates and scrapers traverse after rendering.
A retry strategy that grows the wait between attempts geometrically to avoid hammering a struggling server.
Collecting many client attributes (TLS, headers, JS APIs, hardware) into a near-unique signature used to identify and track clients.
Filling and submitting forms (search, login) programmatically to reach gated or query-driven content.
Routing a request through a proxy in a specific country or city to see the localized version of a site.
A privacy-oriented CAPTCHA alternative to reCAPTCHA, common on Cloudflare-protected sites.
A browser running without a visible window, used to render JS pages at scale on servers.
Spotting browsers run without a visible UI via tell-tale flags, missing features or timing quirks.
A hidden link or field invisible to humans but followed/filled by naive bots, flagging them instantly.
The markup language structuring web pages into elements that scrapers parse and select.
Building a navigable tree from raw HTML so elements can be queried by tag, class or selector.
Storing responses keyed by URL and validators (ETag, Last-Modified) so unchanged pages return 304 instead of a full body — affecting how fresh scraped data is.
Key/value metadata on a request or response (content type, cookies, user agent) that strongly influences how servers and anti-bot systems respond.
A message a client sends to a server specifying a method, URL, headers and optional body to ask for or submit data.
The server reply to a request, carrying a status code, headers and a body such as HTML or JSON.
A three-digit code in every response signalling outcome: 2xx success, 3xx redirect, 4xx client error, 5xx server error.
A multiplexed binary version of HTTP; its frame and header behavior is itself a fingerprintable signal.
A signature from HTTP/2 frame settings and header ordering that distinguishes real browsers from HTTP libraries.
A modern Python HTTP client supporting HTTP/2 and async, with an API closely modeled on Requests.
A WAF and bot-mitigation service that gates traffic with cookies, JS checks and reputation rules.
A pattern that loads more content as the user scrolls; automation must scroll repeatedly to harvest it all.
Blocking all traffic from an address after abuse signals; the main reason scrapers rotate proxies.
A score sites assign an IP based on past behavior and network type; low reputation triggers captchas or blocks.
A widely used hash of TLS ClientHello fields; mismatched JA3 vs. user agent is a classic bot tell.
A newer, more robust successor to JA3 for fingerprinting TLS clients, harder to spoof naively.
The language that runs in the browser to build pages dynamically, requiring rendering before data appears in the DOM.
An interstitial that runs JS the client must execute correctly to prove it is a real browser before the page loads.
JavaScript Object Notation — the lightweight text format used for request payloads and structured API responses.
Structured data embedded in pages (often in <script type="application/ld+json">) that scrapers can read directly.
An anti-bot system that ships obfuscated JS challenges and proof-of-work to raise the cost of automation.
Pulling href targets from a page to feed the frontier, normalizing relative and absolute URLs.
A fast C-backed Python library for parsing HTML/XML with full XPath support.
Cloudflare's adaptive interstitial that picks a challenge (JS, Turnstile) based on the request's risk score.
A proxy routed through a cellular carrier; many subscribers share each CGNAT IP, so banning one risks blocking real users and sites tend to trust these IPs.
Recording cursor paths and timing; perfectly straight or absent movement betrays automation.
Directing the browser to URLs and following links/redirects while waiting for load events to settle.
A JS flag set to true under automation; the first thing anti-bot scripts read to catch unpatched headless browsers.
Deliberately scrambling JS so anti-bot logic and tokens are hard to read or replicate.
Programmatic clicks, scrolls and key presses that drive a page through steps to reveal or load data.
Traversing multi-page result sets by following next-page links or incrementing page parameters until data is exhausted.
A Python data-analysis library used to clean, transform and export scraped data into tables and files.
A pricing model where only successful requests are billed, so blocked or failed attempts cost nothing.
Generating a PDF of a fully rendered page, preserving layout and styling for reports or archival.
A bot-defense platform (now HUMAN) relying on heavy client-side JS sensors and behavioral signals.
A cross-browser automation library (Chromium, Firefox, WebKit) with auto-waiting, contexts and network interception.
Python bindings for Playwright, driving Chromium, Firefox and WebKit with auto-waiting; stealth tweaks come from third-party plugins rather than the library itself.
Repeatedly checking a job endpoint until an asynchronous task reports completion.
Scrappey's primary endpoint: a POST call describing the target URL, command and options, returning the rendered result.
A computational puzzle the client must solve, taxing mass automation more than individual users.
A managed set of IPs the scraper draws from, balancing freshness, geography and reputation across requests.
An intermediary that forwards requests so the target sees the proxy IP, not yours — the backbone of large-scale scraping.
A Node library controlling Chromium over the DevTools Protocol for automation and rendering.
A cap on how many requests a client may make per time window; exceeding it typically returns HTTP 429.
A server defense capping requests per IP or token over time, returning 429s or blocks when exceeded.
Google's CAPTCHA with score-based (v3) and image-challenge (v2) modes that weigh behavior and reputation.
A sequence of redirects between the requested URL and the final page; long chains can hide cloaking or anti-bot gates.
The JSON body of a request that carries parameters such as the target URL, headers, proxy choice and browser commands.
The popular Python HTTP client for simple synchronous scraping of static pages and APIs.
A proxy whose IP belongs to a real ISP-assigned home connection, making traffic look like an ordinary user and harder to block.
An HTTP interface that exposes resources via predictable URLs and verbs (GET/POST), exchanging JSON payloads — the common shape of a scraping API.
Re-attempting failed requests, often with exponential backoff, to ride out transient errors and rate limits.
Analyzing a site's client code and network traffic to understand and replay its private APIs or anti-bot logic.
A file declaring which paths crawlers may or may not access; respecting it is core to ethical scraping.
A proxy that assigns a new IP per request or interval from a pool, spreading traffic to dodge rate limits and bans.
A pay-per-success web data API that combines headless browser rendering, rotating proxies and full browser session handling behind one request endpoint.
A batteries-included Python framework for building crawlers with spiders, pipelines and built-in concurrency.
A capability that returns a rendered image of a page, useful for visual QA, archiving or capturing content that resists text extraction.
The veteran browser-automation framework driving browsers via the WebDriver protocol across many languages.
Python bindings for Selenium WebDriver, automating real browsers for JS-heavy pages.
Encrypted telemetry payloads (Akamai/PerimeterX) generated client-side from device and behavior signals and sent for scoring.
A persisted browser/proxy context that keeps cookies, headers and IP stable across requests so multi-step flows stay authenticated.
A site that renders content client-side via JS, so the initial HTML is near-empty and needs a browser to scrape.
An XML file listing a site's URLs to help crawlers find pages efficiently without deep link discovery.
A 200 response that hides a block — a captcha page or empty shell served instead of real content.
Another name for a crawler, especially a Scrapy component defining how to follow links and parse responses.
A proxy mode that keeps the same IP for a set duration so multi-step, logged-in flows are not broken by rotation.
Data shaped into predictable fields and types (rows, JSON objects) ready for storage or analysis, as opposed to free-form HTML.
Deliberately slowing the request rate to stay under limits and avoid triggering anti-bot defenses.
A limit on how long a request may run before being aborted; common on slow renders or unresponsive targets.
A signature derived from the TLS ClientHello (cipher suites, extensions, order) that reveals the real client library behind any user agent.
The negotiation that establishes an encrypted connection; its ClientHello is what TLS fingerprinting inspects.
Filtering already-seen URLs (often via a hash set or Bloom filter) so the crawler does not refetch pages.
The prioritized queue of URLs waiting to be crawled, managing order, politeness and revisits.
A header string identifying the client browser and OS; mismatched or default values are an easy anti-bot tell.
Validating the UA string against TLS and JS signals; inconsistencies expose spoofed clients.
Pausing automation until a target element appears, ensuring async content has loaded before extraction.
A filter between client and app that blocks requests matching malicious or bot-like rule sets.
A program that systematically follows links to discover and fetch pages across a site or the web.
The automated extraction of data from websites by programmatically requesting pages and parsing their content into structured form.
A hosted service that fetches and renders target pages on your behalf, handling proxies, browsers and anti-bot evasion behind a single HTTP endpoint.
The W3C protocol and API standard for controlling browsers, the basis of Selenium.
Detecting automation frameworks by their injected properties and protocol artifacts.
A fingerprint from WebGL rendering and reported GPU strings, revealing the graphics stack of the client.
A callback URL the API posts results to when an asynchronous job finishes, removing the need to poll.
A persistent bidirectional connection used for live data; scraping it requires speaking the socket protocol.
Browser APIs that issue background HTTP calls; intercepting them often reveals a clean JSON API to scrape directly.
A query language for navigating XML/HTML trees, more expressive than CSS selectors for complex targeting.
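Several entries above fit together in practice: retrying failed requests, growing the wait with exponential backoff, and honoring a Retry-After header on 429 responses. The sketch below is one minimal way to combine them using only the Python standard library. The names `backoff_delay`, `should_retry` and `fetch_with_retries` are hypothetical, the set of retryable status codes is an assumption rather than a rule, and Retry-After is assumed to carry a number of seconds (it can also be an HTTP date, which this sketch does not handle).

```python
import random
import time

# Assumption: these statuses are treated as transient and worth retrying.
RETRYABLE = {408, 429, 500, 502, 503, 504}


def backoff_delay(attempt, base=1.0, cap=60.0, retry_after=None, jitter=0.0):
    """Return the seconds to wait before the next try (attempt is 0-based)."""
    if retry_after is not None:
        # Prefer the server's hint, assumed here to be a number of seconds.
        return min(float(retry_after), cap)
    # Geometric growth: base, 2*base, 4*base, ... capped at `cap`.
    delay = min(cap, base * 2 ** attempt)
    # Optional random jitter (a fraction of the delay) to spread out retries.
    return delay + random.uniform(0, jitter * delay)


def should_retry(status, attempt, max_attempts=5):
    """Retry only transient statuses, and only while attempts remain."""
    return status in RETRYABLE and attempt < max_attempts - 1


def fetch_with_retries(send, max_attempts=5, sleep=time.sleep):
    """Call `send()` until it succeeds, fails permanently or attempts run out.

    `send` is a hypothetical callable returning (status, headers, body);
    wrap whichever HTTP client you use so it matches that shape.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if not should_retry(status, attempt, max_attempts):
            return status, body
        sleep(backoff_delay(attempt, retry_after=headers.get("Retry-After")))
```

Passing `sleep` and `send` as parameters keeps the loop testable without network access or real waiting. A 200 response can still be a soft block, so in a real scraper the body would also need checking before a result is accepted.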