Web Scraping Glossary

Plain-English definitions for the concepts that show up when you build scrapers in production. Explore how they connect in the concept map, or search and browse the full A–Z list below.

151 terms · 407 connections

200 OK

HTTP Errors

The standard success status — the request was handled and the body contains the expected content.

Related:HTTP Status Code HTTP Response

301 Moved Permanently

HTTP Errors

A permanent redirect to a new URL; scrapers should follow it and update stored links.

Related:HTTP Status Code 302 Found Redirect Chain

302 Found

HTTP Errors

A temporary redirect; the resource lives elsewhere for now, often used in login or anti-bot interstitials.

Related:HTTP Status Code 301 Moved Permanently Redirect Chain JavaScript Challenge

304 Not Modified

HTTP Errors

A cache validation response telling the client its stored copy is still valid, so no body is sent.

Related:HTTP Status Code HTTP Caching

400 Bad Request

HTTP Errors

The server rejected the request as malformed — bad parameters, headers or body.

Related:HTTP Status Code Request Payload HTTP Headers

401 Unauthorized

HTTP Errors

Authentication is missing or invalid; commonly a wrong or absent API key.

Related:HTTP Status Code API Key 403 Forbidden

403 Forbidden

HTTP Errors

The server understood the request but refuses it — frequently an anti-bot block on a flagged IP or fingerprint.

Related:HTTP Status Code IP Ban Bot Detection Cloudflare Web Application Firewall 401 Unauthorized

404 Not Found

HTTP Errors

The requested resource does not exist at that URL; a signal to drop or fix a crawl target.

Related:HTTP Status Code Crawl Link Extraction

408 Request Timeout

HTTP Errors

The server gave up waiting for the client to finish the request.

Related:HTTP Status Code Timeout 504 Gateway Timeout

429 Too Many Requests

HTTP Errors

The client exceeded a rate limit; responses often include a Retry-After header indicating when to resume.

Related:HTTP Status Code Rate Limit Rate Limiting (defense) Retry Logic Exponential Backoff Throttling
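
A minimal sketch of reading a Retry-After value, assuming the header has already been pulled from the response. The header can hold either a number of seconds or an HTTP date. The function name `retry_after_seconds` is illustrative and not part of any library.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Return the number of seconds to wait, or None if the value is unparseable."""
    value = value.strip()
    if value.isdigit():
        # Form 1: a plain delay in seconds, e.g. "120"
        return float(value)
    try:
        # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the client may retry immediately
    return max(0.0, (when - now).total_seconds())
```

A scraper would typically sleep for the returned value before retrying, and fall back to its own backoff schedule when the result is None.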

500 Internal Server Error

HTTP Errors

A generic server-side failure; often transient and worth a retry after a short wait.

Related:HTTP Status Code Retry Logic 502 Bad Gateway

502 Bad Gateway

HTTP Errors

An upstream server returned an invalid response to a gateway or proxy in the chain.

Related:HTTP Status Code 503 Service Unavailable Proxy Server Retry Logic

503 Service Unavailable

HTTP Errors

The server is overloaded or down for maintenance; also returned by anti-bot pages while they evaluate a client.

Related:HTTP Status Code Retry Logic Cloudflare JavaScript Challenge 502 Bad Gateway

504 Gateway Timeout

HTTP Errors

A gateway timed out waiting for an upstream server, common behind slow proxies or renders.

Related:HTTP Status Code Timeout 408 Request Timeout Proxy Server

aiohttp

Python Web Scraping

An async HTTP client/server framework for high-concurrency Python scraping over asyncio.

Related:HTTPX Concurrency Asynchronous Request Scrapy

AJAX

Web Technologies

Fetching data in the background via JS without a full page reload, often the real source of a page's content.

Related:XHR / Fetch JavaScript JSON Single-Page Application Reverse Engineering

Akamai Bot Manager

Challenge Handling

An enterprise anti-bot platform using sensor JS and behavioral scoring to detect automated traffic.

Related:Bot Detection Behavioral Analysis Fingerprinting Sensor Data

API Endpoint

Web Scraping APIs

A specific URL that accepts requests for one operation, e.g. a scrape endpoint that takes a target URL and returns the rendered page.

Related:REST API POST /request HTTP Request API Key

API Key

Web Scraping APIs

A secret token sent with each request to authenticate the caller, meter usage and bill the account.

Related:API Endpoint HTTP Request Rate Limit Credits 401 Unauthorized

Asynchronous Request

Web Scraping APIs

A request submitted as a job that completes later, retrieved by polling or a webhook — useful for slow renders at scale.

Related:Polling Webhook Concurrency aiohttp

BeautifulSoup

Python Web Scraping

A Python library for parsing HTML/XML and querying it with simple, forgiving selectors.

Related:lxml HTML Parsing CSS Selector Requests (Python) Data Extraction

Behavioral Analysis

Challenge Handling

Scoring how a client moves, scrolls, types and times actions to tell humans from scripted automation.

Related:Mouse Movement Tracking Bot Score reCAPTCHA Akamai Bot Manager PerimeterX (HUMAN)

Bot Detection

Challenge Handling

Techniques that distinguish automated clients from humans using fingerprints, behavior and network signals.

Related:Fingerprinting Behavioral Analysis Headless Detection Bot Score Browser Access Challenges 403 Forbidden

Bot Score

Challenge Handling

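A rating anti-bot systems assign each request to estimate how likely it is to be automated; requests judged risky get challenged or blocked. The scale's direction varies by vendor: in Cloudflare Bot Management and reCAPTCHA v3, a low score means likely automated.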

Related:Bot Detection Behavioral Analysis IP Reputation Managed Challenge DataDome

Breadth-First Crawl

Crawling

A strategy that crawls all links at one depth before going deeper, good for broad coverage.

Related:URL Frontier Crawl Web Crawler
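
The level-by-level order can be sketched with a queue. This is a toy illustration: the `links` dictionary is a stand-in for fetching a page and extracting its links, and `bfs_order` is a made-up name.

```python
from collections import deque


def bfs_order(start, links, max_depth=2):
    """Return (url, depth) pairs in the order a breadth-first crawl visits them."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()  # FIFO: the oldest, shallowest URL goes first
        order.append((url, depth))
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order
```

Swapping the `popleft()` for a `pop()` would turn the same loop into a depth-first crawl, which is why the queue discipline is the defining choice here.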

Browser Access Challenges

Challenge Handling

The interactive checks and challenge pages that modern sites present to browsers, and the way a genuine browser environment meets them during normal page loading.

Related:Bot Detection TLS Fingerprint CAPTCHA Solving Cloudflare Web Scraping API

Browser Automation

Web Automation

Driving a real browser programmatically to navigate, click and extract from pages that resist plain HTTP scraping.

Related:Headless Browser Puppeteer Playwright Selenium Page Interaction

Browser Context

Web Automation

An isolated browser profile (cookies, storage, cache) so parallel sessions do not leak state into each other.

Related:Session Cookies Playwright Sticky Session

Browser Fingerprint

Challenge Handling

The composite of JS-exposed properties — screen, fonts, plugins, timezone, languages — that identifies a browser instance.

Related:Fingerprinting Canvas Fingerprint WebGL Fingerprint navigator.webdriver User-Agent

Browser Integrity Check

Challenge Handling

A probe that verifies the browser exposes the full, consistent set of real APIs expected of a genuine user agent.

Related:Browser Fingerprint JavaScript Challenge Headless Detection Cloudflare

Browser Rendering

Web Scraping APIs

Running a target page inside a real (often headless) browser so JavaScript executes and the final DOM is captured — essential for SPAs.

Related:Headless Browser JavaScript Single-Page Application DOM Screenshot API Web Scraping API

Canvas Fingerprint

Challenge Handling

A fingerprint from rendering hidden graphics to a <canvas>; subtle GPU/driver differences make the output device-specific.

Related:Fingerprinting WebGL Fingerprint Browser Fingerprint

CAPTCHA

Challenge Handling

A challenge designed to be easy for humans and hard for bots, gating access when a request looks automated.

Related:reCAPTCHA hCaptcha Cloudflare Turnstile CAPTCHA Solving CAPTCHA Challenge Page

CAPTCHA Challenge Page

HTTP Errors

An interstitial demanding a human challenge before the real page loads, often returned with 403 or 503.

Related:CAPTCHA Soft Block 403 Forbidden JavaScript Challenge Cloudflare Turnstile

CAPTCHA Solving

Challenge Handling

Automatically clearing a CAPTCHA via solver services or token reuse so a flow can continue unattended.

Related:CAPTCHA reCAPTCHA hCaptcha Cloudflare Turnstile Browser Access Challenges

cf_clearance Cookie

Challenge Handling

The cookie Cloudflare issues after a passed challenge; reusing it (with matching IP/fingerprint) skips re-challenging.

Related:Cloudflare Cookie Challenge Cookies Session

Chrome DevTools Protocol

Web Automation

The low-level interface (CDP) Puppeteer and Playwright use to instrument and control Chromium.

Related:Puppeteer Playwright WebDriver Headless Browser

Click Automation

Web Automation

Simulating user clicks to expand sections, accept dialogs or trigger AJAX loads during a scrape.

Related:Page Interaction Form Submission Wait for Selector AJAX

Cloaking

Challenge Handling

Serving different content to bots than to humans, used both by sites to mislead scrapers and by spammers to fool crawlers.

Related:Soft Block Redirect Chain User-Agent Analysis Honeypot

Cloudflare

Challenge Handling

A major CDN and security provider whose Bot Management issues JS, managed and Turnstile challenges to filter automation.

Related:Cloudflare Turnstile JavaScript Challenge Managed Challenge cf_clearance Cookie Cloudflare Error 1020 Web Application Firewall

Cloudflare Error 1020

HTTP Errors

“Access denied” raised by a Cloudflare firewall (WAF) rule that the request violated — a common hard block.

Related:Cloudflare Web Application Firewall 403 Forbidden IP Ban

Cloudflare Turnstile

Challenge Handling

Cloudflare's privacy-first challenge that validates users with passive signals instead of image puzzles.

Related:Cloudflare CAPTCHA Managed Challenge CAPTCHA Solving

Concurrency

Web Scraping APIs

The number of requests in flight at once; higher concurrency speeds throughput but risks rate limits and bans.

Related:Rate Limit Throttling aiohttp Asynchronous Request Crawl Budget
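
One common way to cap the number of requests in flight is a semaphore. The sketch below uses only the standard library `asyncio` module; `run_limited` is an illustrative name, and each job is assumed to be a zero-argument function returning a coroutine (for example, a wrapped HTTP call).

```python
import asyncio


async def run_limited(jobs, limit):
    """Run all jobs, keeping at most `limit` of them in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        # Each job waits here until one of the `limit` slots is free
        async with semaphore:
            return await job()

    # gather() returns results in the same order as the input jobs
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Raising `limit` increases throughput until the target starts answering with rate-limit or block responses, so it is usually tuned per site rather than set once.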

Cookies

Web Scraping APIs

Small key/value pairs the server sets and the client returns on later requests to maintain state such as login or anti-bot clearance.

Related:Session HTTP Headers Cookie Challenge cf_clearance Cookie

Crawl

Crawling

A traversal run that fetches pages, extracts links and enqueues new URLs until a frontier or budget is exhausted.

Related:Web Crawler Crawl Budget URL Frontier Pagination 404 Not Found

Crawl Budget

Crawling

The limited number of requests allotted to a crawl, balanced against rate limits, politeness and resources.

Related:Crawl Concurrency Throttling robots.txt

Credits

Web Scraping APIs

The unit consumed per successful scrape; pay-per-success billing charges credits only when a request returns usable data.

Related:Pay Per Success API Key Rate Limit

CSS

Web Technologies

The styling language for web pages; its selector syntax is reused to target elements when scraping.

Related:HTML CSS Selector DOM

CSS Selector

Web Technologies

A pattern (e.g. `.price > span`) matching DOM elements, the primary way scrapers locate data.

Related:CSS XPath DOM HTML Parsing Data Extraction

Data Extraction

Web Scraping APIs

Turning a fetched page into clean, structured records by selecting the relevant elements and discarding boilerplate.

Related:HTML Parsing Structured Data CSS Selector XPath JSON

Datacenter Proxy

Proxies

A fast, cheap proxy hosted in a data center; easily identified by its ASN and more readily blocked than residential IPs.

Related:Proxy Server Residential Proxy IP Reputation IP Ban

DataDome

Challenge Handling

A real-time bot protection service that scores requests on fingerprint and behavior, serving captchas to suspects.

Related:Bot Detection Fingerprinting CAPTCHA Bot Score

Deobfuscation

Reverse Engineering

Restoring obfuscated code to a readable form to recover the algorithm behind a challenge or hidden API.

Related:Obfuscation Reverse Engineering JavaScript

DOM

Web Technologies

The Document Object Model — the in-memory tree of a page that JS mutates and scrapers traverse after rendering.

Related:HTML JavaScript Browser Rendering CSS Selector Single-Page Application

Exponential Backoff

Web Scraping APIs

A retry strategy that grows the wait between attempts geometrically to avoid hammering a struggling server.

Related:Retry Logic Rate Limit 429 Too Many Requests
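
A small sketch of the delay schedule, assuming a base wait of one second that doubles each attempt up to a cap. The parameter names and defaults are illustrative choices, and the optional jitter shown is the "full jitter" variant (a random wait between zero and the computed delay).

```python
import random


def backoff_delays(attempts, base=1.0, factor=2.0, cap=60.0, jitter=False, rng=random):
    """Yield the wait in seconds before each retry attempt."""
    for attempt in range(attempts):
        # Geometric growth: base, base*factor, base*factor^2, ... limited by cap
        delay = min(cap, base * factor ** attempt)
        if jitter:
            # Randomize so many clients do not retry in lockstep
            delay = rng.uniform(0, delay)
        yield delay
```

Without jitter, `list(backoff_delays(5))` gives waits of 1, 2, 4, 8 and 16 seconds.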

Fingerprinting

Challenge Handling

Collecting many client attributes (TLS, headers, JS APIs, hardware) into a near-unique signature used to identify and track clients.

Related:Browser Fingerprint TLS Fingerprint Canvas Fingerprint WebGL Fingerprint Bot Detection

Form Submission

Web Automation

Filling and submitting forms (search, login) programmatically to reach gated or query-driven content.

Related:Page Interaction Click Automation Session Cookies

Geotargeting

Web Scraping APIs

Routing a request through a proxy in a specific country or city to see the localized version of a site.

Related:Proxy Server Residential Proxy Mobile Proxy Proxy Pool

hCaptcha

Challenge Handling

A privacy-oriented CAPTCHA alternative to reCAPTCHA, formerly the default challenge on Cloudflare-protected sites.

Related:CAPTCHA reCAPTCHA CAPTCHA Solving Cloudflare

Headless Browser

Web Automation

A browser running without a visible window, used to render JS pages at scale on servers.

Related:Browser Rendering Browser Automation Headless Detection Puppeteer Chrome DevTools Protocol

Headless Detection

Challenge Handling

Spotting browsers run without a visible UI via tell-tale flags, missing features or timing quirks.

Related:Headless Browser navigator.webdriver WebDriver Detection Browser Fingerprint

Honeypot

Challenge Handling

A hidden link or field invisible to humans but followed/filled by naive bots, flagging them instantly.

Related:Bot Detection Link Extraction CSS Selector Cloaking

HTML

Web Technologies

The markup language structuring web pages into elements that scrapers parse and select.

Related:DOM CSS HTML Parsing CSS Selector

HTML Parsing

Web Scraping APIs

Building a navigable tree from raw HTML so elements can be queried by tag, class or selector.

Related:HTML DOM CSS Selector XPath BeautifulSoup Data Extraction

HTTP Caching

Web Technologies

Storing responses keyed by URL and validators (ETag, Last-Modified) so unchanged pages return 304 instead of a full body — affecting how fresh scraped data is.

Related:304 Not Modified HTTP Headers HTTP Status Code
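
A sketch of the client side of revalidation, assuming the scraper stored the ETag and Last-Modified values from an earlier response. The helper names are illustrative; the header names `If-None-Match` and `If-Modified-Since` are the standard conditional request headers.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build request headers that ask the server to skip the body if nothing changed."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def resolve_body(status, new_body, cached_body):
    """On 304 the response has no body, so reuse the stored copy."""
    return cached_body if status == 304 else new_body
```

The headers dictionary can be passed to whichever HTTP client the scraper already uses.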

HTTP Headers

Web Scraping APIs

Key/value metadata on a request or response (content type, cookies, user agent) that strongly influences how servers and anti-bot systems respond.

Related:HTTP Request HTTP Response User-Agent Cookies TLS Fingerprint

HTTP Request

Web Scraping APIs

A message a client sends to a server specifying a method, URL, headers and optional body to ask for or submit data.

Related:HTTP Response HTTP Headers User-Agent POST /request HTTP Status Code

HTTP Response

Web Scraping APIs

The server reply to a request, carrying a status code, headers and a body such as HTML or JSON.

Related:HTTP Request HTTP Status Code JSON HTML HTTP Headers

HTTP Status Code

HTTP Errors

A three-digit code in every response signalling outcome: 1xx informational, 2xx success, 3xx redirect, 4xx client error, 5xx server error.

Related:HTTP Response HTTP Request 200 OK 403 Forbidden 429 Too Many Requests 503 Service Unavailable
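
As a rough sketch, the first digit gives the class and a scraper can decide on retries from the specific code. The set of retryable codes below is an assumption drawn from the entries in this glossary (timeouts, rate limits and server-side failures), not a standard list.

```python
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirect",
    4: "client error",
    5: "server error",
}

# Assumed policy: codes that often succeed on a later attempt
RETRYABLE = {408, 429, 500, 502, 503, 504}


def status_class(code):
    """Map a status code to its class by its first digit."""
    return STATUS_CLASSES.get(code // 100, "unknown")


def should_retry(code):
    """True when the code is in the assumed retryable set."""
    return code in RETRYABLE
```

A 403 or 404 is deliberately left out: repeating the same request unchanged rarely helps for those.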

HTTP/2

Web Technologies

A multiplexed binary version of HTTP; its frame and header behavior is itself a fingerprintable signal.

Related:HTTP/2 Fingerprint HTTPX HTTP Headers

HTTP/2 Fingerprint

Challenge Handling

A signature from HTTP/2 frame settings and header ordering that distinguishes real browsers from HTTP libraries.

Related:TLS Fingerprint JA4 HTTP Headers HTTP/2

HTTPX

Python Web Scraping

A modern Python HTTP client supporting HTTP/2 and async, with a largely Requests-compatible API; often used as a more capable alternative to Requests.

Related:Requests (Python) aiohttp HTTP/2 Asynchronous Request

Imperva / Incapsula

Challenge Handling

A WAF and bot-mitigation service that gates traffic with cookies, JS checks and reputation rules.

Related:Web Application Firewall Cookie Challenge Bot Detection IP Reputation

Infinite Scroll

Web Automation

A pattern that loads more content as the user scrolls; automation must scroll repeatedly to harvest it all.

Related:Page Interaction Single-Page Application AJAX Pagination

IP Ban

Challenge Handling

Blocking all traffic from an address after abuse signals; the main reason scrapers rotate proxies.

Related:Rotating Proxy IP Reputation 403 Forbidden Rate Limiting (defense) Proxy Server

IP Reputation

Proxies

A score sites assign an IP based on past behavior and network type; low reputation triggers captchas or blocks.

Related:Residential Proxy Datacenter Proxy IP Ban Bot Score Proxy Pool

JA3

Challenge Handling

A widely used hash of TLS ClientHello fields; mismatched JA3 vs. user agent is a classic bot tell.

Related:TLS Fingerprint JA4 TLS Handshake
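
To make the hash concrete: JA3 joins five ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, point formats) as decimal numbers, with dashes inside a field and commas between fields, then takes the MD5 of that string. GREASE placeholder values are ignored. The sketch below assumes the fields have already been parsed out of the handshake; the function names are illustrative.

```python
import hashlib


def is_grease(value):
    """GREASE placeholders have the form 0x0a0a, 0x1a1a, ... 0xfafa."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the comma- and dash-separated string that JA3 hashes."""
    def join(values):
        return "-".join(str(v) for v in values if not is_grease(v))

    return ",".join(
        [str(version), join(ciphers), join(extensions), join(curves), join(point_formats)]
    )


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """Return the 32-character hex MD5 digest of the JA3 string."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()
```

Because the order of ciphers and extensions is part of the string, two clients offering the same options in a different order produce different hashes.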

JA4

Challenge Handling

A newer, more robust successor to JA3 for fingerprinting TLS clients, harder to spoof naively.

Related:JA3 TLS Fingerprint HTTP/2 Fingerprint

JavaScript

Web Technologies

The language that runs in the browser to build pages dynamically, requiring rendering before data appears in the DOM.

Related:DOM Browser Rendering Single-Page Application AJAX JavaScript Challenge

JavaScript Challenge

Challenge Handling

An interstitial that runs JS the client must execute correctly to prove it is a real browser before the page loads.

Related:Cloudflare Browser Rendering Headless Browser Cookie Challenge 503 Service Unavailable

JSON

Web Scraping APIs

JavaScript Object Notation — the lightweight text format used for request payloads and structured API responses.

Related:REST API HTTP Response Request Payload Structured Data Data Extraction

JSON-LD

Web Technologies

Structured data embedded in pages (often in <script type="application/ld+json">) that scrapers can read directly.

Related:Structured Data JSON Data Extraction HTML Parsing
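
A minimal sketch of pulling JSON-LD blocks out of a page with the standard library alone. The class and function names are illustrative, and malformed blocks are simply skipped.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_json_ld(html):
    parser = JsonLdCollector()
    parser.feed(html)
    return parser.items
```

Each returned item is a plain dictionary or list, so fields such as a product name or price can be read without any selectors.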

Kasada

Challenge Handling

An anti-bot system that ships obfuscated JS challenges and proof-of-work to raise the cost of automation.

Related:JavaScript Challenge Obfuscation Proof of Work Bot Detection

Link Extraction

Crawling

Pulling href targets from a page to feed the frontier, normalizing relative and absolute URLs.

Related:HTML Parsing Crawl URL Frontier URL Deduplication Pagination
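
A sketch of link extraction using only the standard library, assuming the page HTML and its URL are already in hand. Relative links are resolved against the page URL, fragments are dropped, and non-HTTP schemes such as mailto are skipped. The names `LinkCollector` and `extract_links` are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Gather absolute http(s) URLs from the href of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative URLs, then strip any #fragment
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute.startswith(("http://", "https://")):
            self.links.append(absolute)


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

The resulting list would normally pass through deduplication before being added to the frontier.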

lxml

Python Web Scraping

A fast C-backed Python library for parsing HTML/XML with full XPath support.

Related:BeautifulSoup XPath HTML Parsing

Managed Challenge

Challenge Handling

Cloudflare's adaptive interstitial that picks a challenge (JS, Turnstile) based on the request's risk score.

Related:Cloudflare JavaScript Challenge Cloudflare Turnstile Bot Score

Mobile Proxy

Proxies

A proxy routed through a cellular carrier; because many real users share each CGNAT IP, banning one risks blocking legitimate visitors, so sites tend to trust these addresses.

Related:Proxy Server Residential Proxy IP Reputation Geotargeting

Mouse Movement Tracking

Challenge Handling

Recording cursor paths and timing; perfectly straight or absent movement betrays automation.

Related:Behavioral Analysis Page Interaction Bot Detection

navigator.webdriver

Challenge Handling

A JS property that reports true when the browser is under automation; one of the first things anti-bot scripts read to catch unpatched headless browsers.

Related:WebDriver Detection Headless Detection Browser Fingerprint Selenium

Obfuscation

Reverse Engineering

Deliberately scrambling JS so anti-bot logic and tokens are hard to read or replicate.

Related:Deobfuscation Reverse Engineering Kasada Sensor Data Proof of Work

Page Interaction

Web Automation

Programmatic clicks, scrolls and key presses that drive a page through steps to reveal or load data.

Related:Click Automation Form Submission Infinite Scroll Mouse Movement Tracking Browser Automation

Pagination

Web Scraping APIs

Traversing multi-page result sets by following next-page links or incrementing page parameters until data is exhausted.

Related:Crawl Link Extraction Infinite Scroll Data Extraction
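
The page-parameter variant can be sketched as a generator. Here `fetch_page` is a hypothetical callable that takes a page number and returns a list of items (empty when the data runs out); `max_pages` is an assumed safety limit so a misbehaving site cannot cause an endless loop.

```python
def iter_pages(fetch_page, start=1, max_pages=1000):
    """Yield items page by page until an empty page or the page limit is reached."""
    page = start
    while page < start + max_pages:
        items = fetch_page(page)
        if not items:
            break  # an empty page signals the end of the result set
        yield from items
        page += 1
```

Sites that paginate with next-page links instead would follow the extracted link in place of incrementing `page`.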

pandas

Python Web Scraping

A Python data-analysis library used to clean, transform and export scraped data into tables and files.

Related:Structured Data Data Extraction JSON

Pay Per Success

Web Scraping APIs

A pricing model where only successful requests are billed, so blocked or failed attempts cost nothing.

Related:Credits Scrappey Web Scraping API

PDF Rendering

Web Scraping APIs

Generating a PDF of a fully rendered page, preserving layout and styling for reports or archival.

Related:Screenshot API Browser Rendering Headless Browser

PerimeterX (HUMAN)

Challenge Handling

A bot-defense platform (now HUMAN) relying on heavy client-side JS sensors and behavioral signals.

Related:Bot Detection Behavioral Analysis Sensor Data Fingerprinting

Playwright

Web Automation

A cross-browser automation library (Chromium, Firefox, WebKit) with auto-waiting, contexts and network interception.

Related:Puppeteer Playwright (Python) Browser Context Wait for Selector Browser Automation

Playwright (Python)

Python Web Scraping

Python bindings for Playwright, driving Chromium, Firefox and WebKit with auto-waiting through both sync and async APIs.

Related:Playwright Selenium (Python) Browser Rendering Wait for Selector

Polling

Web Scraping APIs

Repeatedly checking a job endpoint until an asynchronous task reports completion.

Related:Asynchronous Request Webhook Retry Logic
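
A generic polling loop might look like the sketch below. `check` is a hypothetical callable that queries the job endpoint and returns the result, or None while the job is still running; the `sleep` and `clock` parameters exist only so the loop can be tested without real waiting.

```python
import time


def poll(check, interval=1.0, timeout=30.0, sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a non-None result or the timeout would be exceeded."""
    deadline = clock() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        # Stop before sleeping if the next attempt would land past the deadline
        if clock() + interval > deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval)
```

A fixed interval is the simplest choice; long-running jobs are often polled with a growing interval instead to reduce wasted requests.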

POST /request

Web Scraping APIs

Scrappey's primary endpoint: a POST call describing the target URL, command and options, returning the rendered result.

Related:Scrappey API Endpoint HTTP Request Browser Rendering Session Request Payload

Proof of Work

Challenge Handling

A computational puzzle the client must solve, taxing mass automation more than individual users.

Related:Kasada JavaScript Challenge Obfuscation
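
To illustrate the idea only, here is a generic hashcash-style puzzle: find a nonce so that the SHA-256 of the challenge plus the nonce starts with a given number of zero bits. This is not the scheme of any particular vendor; the function names and the `challenge:nonce` format are made up for the sketch.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count how many zero bits a byte string starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def _digest(challenge, nonce):
    return hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()


def solve(challenge, difficulty):
    """Brute-force the smallest nonce meeting the difficulty (costly for the client)."""
    for nonce in count():
        if leading_zero_bits(_digest(challenge, nonce)) >= difficulty:
            return nonce


def verify(challenge, nonce, difficulty):
    """Check a claimed solution with a single hash (cheap for the server)."""
    return leading_zero_bits(_digest(challenge, nonce)) >= difficulty
```

Each extra bit of difficulty doubles the expected work to solve while verification stays at one hash, which is the asymmetry that makes the puzzle expensive at scale.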

Proxy Pool

Proxies

A managed set of IPs the scraper draws from, balancing freshness, geography and reputation across requests.

Related:Proxy Server Rotating Proxy IP Reputation Geotargeting Concurrency

Proxy Server

Proxies

An intermediary that forwards requests so the target sees the proxy IP, not yours — the backbone of large-scale scraping.

Related:Residential Proxy Datacenter Proxy Rotating Proxy Proxy Pool IP Ban Geotargeting

Puppeteer

Web Automation

A Node library controlling Chromium over the DevTools Protocol for automation and rendering.

Related:Playwright Chrome DevTools Protocol Headless Browser Browser Automation

Rate Limit

Web Scraping APIs

A cap on how many requests a client may make per time window; exceeding it typically returns HTTP 429.

Related:API Key 429 Too Many Requests Throttling Concurrency Retry Logic

Rate Limiting (defense)

Challenge Handling

A server defense capping requests per IP or token over time, returning 429s or blocks when exceeded.

Related:Rate Limit 429 Too Many Requests IP Ban Throttling

reCAPTCHA

Challenge Handling

Google's CAPTCHA with score-based (v3) and image-challenge (v2) modes that weigh behavior and reputation.

Related:CAPTCHA hCaptcha CAPTCHA Solving Bot Score Behavioral Analysis

Redirect Chain

HTTP Errors

A sequence of redirects between the requested URL and the final page; long chains can hide cloaking or anti-bot gates.

Related:301 Moved Permanently 302 Found Cloaking

Request Payload

Web Scraping APIs

The JSON body of a request that carries parameters such as the target URL, headers, proxy choice and browser commands.

Related:POST /request JSON HTTP Headers Proxy Server Cookies

Requests (Python)

Python Web Scraping

The popular Python HTTP client for simple synchronous scraping of static pages and APIs.

Related:HTTPX BeautifulSoup TLS Fingerprint HTTP Request HTTP Headers

Residential Proxy

Proxies

A proxy whose IP belongs to a real ISP-assigned home connection, making traffic look like an ordinary user and harder to block.

Related:Proxy Server Datacenter Proxy Mobile Proxy IP Reputation Geotargeting

REST API

Web Scraping APIs

An HTTP interface that exposes resources via predictable URLs and verbs (GET/POST), exchanging JSON payloads — the common shape of a scraping API.

Related:API Endpoint HTTP Request HTTP Response JSON HTTP Status Code

Retry Logic

Web Scraping APIs

Re-attempting failed requests, often with exponential backoff, to ride out transient errors and rate limits.

Related:Rate Limit 429 Too Many Requests 503 Service Unavailable Exponential Backoff Timeout

Reverse Engineering

Reverse Engineering

Analyzing a site's client code and network traffic to understand and replay its private APIs or anti-bot logic.

Related:Deobfuscation XHR / Fetch AJAX Sensor Data Obfuscation

robots.txt

Crawling

A file declaring which paths crawlers may or may not access; respecting it is core to ethical scraping.

Related:Sitemap Web Crawler Crawl Budget
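
Python's standard library ships a parser for this file. The sketch below assumes the robots.txt text has already been downloaded; `build_robots_checker` and the `example-bot` user agent are illustrative names.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent):
    """Return a function that says whether `user_agent` may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""
allowed = build_robots_checker(rules, "example-bot")
```

With these rules, `allowed("https://example.com/private/data.html")` is false while other paths on the site are permitted.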

Rotating Proxy

Proxies

A proxy that assigns a new IP per request or interval from a pool, spreading traffic to dodge rate limits and bans.

Related:Proxy Server Proxy Pool Sticky Session IP Ban Rate Limiting (defense)

Scrappey

Web Scraping APIs

A pay-per-success web data API that combines headless browser rendering, rotating proxies and full browser session handling behind one request endpoint.

Related:Web Scraping API POST /request Browser Rendering Session Pay Per Success CAPTCHA Solving

Scrapy

Python Web Scraping

A batteries-included Python framework for building crawlers with spiders, pipelines and built-in concurrency.

Related:Spider Web Crawler aiohttp Data Extraction Crawl

Screenshot API

Web Scraping APIs

A capability that returns a rendered image of a page, useful for visual QA, archiving or capturing content that resists text extraction.

Related:Browser Rendering Headless Browser PDF Rendering Page Interaction

Selenium

Web Automation

The veteran browser-automation framework driving browsers via the WebDriver protocol across many languages.

Related:WebDriver Selenium (Python) navigator.webdriver Browser Automation WebDriver Detection

Selenium (Python)

Python Web Scraping

Python bindings for Selenium WebDriver, automating real browsers for JS-heavy pages.

Related:Selenium WebDriver Playwright (Python) Browser Rendering

Sensor Data

Challenge Handling

Encrypted telemetry payloads (Akamai/PerimeterX) generated client-side from device and behavior signals and sent for scoring.

Related:Akamai Bot Manager PerimeterX (HUMAN) Obfuscation Behavioral Analysis

Session

Web Scraping APIs

A persisted browser/proxy context that keeps cookies, headers and IP stable across requests so multi-step flows stay authenticated.

Related:Cookies Sticky Session Browser Context POST /request Proxy Server

Single-Page Application

Web Technologies

A site that renders content client-side via JS, so the initial HTML is near-empty and scraping it usually requires a rendering browser or a direct call to the underlying data API.

Related:JavaScript DOM Browser Rendering AJAX Infinite Scroll

Sitemap

Crawling

An XML file listing a site's URLs to help crawlers find pages efficiently without deep link discovery.

Related:robots.txt Crawl URL Frontier Web Crawler
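
A sketch of reading the URLs out of a sitemap with the standard library, assuming the XML text has already been fetched and comes from a source you trust. Sitemap elements live in the `http://www.sitemaps.org/schemas/sitemap/0.9` namespace, so the lookup has to include it. `sitemap_urls` is an illustrative name.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
```

The same function also works on a sitemap index file, where each `<loc>` points at another sitemap to fetch rather than at a page.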

Soft Block

HTTP Errors

A 200 response that hides a block — a captcha page or empty shell served instead of real content.

Related:CAPTCHA Challenge Page 200 OK Cloaking Bot Detection
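
Because the status code cannot be trusted here, scrapers often add a content check. The heuristic below is only a sketch: the marker phrases and the minimum length are hypothetical values that would need tuning per site, and a page that legitimately mentions one of the phrases would be flagged by mistake.

```python
# Hypothetical phrases that tend to appear on block or challenge pages
BLOCK_MARKERS = ("captcha", "verify you are human", "access denied")


def looks_like_soft_block(status, body, min_length=500, markers=BLOCK_MARKERS):
    """Guess whether a 200 response is really a block page or an empty shell."""
    if status != 200:
        return False  # not a soft block; handle it through the status code
    text = body.lower()
    if any(marker in text for marker in markers):
        return True
    # A very short body suggests an empty shell instead of real content
    return len(text.strip()) < min_length
```

A stricter alternative is to check for an element the real page must contain, such as a product title, and treat its absence as a block.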

Spider

Crawling

Another name for a crawler, especially a Scrapy component defining how to follow links and parse responses.

Related:Web Crawler Scrapy Crawl Link Extraction

Sticky Session

Proxies

A proxy mode that keeps the same IP for a set duration so multi-step, logged-in flows are not broken by rotation.

Related:Rotating Proxy Session Proxy Pool Cookies

Structured Data

Web Scraping APIs

Data shaped into predictable fields and types (rows, JSON objects) ready for storage or analysis, as opposed to free-form HTML.

Related:JSON Data Extraction pandas JSON-LD

Throttling

Web Scraping APIs

Deliberately slowing the request rate to stay under limits and avoid triggering anti-bot defenses.

Related:Rate Limit Concurrency Retry Logic Rate Limiting (defense)
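
A simple client-side throttle can enforce a minimum gap between requests. In this sketch the class name `Throttle` and its parameters are illustrative; the `clock` and `sleep` arguments default to the real time functions and are injectable only to make testing easy.

```python
import time


class Throttle:
    """Space out calls so that at most `rate` of them start per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None

    def wait(self):
        """Block until the next request is allowed to start."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Calling `throttle.wait()` immediately before each request keeps a single worker under the chosen rate; several workers would need to share one instance behind a lock.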

Timeout

Web Scraping APIs

A limit on how long a request may run before being aborted; common on slow renders or unresponsive targets.

Related:Retry Logic 408 Request Timeout 504 Gateway Timeout Wait for Selector

TLS Fingerprint

Challenge Handling

A signature derived from the TLS ClientHello (cipher suites, extensions, order) that reveals the real client library behind any user agent.

Related:JA3 JA4 TLS Handshake Fingerprinting HTTP/2 Fingerprint

TLS Handshake

Challenge Handling

The negotiation that establishes an encrypted connection; its ClientHello is what TLS fingerprinting inspects.

Related:TLS Fingerprint JA3 JA4

URL Deduplication

Crawling

Filtering already-seen URLs (often via a hash set or Bloom filter) so the crawler does not refetch pages.

Related:URL Frontier Link Extraction Crawl
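
Deduplication works best on a normalized form of each URL. The sketch below uses a plain set and makes several assumptions that do not hold for every site: query parameter order is treated as irrelevant, fragments and any user:password part are dropped, and default ports are removed. The names `canonical_url` and `SeenUrls` are illustrative.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url):
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Assumption: parameter order does not change the page
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, host, path, query, ""))


class SeenUrls:
    """Remember canonical URLs and report whether one is new."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        key = canonical_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

For very large crawls the set can be replaced with a Bloom filter, trading a small chance of skipping a new URL for much lower memory use.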

URL Frontier

Crawling

The prioritized queue of URLs waiting to be crawled, managing order, politeness and revisits.

Related:Crawl URL Deduplication Breadth-First Crawl Web Crawler

User-Agent

Web Scraping APIs

A header string identifying the client browser and OS; mismatched or default values are an easy anti-bot tell.

Related:HTTP Headers User-Agent Analysis Browser Fingerprint HTTP Request

User-Agent Analysis

Challenge Handling

Validating the UA string against TLS and JS signals; inconsistencies expose spoofed clients.

Related:User-Agent TLS Fingerprint Fingerprinting Bot Detection

Wait for Selector

Web Automation

Pausing automation until a target element appears, ensuring async content has loaded before extraction.

Related:Page Interaction Click Automation Timeout Single-Page Application Playwright

Web Application Firewall

Challenge Handling

A filter between client and app that blocks requests matching malicious or bot-like rule sets.

Related:Cloudflare Imperva / Incapsula Cloudflare Error 1020 403 Forbidden Bot Detection

Web Crawler

Crawling

A program that systematically follows links to discover and fetch pages across a site or the web.

Related:Crawl Spider URL Frontier Link Extraction Web Scraping

Web Scraping

Web Scraping APIs

The automated extraction of data from websites by programmatically requesting pages and parsing their content into structured form.

Related:Web Scraping API Data Extraction HTML Parsing Web Crawler Bot Detection

Web Scraping API

Web Scraping APIs

A hosted service that fetches and renders target pages on your behalf, handling proxies, browsers and verification flows behind a single HTTP endpoint.

Related:Scrappey REST API Browser Rendering Proxy Server Browser Access Challenges Pay Per Success

WebDriver

Web Automation

The W3C protocol and API standard for controlling browsers, the basis of Selenium.

Related:Selenium Chrome DevTools Protocol WebDriver Detection navigator.webdriver

WebDriver Detection

Challenge Handling

Detecting automation frameworks by their injected properties and protocol artifacts.

Related:navigator.webdriver Selenium WebDriver Headless Detection

WebGL Fingerprint

Challenge Handling

A fingerprint from WebGL rendering and reported GPU strings, revealing the graphics stack of the client.

Related:Canvas Fingerprint Fingerprinting Browser Fingerprint

Webhook

Web Scraping APIs

A callback URL the API posts results to when an asynchronous job finishes, removing the need to poll.

Related:Asynchronous Request Polling REST API

WebSocket

Web Technologies

A persistent bidirectional connection used for live data; scraping it requires a client that speaks the WebSocket protocol.

Related:XHR / Fetch AJAX Reverse Engineering

XHR / Fetch

Web Technologies

Browser APIs that issue background HTTP calls; intercepting them often reveals a clean JSON API to scrape directly.

Related:AJAX JSON Reverse Engineering WebSocket

XPath

Web Technologies

A query language for navigating XML/HTML trees, more expressive than CSS selectors for complex targeting.

Related:CSS Selector lxml DOM HTML Parsing