lightpanda-browser

Network Layer

The network layer in Lightpanda handles all HTTP communication, WebSocket connections, robots.txt compliance, and proxy configuration. Built on top of libcurl, it provides a high-performance, asynchronous networking stack purpose-built for headless browser automation.


Overview

Lightpanda’s network layer is composed of four main modules: the HTTP core (a libcurl wrapper), the high-level HTTP client, the WebSocket transport for the CDP server, and the robots.txt module. Each is covered in the sections below.

HTTP Core

The HTTP core module wraps libcurl to provide connection management, header handling, and multiplexed request processing.

Connections

Each HTTP connection is represented by a Connection struct that wraps a libcurl easy handle. Connections are initialized with configuration for timeouts, redirects, proxy, TLS, and compression:

const conn = try Net.Connection.init(ca_blob, config);
try conn.setURL("https://example.com");
try conn.setMethod(.GET);
const status = try conn.request(&http_headers);

Key connection settings applied at initialization:

| Setting | Description | Source |
| --- | --- | --- |
| timeout_ms | Total request timeout | Config.httpTimeout() |
| connect_timeout_ms | TCP connection timeout | Config.httpConnectTimeout() |
| max_redirs | Maximum redirect hops | Config.httpMaxRedirects() |
| follow_location | Automatic redirect following | Always enabled |
| accept_encoding | Compression support (gzip, etc.) | Auto-detected |

HTTP Methods

The network layer supports all standard HTTP methods through the Method enum:

pub const Method = enum(u8) {
    GET, PUT, POST, DELETE, HEAD, OPTIONS, PATCH, PROPFIND,
};

Headers

The Headers struct manages request headers as a libcurl linked list. It supports iteration, cookie injection, and custom header addition:

var headers = try Net.Headers.init(user_agent_header);
defer headers.deinit();
try headers.add("Content-Type: application/json");

Response headers can be read through two iterator types: CurlHeaderIterator for live responses, and ListHeaderIterator for injected responses (used by CDP request interception).

Authentication

The AuthChallenge struct parses WWW-Authenticate and Proxy-Authenticate headers, supporting Basic and Digest authentication schemes:

const challenge = try AuthChallenge.parse(status, header_value);
// challenge.source: .server or .proxy
// challenge.scheme: .basic or .digest

Multi-Handle Management

For concurrent requests, the Handles struct wraps libcurl’s multi interface. It manages a pool of connections with configurable host-level concurrency limits:

var handles = try Net.Handles.init(config);
try handles.add(&conn);
const running = try handles.perform();
try handles.poll(extra_fds, timeout_ms);

The multi handle uses curl_multi_poll for efficient I/O multiplexing, and readMessage retrieves completed transfer results.

HTTP Client

The HttpClient is the high-level network client tied to a browser page. It manages the full lifecycle of HTTP requests including queuing, transfer management, robots.txt checking, and CDP network event integration.

Request Lifecycle

  1. Request creation – A Request is created with URL, method, headers, and optional body
  2. Robots.txt check – If robots enforcement is enabled, the client checks whether the URL is allowed before proceeding
  3. Queue or execute – If connections are available, the request starts immediately; otherwise it is queued
  4. Transfer – A Transfer object tracks the active request, manages response buffering, and handles callbacks
  5. Completion – Response data is delivered, the connection is returned to the pool, and queued requests are started

Connection Pooling

The client maintains connections through a doubly-linked list (in_use). When a transfer completes, the connection handle is recycled. The max_host_connections setting (from Config.httpMaxHostOpen()) limits concurrent connections per host, preventing resource exhaustion.

Request Queuing

When all connection handles are in use, new requests are added to a TransferQueue (a doubly-linked list). As transfers complete, queued requests are dequeued and started automatically.
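The queue-or-execute decision, connection recycling, and the per-host limit described above can be sketched together in Python (hypothetical names; a simplified model, not the Zig implementation):

```python
# Sketch: a pool with a per-host connection limit. Requests start
# immediately when a slot is free; otherwise they are queued, and
# completing transfers dequeue the next runnable request.
from collections import defaultdict, deque

class TransferPool:
    def __init__(self, max_host_connections: int):
        self.max_host = max_host_connections
        self.in_use = defaultdict(int)     # host -> active transfer count
        self.queue = deque()               # pending (host, request) pairs
        self.started = []                  # order in which requests ran

    def request(self, host: str, req: str) -> None:
        if self.in_use[host] < self.max_host:
            self._start(host, req)
        else:
            self.queue.append((host, req))  # all slots busy: queue it

    def _start(self, host: str, req: str) -> None:
        self.in_use[host] += 1
        self.started.append(req)

    def complete(self, host: str) -> None:
        self.in_use[host] -= 1             # recycle the connection slot
        for i, (h, req) in enumerate(self.queue):
            if self.in_use[h] < self.max_host:
                del self.queue[i]          # dequeue first runnable request
                self._start(h, req)
                break

pool = TransferPool(max_host_connections=2)
for r in ("a1", "a2", "a3"):
    pool.request("example.com", r)
# "a1" and "a2" start immediately; "a3" waits in the queue
pool.complete("example.com")
# the freed slot starts "a3"
```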

Request Interception (CDP)

The HTTP client supports CDP network request interception. When a CDP client is attached, requests can be paused, modified, or fulfilled before they reach the network:

// CDP can intercept requests at two stages:
// 1. Before the request is sent (Fetch.requestPaused)
// 2. After response headers arrive (Fetch.authRequired)

The intercepted counter tracks paused requests so that network idle detection works correctly even when requests are held by the CDP layer.
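One possible bookkeeping for this (a Python sketch with hypothetical names, not the actual counter logic) moves a paused request from the in-flight count to an intercepted count, so idleness requires both to reach zero:

```python
# Sketch: network-idle detection must treat requests held by the CDP
# layer as still outstanding, even though no transfer is active.
class NetworkActivity:
    def __init__(self):
        self.in_flight = 0
        self.intercepted = 0

    def start(self) -> None:
        self.in_flight += 1

    def finish(self) -> None:
        self.in_flight -= 1

    def pause(self) -> None:
        # CDP holds the request (Fetch.requestPaused)
        self.in_flight -= 1
        self.intercepted += 1

    def resume(self) -> None:
        # CDP continues or fulfills the request
        self.intercepted -= 1
        self.in_flight += 1

    def is_idle(self) -> bool:
        return self.in_flight == 0 and self.intercepted == 0

activity = NetworkActivity()
activity.start()
activity.pause()
# no active transfer, but the network is NOT idle
```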

Proxy Configuration

Proxy support is configured at initialization and can be changed at runtime through CDP:

// Set via config at startup
const http_proxy = config.httpProxy();

// Changed at runtime via CDP
client.setProxy("http://proxy.example.com:8080");
client.restoreOriginalProxy(); // Revert to config value

Both HTTP and HTTPS proxy protocols are supported. When a proxy is configured, TLS verification settings are applied to the proxy connection as well.

TLS Configuration

TLS verification can be controlled globally. When a CA certificate blob is provided, full host and peer verification is performed. Without it, verification can be disabled (useful for development):

conn.setTlsVerify(true, use_proxy);
// Verifies both ssl_verify_host and ssl_verify_peer
// Also applies to proxy if use_proxy is true

WebSocket

The WebSocket module implements the WebSocket protocol (RFC 6455) for the CDP server. It handles the upgrade handshake, message framing, and bidirectional communication.

Connection Upgrade

The upgrade process validates the HTTP upgrade request, checking for required headers (Upgrade, Connection, Sec-WebSocket-Key, Sec-WebSocket-Version) and computing the Sec-WebSocket-Accept response using SHA-1:

try ws_conn.upgrade(request);
// Validates HTTP/1.1, required headers
// Responds with 101 Switching Protocols
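The accept-key computation is small enough to show in full. Here is a Python sketch of the RFC 6455 rule (the GUID constant is fixed by the spec; `accept_key` is an illustrative name):

```python
# Sec-WebSocket-Accept per RFC 6455: SHA-1 over the client key
# concatenated with a fixed GUID, then base64-encoded.
import base64
import hashlib

WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def accept_key(sec_websocket_key: str) -> str:
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode()).digest()
    return base64.b64encode(digest).decode()

# Example key/accept pair from RFC 6455 section 1.3:
print(accept_key("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```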

Message Types

The WebSocket implementation supports all standard frame types:

| Type | Description |
| --- | --- |
| text | UTF-8 text data (primary type for CDP JSON messages) |
| binary | Binary data frames |
| close | Connection close with status code |
| ping | Keep-alive ping |
| pong | Keep-alive pong response |

Message Reading

The Reader is a streaming parser that handles WebSocket framing, including variable-length headers, masking (client-to-server), and message fragmentation:

var reader = try websocket.Reader(true).init(allocator);
// true = expect masked frames (from client)

while (try reader.next()) |msg| {
    switch (msg.type) {
        .text => handleCDPMessage(msg.data),
        .ping => try ws.sendPong(msg.data),
        .close => break,
        else => {}, // binary and pong frames are ignored here
    }
}

The reader uses a dynamically growing buffer (starting at 16KB) and supports messages up to CDP_MAX_MESSAGE_SIZE. Fragmented messages are reassembled automatically.
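The framing rules the reader implements can be sketched in Python. `parse_frame` below is an illustrative single-frame parser, not the streaming Zig implementation; it assumes the whole frame is already buffered and ignores fragmentation:

```python
# Sketch of RFC 6455 framing: FIN/opcode byte, mask bit, a 7-bit
# payload length with 16-bit (126) and 64-bit (127) extensions, an
# optional 4-byte masking key, then the XOR-masked payload.
import struct

def parse_frame(buf: bytes):
    fin = bool(buf[0] & 0x80)
    opcode = buf[0] & 0x0F
    masked = bool(buf[1] & 0x80)
    length = buf[1] & 0x7F
    pos = 2
    if length == 126:                     # 16-bit extended length
        (length,) = struct.unpack_from(">H", buf, pos); pos += 2
    elif length == 127:                   # 64-bit extended length
        (length,) = struct.unpack_from(">Q", buf, pos); pos += 8
    key = buf[pos:pos + 4] if masked else b""
    pos += 4 if masked else 0
    payload = buf[pos:pos + length]
    if masked:                            # client-to-server unmasking
        payload = bytes(b ^ key[i % 4] for i, b in enumerate(payload))
    return fin, opcode, payload

# Build a masked text frame carrying "hi" and parse it back
key = b"\x01\x02\x03\x04"
masked_payload = bytes(b ^ key[i % 4] for i, b in enumerate(b"hi"))
frame = bytes([0x81, 0x80 | 2]) + key + masked_payload
fin, opcode, payload = parse_frame(frame)
# fin=True, opcode=1 (text), payload=b"hi"
```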

SIMD-Optimized Masking

Client-to-server WebSocket frames are XOR-masked per the protocol specification. Lightpanda uses SIMD instructions when available for efficient unmasking:

// Uses std.simd.suggestVectorLength for platform-optimal
// vector width, falling back to scalar XOR for small payloads
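The masking rule itself is a repeating 4-byte XOR. The Python sketch below shows the scalar rule and a widened variant that XORs 8-byte chunks at a time, mimicking what the SIMD path does in hardware (function names are illustrative):

```python
# Scalar masking rule: payload[i] ^= key[i % 4]. Applying the mask
# twice restores the original bytes (XOR is involutive).
def xor_mask(payload: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % 4] for i, b in enumerate(payload))

# Widened variant: repeat the 4-byte key across the chunk width and
# XOR whole chunks at once, with a scalar fallback for the tail.
def xor_mask_wide(payload: bytes, key: bytes, width: int = 8) -> bytes:
    rep = (key * (width // 4 + 1))[:width]
    rep_int = int.from_bytes(rep, "big")
    out = bytearray()
    i = 0
    while i + width <= len(payload):
        chunk = int.from_bytes(payload[i:i + width], "big")
        out += (chunk ^ rep_int).to_bytes(width, "big")
        i += width
    # scalar tail, keeping the absolute index for key alignment
    out += bytes(b ^ key[j % 4] for j, b in enumerate(payload[i:], start=i))
    return bytes(out)

key = b"\xde\xad\xbe\xef"
data = b"CDP message payload"
assert xor_mask_wide(data, key) == xor_mask(data, key)
assert xor_mask_wide(xor_mask_wide(data, key), key) == data
```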

Sending Messages

The WsConnection provides several send methods optimized for different use cases. The send path handles WouldBlock by temporarily switching the socket to blocking mode, avoiding the need for a write queue.
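The WouldBlock strategy can be sketched with a Python socket (`send_all` is a hypothetical helper, not Lightpanda's API; the real code is Zig):

```python
# Sketch: if a non-blocking send cannot complete, switch the socket
# to blocking mode, drain the rest of the write, then restore
# non-blocking mode. A brief stall replaces a write queue.
import socket

def send_all(sock: socket.socket, data: bytes) -> None:
    try:
        sent = sock.send(data)
    except BlockingIOError:           # WouldBlock: nothing was written
        sent = 0
    if sent < len(data):
        sock.setblocking(True)        # temporarily block to finish
        try:
            sock.sendall(data[sent:])
        finally:
            sock.setblocking(False)   # restore event-loop mode

a, b = socket.socketpair()
a.setblocking(False)
send_all(a, b"hello")
received = b.recv(5)
# received == b"hello"
```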

Robots.txt Handling

The robots.txt module implements RFC 9309 for web crawler access control. This is particularly important for Lightpanda’s use in scraping and automation scenarios.

Parsing

The parser processes robots.txt files line by line, extracting User-agent, Allow, and Disallow directives:

var robots = try Robots.fromBytes(allocator, "MyBot", robots_txt_content);
defer robots.deinit(allocator);

Pattern Matching

Three pattern types are supported:

| Pattern | Example | Behavior |
| --- | --- | --- |
| Prefix | /admin/ | Matches any path starting with the pattern |
| Exact | /admin$ | Matches the path exactly (trailing $ anchor) |
| Wildcard | /*.php | * matches zero or more characters |

Rules are sorted by pattern length (longest first), with Allow winning ties. This ensures the most specific rule takes precedence.
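The matching and precedence rules above can be sketched in Python (`pattern_matches` and `is_allowed` are illustrative names, not the Zig API):

```python
# Sketch of RFC 9309 rule evaluation: '*' wildcards, a '$' end
# anchor, longest pattern first, and Allow winning a length tie.
import re

def pattern_matches(pattern: str, path: str) -> bool:
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    # translate '*' to '.*', escaping everything else
    regex = ".*".join(re.escape(part) for part in core.split("*"))
    regex = "^" + regex + ("$" if anchored else "")
    return re.match(regex, path) is not None

def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    # rules: (directive, pattern). Sort longest pattern first, with
    # "allow" ahead of "disallow" on equal length; no match = allowed.
    ordered = sorted(rules, key=lambda r: (-len(r[1]), r[0] != "allow"))
    for directive, pattern in ordered:
        if pattern_matches(pattern, path):
            return directive == "allow"
    return True

rules = [("disallow", "/admin/"),
         ("allow", "/admin/public/"),
         ("disallow", "/*.php")]
# is_allowed(rules, "/admin/public/x") -> True  (longer Allow wins)
# is_allowed(rules, "/admin/secret")   -> False
# is_allowed(rules, "/index.php")      -> False (wildcard)
```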

Robot Store

The RobotStore provides thread-safe caching of parsed robots.txt files per domain. It uses a case-insensitive hash map protected by a mutex:

// Check cache first
if (store.get(domain)) |entry| {
    switch (entry) {
        .present => |robots| return robots.isAllowed(path),
        .absent => return true, // No robots.txt found
    }
}

// Fetch and cache
const robots = try store.robotsFromBytes(user_agent, bytes);
try store.put(domain, robots);

When a robots.txt file is not found (HTTP 404), the store records it as .absent to avoid repeated fetches.

Integration with HTTP Client

The HTTP client integrates robots.txt checking into the request lifecycle. When robots enforcement is enabled, the client:

  1. Checks the RobotStore cache for the target domain
  2. If not cached, queues the original request and fetches the robots.txt first
  3. Evaluates the rules against the request path
  4. Proceeds or blocks the request based on the result

Multiple requests to the same uncached domain are batched in a pending_robots_queue so that only one robots.txt fetch is made per domain.
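The one-fetch-per-domain batching can be sketched as follows (a Python sketch with illustrative names):

```python
# Sketch: the first request for an uncached domain triggers the
# robots.txt fetch; later requests for the same domain just join its
# pending list and are released together when the rules arrive.
from collections import defaultdict

class PendingRobots:
    def __init__(self):
        self.pending = defaultdict(list)   # domain -> waiting requests
        self.fetches = []                  # robots.txt fetches issued

    def request(self, domain: str, req: str) -> None:
        if domain not in self.pending:
            self.fetches.append(domain)    # first request: start fetch
        self.pending[domain].append(req)

    def robots_arrived(self, domain: str) -> list:
        return self.pending.pop(domain, [])  # release all waiters

q = PendingRobots()
q.request("a.com", "r1")
q.request("a.com", "r2")
q.request("b.com", "r3")
# q.fetches == ["a.com", "b.com"]  -- one fetch per domain
# q.robots_arrived("a.com") == ["r1", "r2"]
```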

Configuration Reference

The network layer is configured through the Config module. Key settings:

| Setting | Method | Description |
| --- | --- | --- |
| HTTP timeout | httpTimeout() | Total request timeout in milliseconds |
| Connect timeout | httpConnectTimeout() | TCP connection timeout in milliseconds |
| Max redirects | httpMaxRedirects() | Maximum number of HTTP redirects to follow |
| Max host connections | httpMaxHostOpen() | Concurrent connections per host |
| HTTP proxy | httpProxy() | Proxy URL (HTTP/HTTPS) |
| TLS verify | tlsVerifyHost() | Enable/disable TLS certificate verification |