WAF Blocks Your `curl` Scraper With 403s and JS Challenges? Move Extraction Into the Logged-In Browser With a JSON-LD Bookmarklet

July 2, 2026·5 min read

A client was moving house, digitally. For years they had sold through a third-party platform, and now the catalog needed to come over to the new WordPress site I was building. The job sounded simple: pull their product data, meaning titles, prices, images, and descriptions, from the old platform into the new site. One thing worth stating up front: this was the client's own data, on a platform where they held a legitimate account and were logged in as an authorized user. This is about migrating your own content, not harvesting someone else's site.

The platform offered no usable export, so I took the classic route: a PHP script with cURL on my server, reading product pages one at a time. The first request came back 403. I assumed a missing header and sent a real browser user-agent. Still 403. Sometimes the response was not a 403 at all but an HTML page containing a JavaScript challenge, one of those "checking your browser" interstitials that have to execute before any real content appears. A few attempts later my server's IP was flagged and everything was refused outright. Rotating user-agents changed nothing.

Why this was a dead end

A modern WAF looks at far more than the user-agent. It looks at the TLS fingerprint (cURL and a browser shake hands differently at the TLS level), at whether the client can execute JavaScript, at behavioral patterns, and at IP reputation. My script ran from a datacenter IP and could not execute JavaScript. To a WAF, that is a textbook bot. There are ways to win: headless browsers, residential proxies, fingerprint spoofing. But that is a brittle arms race that gets more expensive every month, and it felt wrong for work that was entirely legitimate. I stopped and asked a different question: who already passes every one of those checks?

The answer is obvious once you see it: the client's own browser, logged into their own account. It passes the TLS check because it is a browser. It passes the JS challenge because it executes JavaScript. It raises no suspicion because it is exactly the legitimate visitor the WAF exists to protect. So instead of forcing my server to impersonate a browser, move the extraction into the browser.

The fix: a bookmarklet plus JSON-LD

The tool: a bookmarklet, an ordinary bookmark whose URL starts with javascript:. The client's editor opens one of their product pages as usual, clicks the bookmark, done. No extension to install, no tooling, no need for me to touch their account.

The second half: do not parse the HTML. Product pages on most commercial platforms embed schema.org structured data for SEO in a script[type="application/ld+json"] tag. That layer is machine-readable and far more stable than the markup around it. The platform has every incentive to keep it valid, because Google reads it. CSS classes can change with any deploy; JSON-LD is close to a contract.

{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Product",
  "image": ["https://cdn.example.com/p/123.jpg"],
  "description": "Product description here.",
  "offers": {
    "@type": "Offer",
    "price": "249000",
    "priceCurrency": "IDR"
  }
}
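
The example above is the simplest shape. In practice a JSON-LD tag may also hold a top-level array, an object with an "@graph" array, or a node whose "@type" is itself an array. The sketch below shows one way to cover those cases on already-parsed data; findProductNode is a hypothetical helper name and the sample objects are made up for illustration, not taken from the client's platform.

```javascript
// Hypothetical helper: return the first schema.org Product node found in one
// parsed JSON-LD document, or null when there is none.
function findProductNode(data) {
  if (data === null || typeof data !== 'object') return null;

  // Candidate nodes: a top-level array, an "@graph" array, or the object itself.
  let nodes;
  if (Array.isArray(data)) {
    nodes = data;
  } else if (Array.isArray(data['@graph'])) {
    nodes = data['@graph'];
  } else {
    nodes = [data];
  }

  for (const node of nodes) {
    if (node === null || typeof node !== 'object') continue;
    // "@type" may be a single string or an array of strings.
    const types = [].concat(node['@type'] || []);
    if (types.includes('Product')) return node;
  }
  return null;
}

// Sample shapes (illustrative data only).
const singleNode = { '@type': 'Product', name: 'A' };
const graphDoc = {
  '@graph': [{ '@type': 'WebPage' }, { '@type': ['Product', 'Thing'], name: 'B' }],
};
const listDoc = [{ '@type': 'BreadcrumbList' }, { '@type': 'Product', name: 'C' }];

console.log(findProductNode(singleNode).name); // A
console.log(findProductNode(graphDoc).name);   // B
console.log(findProductNode(listDoc).name);    // C
console.log(findProductNode({ '@type': 'WebPage' })); // null
```

Keeping this logic in a pure function makes it easy to check against a few saved JSON-LD samples before handing the bookmark to an editor.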

The bookmarklet reads every JSON-LD tag on the page, finds the node with @type: "Product", normalizes the fields I need, and posts them to a REST endpoint on the new WordPress site. This is the readable version; to use it, minify it to one line and prefix it with javascript: when saving it as a bookmark:

(async () => {
  const scripts = document.querySelectorAll('script[type="application/ld+json"]');
  let product = null;
 
  for (const tag of scripts) {
    try {
      const data = JSON.parse(tag.textContent);
      const nodes = Array.isArray(data) ? data : data['@graph'] || [data];
      /* "@type" may be a string or an array of strings */
      product = nodes.find((n) => n && [].concat(n['@type']).includes('Product'));
      if (product) break;
    } catch (err) {
      /* skip invalid JSON */
    }
  }
 
  if (!product) {
    alert('No Product JSON-LD found on this page.');
    return;
  }
 
  /* Parenthesized so an empty offers array also falls back to {} */
  const offer = (Array.isArray(product.offers) ? product.offers[0] : product.offers) || {};
  const payload = {
    key: 'SHARED_MIGRATION_KEY',
    name: product.name || '',
    price: String(offer.price ?? ''), /* ?? keeps a legitimate price of 0 */
    currency: offer.priceCurrency || '',
    images: [].concat(product.image || []),
    description: product.description || '',
    sourceUrl: location.href,
  };
 
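  try {
    const res = await fetch('https://new-site.com/wp-json/migrate/v1/product', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    });
    alert(res.ok ? 'Imported: ' + payload.name : 'Import failed: HTTP ' + res.status);
  } catch (err) {
    /* fetch rejects on network failures and on requests blocked by CORS or the page's CSP */
    alert('Import failed: ' + err.message);
  }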
})();
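
As an alternative to minifying by hand, the readable source can be percent-encoded into a javascript: URL, since browsers decode the URL body before running it. This is a minimal sketch of that step, meant to run in Node or a browser console; toBookmarkletUrl is a hypothetical helper name, and the tiny source string stands in for the real script.

```javascript
// Hypothetical helper: wrap JavaScript source in a javascript: URL.
// Percent-encoding preserves newlines and quotes, so the source does not
// have to be squeezed onto one line first.
function toBookmarkletUrl(source) {
  return 'javascript:' + encodeURIComponent(source.trim());
}

// Stand-in source; in practice this would be the full script above.
const demoSource = `
(async () => {
  alert('Hello from ' + location.hostname);
})();
`;

const bookmarkletUrl = toBookmarkletUrl(demoSource);
console.log(bookmarkletUrl.startsWith('javascript:')); // true
console.log(decodeURIComponent(bookmarkletUrl.slice('javascript:'.length)) === demoSource.trim()); // true
```

The encoded form is longer than a minified one, and bookmark length limits vary by browser, so minifying first may still be worthwhile for larger scripts.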

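On the WordPress side, the receiving endpoint is registered through the REST API and creates a draft product. I put the key in the request body rather than a custom header to keep CORS simple. A cross-origin JSON POST triggers a preflight either way, but WordPress's REST API answers it by default and already allows the Content-Type header, whereas a custom header would have to be added to the allowed list: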

add_action('rest_api_init', function () {
    register_rest_route('migrate/v1', '/product', [
        'methods'             => 'POST',
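        'permission_callback' => function (WP_REST_Request $request) {
            $params   = $request->get_json_params();
            $key      = (string) ($params['key'] ?? '');
            $expected = (string) get_option('migration_shared_key', '');
            // Reject when the option is unset: hash_equals('', '') would otherwise pass.
            return $expected !== '' && hash_equals($expected, $key);
        },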
        'callback'            => function (WP_REST_Request $request) {
            $data = $request->get_json_params();
 
            $post_id = wp_insert_post([
                'post_type'    => 'product',
                'post_status'  => 'draft',
                'post_title'   => sanitize_text_field($data['name'] ?? ''),
                'post_content' => wp_kses_post($data['description'] ?? ''),
            ], true);
 
            if (is_wp_error($post_id)) {
                return new WP_Error('insert_failed', 'Could not create draft', ['status' => 500]);
            }
 
            update_post_meta($post_id, '_price', sanitize_text_field($data['price'] ?? ''));
            update_post_meta($post_id, '_currency', sanitize_text_field($data['currency'] ?? ''));
            update_post_meta($post_id, '_source_images', array_map('esc_url_raw', (array) ($data['images'] ?? [])));
            update_post_meta($post_id, '_source_url', esc_url_raw($data['sourceUrl'] ?? ''));
 
            return ['id' => $post_id, 'status' => 'draft'];
        },
    ]);
});
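
On the CORS point, it helps to see which request headers the browser treats as needing server approval. The sketch below is a simplified reading of the Fetch standard's safelist, not a full implementation (it ignores value length limits and the Range header); corsUnsafeHeaderNames and the X-Migration-Key header are hypothetical names used only for illustration.

```javascript
// Header names a cross-origin request may send without the server approving
// them in Access-Control-Allow-Headers (simplified from the Fetch standard).
const SAFELISTED_HEADERS = new Set([
  'accept',
  'accept-language',
  'content-language',
  'content-type',
]);

// Content-Type values that do not, on their own, force a preflight.
const SIMPLE_CONTENT_TYPES = new Set([
  'application/x-www-form-urlencoded',
  'multipart/form-data',
  'text/plain',
]);

// Hypothetical helper: list the lowercased header names the preflight
// response would have to allow for this request to go through.
function corsUnsafeHeaderNames(headers) {
  const unsafe = [];
  for (const [name, value] of Object.entries(headers)) {
    const lower = name.toLowerCase();
    if (!SAFELISTED_HEADERS.has(lower)) {
      unsafe.push(lower);
    } else if (lower === 'content-type') {
      // Strip parameters such as "; charset=utf-8" before comparing.
      const mime = String(value).split(';')[0].trim().toLowerCase();
      if (!SIMPLE_CONTENT_TYPES.has(mime)) unsafe.push(lower);
    }
  }
  return unsafe;
}

// Key in the body: only Content-Type needs approval.
console.log(corsUnsafeHeaderNames({ 'Content-Type': 'application/json' }).join(', '));
// content-type

// Key in a custom header: the server must also allow that header name.
console.log(
  corsUnsafeHeaderNames({
    'Content-Type': 'application/json',
    'X-Migration-Key': 'secret',
  }).join(', ')
);
// content-type, x-migration-key
```

Under these assumptions the JSON POST is preflighted in both cases. What changes is the set of header names the server has to allow, and that is worth checking against the CORS headers your WordPress version actually sends.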

The endpoint deliberately creates drafts only, never published posts. The editor clicked the bookmark on each product page, drafts appeared in the WordPress admin, and they reviewed everything before going live. Once the migration was done, I removed the endpoint and the key.

The takeaway

The irony of this story: I spent half a day trying to make a server look like a browser while a real browser sat idle at the client's office. The checklist I took home:

  • When a WAF blocks server-side access to data you legitimately own, the authenticated browser is the API. Stop fighting; change position.
  • Prefer JSON-LD or other structured data over DOM scraping. Selectors are fragile; the SEO contract is stable.
  • A bookmarklet is a zero-install tool a non-technical editor can run. Perfect for a one-off migration task.
  • Keep the receiving endpoint authenticated, make it create drafts only, and shut it down when you are done.

And above all: do this only for data that is yours or your client's, on a platform where you are an authorized user. The same technique on somebody else's site is not a migration; it is a problem.