r/thewebscrapingclub 1h ago

Drop your indie product link — what are you building?


r/thewebscrapingclub 6h ago

An agent that monitors your scrapers and auto-fixes them when sites change — is this useful or am I overengineering?

1 Upvotes

Full disclosure: I work on Intuned, so this is our thing — but I'm genuinely after feedback from people who run scrapers at scale, not upvotes.

The problem we kept hitting: scrapers don't fail loudly. A site ships a redesign, selectors rot, and you find out days later when someone asks where the data went.

What we built tries to close that loop:

- Monitors each run (success rate, failure count, result size, data drift)

- When something looks off, an agent compares healthy vs failed runs and reads the traces before flagging anything — confirms it's a real break, auto-dismisses false positives

- If you let it, it opens a branch, writes a scoped fix, and can merge + deploy

The autonomy is four separate toggles, so you can run monitoring-only or full autopilot. It's still experimental.
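
To make the monitoring step concrete, here is a minimal sketch of a per-run health check in Python. It is not Intuned's implementation: the `Run` fields, the thresholds, and the function names are all assumptions made up for illustration. It only shows how a run that "succeeds" but returns far less data than usual can still be flagged.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    attempted: int    # pages or items the run tried to scrape (hypothetical field)
    succeeded: int    # how many of those succeeded (hypothetical field)
    result_size: int  # number of records the run produced (hypothetical field)


def success_rate(run: Run) -> float:
    return run.succeeded / run.attempted if run.attempted else 0.0


def looks_off(run: Run, healthy: list[Run],
              min_rate: float = 0.9, max_size_drop: float = 0.5) -> list[str]:
    """Return the reasons a run looks suspicious; an empty list means it looks fine."""
    reasons = []
    if success_rate(run) < min_rate:
        reasons.append("success rate below threshold")
    # Compare result size against the median of known-healthy runs.
    baseline = median(r.result_size for r in healthy) if healthy else None
    if baseline and run.result_size < baseline * (1 - max_size_drop):
        reasons.append("result size dropped vs healthy baseline")
    return reasons


healthy_runs = [Run(100, 100, 500), Run(100, 98, 520), Run(100, 99, 480)]
# Every request succeeded, but the selectors returned almost nothing.
print(looks_off(Run(100, 100, 40), healthy_runs))
```

The second check is the one that catches silent breakage: the success rate stays at 100% while the result size collapses.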

Questions I actually care about:

- How do you handle silent breakage today? Monitoring, manual checks, stakeholder complaints?

- Would you ever trust an agent to auto-merge/deploy a scraper fix, or is that a hard no?

- What would make this trustworthy enough to leave on?

Docs if you want the detail: intunedhq.com/docs/main/02-intuned-agent/self-healing-projects


r/thewebscrapingclub 9h ago

What's your favorite small library or utility for scraping?

1 Upvotes

r/thewebscrapingclub 23h ago

Build-time scraping + a cron rebuild loop: a $0 way to run an auto-updating stats site with no API, no DB, no server

5 Upvotes

I wanted a sports stats site that updates itself after every game, but without paying for a stats API (~$40/mo) or running a scraper + DB + server.

Ended up with a pattern I think is reusable, so sharing it.

The idea: don't scrape at runtime — scrape at build time, bake the data into static HTML, and let a cron job rebuild the site on a schedule.

Three pieces worth stealing:

  1. Hidden JSON endpoints. A lot of sports sites' own frontends hit undocumented JSON endpoints (ESPN's site.web.api.espn.com/.../athletes/{id}/gamelog etc.). No key, returns clean JSON. I just fetch those directly in the build instead of rendering HTML and parsing the DOM.

  2. Bake at build, not runtime. The fetch runs during the static build, so the output is plain HTML — perfect Lighthouse, no client JS for data, nothing to attack, $0 hosting. Tradeoff: data is only as fresh as your last build, which leads to…

  3. A cron → deploy-hook rebuild loop. A GitHub Action fires hourly and pings a Cloudflare Pages deploy hook, which rebuilds → re-fetches → re-bakes. The whole "live updating" illusion with zero infra:

    on:
      schedule: [{ cron: '0 * * * *' }] # hourly
    jobs:
      rebuild:
        runs-on: ubuntu-latest
        steps:
          - run: curl -X POST "${{ secrets.CF_DEPLOY_HOOK }}"

Parsing gotcha: ESPN returns stats as parallel arrays — a labels array aligned by index with a stats array (and a separate events dict keyed by eventId). So you index by label name, not by a position you guessed:

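    const i = labels.indexOf('PTS'); // find the column by name, not by a guessed position
    const pts = i >= 0 ? parseFloat(row.stats[i]) : null; // align stats[] to labels[]; null if the label is missing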

Robustness: every fetch is wrapped with a timeout + try/catch and degrades to null/placeholder, so if ESPN changes a field the build still succeeds (the page just hides that stat) instead of breaking the deploy.
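
A minimal sketch of that degrade-to-null pattern, written in Python for illustration (the build code itself isn't shown in the post, so this is not the author's implementation). The function names, the `opener` parameter, and the "-" placeholder are assumptions; `opener` exists only so the fetch can be swapped out in tests.

```python
import json
from urllib.request import urlopen


def fetch_json_or_none(url, timeout=10.0, opener=urlopen):
    """Fetch and decode JSON; return None on any network, timeout, or parse failure."""
    try:
        with opener(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError subclasses;
        # malformed JSON and bad encodings raise ValueError subclasses.
        return None


def stat_or_placeholder(payload, key, placeholder="-"):
    """Read one stat, falling back to a placeholder if the fetch failed or the field is gone."""
    if not isinstance(payload, dict):
        return placeholder
    value = payload.get(key)
    return placeholder if value is None else value
```

The build then renders whatever `stat_or_placeholder` returns, so a renamed field costs you one stat on the page rather than the whole deploy.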

Live example I built it on: wembanyama.club — the game log / stats pages are all baked this way and refresh hourly. Happy to share the repo / answer anything.


r/thewebscrapingclub 15h ago

SaaS cold start for AI tools — subscriptions + affiliate, global market. Any tips?

1 Upvotes

r/thewebscrapingclub 1d ago

How to Scrape RedNote (Xiaohongshu) Without Coding

2 Upvotes

If you've tried to pull data from RedNote — the English name for Xiaohongshu (小红书) — you already know it's one of the harder social platforms to scrape. There's no public API, the mobile and web apps are heavily obfuscated, and most "tutorials" stop at a curl command that breaks within a week.

This post covers why RedNote is hard to scrape, the three realistic ways to do it, and a no-code path if you don't want to maintain a scraper yourself.

Why RedNote is harder than TikTok or Instagram

A few things make Xiaohongshu a pain compared to other platforms:

  1. Signed request headers. Every API call to edith.xiaohongshu.com needs valid x-s, x-t, and x-s-common headers. These are generated by an obfuscated JS function (window._webmsxyw) that changes periodically. Replay a captured header and you get a 461 / sign-error within minutes.
  2. Aggressive anti-bot. Hit the same endpoint a few times from a datacenter IP and you'll get a sliding-captcha or a silent empty response. Residential proxies + pacing are basically mandatory.
  3. No official API. Unlike YouTube or (historically) Twitter, there's no developer program. Everything is reverse-engineered from the web/app.
  4. Fast-moving frontend. The note detail payload structure changes, fields get renamed, and the noteId / xsec_token coupling means you often can't fetch a note without a fresh token from the feed it appeared in.

So the real problem isn't writing the first request — it's keeping it working.

Option 1 — Roll your own (most control, most maintenance)

The DIY stack usually looks like:

  • A headless browser (Playwright) to log in and grab the signing context, or a reverse-engineered JS signer ported to Python/Node.
  • A residential proxy pool with rotation.
  • Retry + captcha-handling logic.
  • A parser that survives field renames.

This works, and gives you full control. The catch: you're now maintaining an anti-bot arms race. Most teams I've seen spend more time fixing the signer after a Xiaohongshu update than using the data. Fine if scraping is your product — overkill if you just need the data.
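
As an illustration of "a parser that survives field renames", here is a minimal Python sketch that resolves each output field through a list of aliases. The field names below are hypothetical, chosen only to show the pattern; they are not taken from Xiaohongshu's actual payloads.

```python
from typing import Any, Optional

# Hypothetical aliases: each output field lists the names it might appear under upstream.
FIELD_ALIASES = {
    "note_id": ["note_id", "noteId", "id"],
    "title": ["title", "display_title"],
    "likes": ["liked_count", "likes", "likeCount"],
}


def first_present(payload: dict, names: list) -> Optional[Any]:
    """Return the value of the first alias that exists and is not None."""
    for name in names:
        if name in payload and payload[name] is not None:
            return payload[name]
    return None


def parse_note(payload: dict) -> dict:
    """Normalize a raw payload; fields that cannot be found come back as None."""
    return {field: first_present(payload, names)
            for field, names in FIELD_ALIASES.items()}


print(parse_note({"noteId": "abc", "display_title": "hi", "likeCount": 3}))
```

When a rename ships, the fix is one new alias in the table instead of a change to the parsing logic, and a missing field shows up as None rather than an exception.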

Option 2 — Generic scraping platforms (Apify, Bright Data)

Marketplaces like Apify have community "actors" for Xiaohongshu, and Bright Data sells a managed dataset/scraper. This offloads the maintenance.

Trade-offs:

  • Cost. Bright Data in particular gets expensive fast at volume.
  • Coverage gaps. Community actors break when Xiaohongshu updates and the fix depends on whoever maintains that actor.
  • RedNote specifically is thin. Most actors are TikTok/Instagram-first; Xiaohongshu support tends to lag.

Option 3 — A managed API (no code)

If you just want clean JSON without running browsers or babysitting a signer, a managed scraping API is the no-code path. You send a profile URL or note ID, you get structured data back. Someone else eats the anti-bot maintenance.

Things to check before picking one:

  • Does it actually cover RedNote/Xiaohongshu? Many "social scraping APIs" advertise TikTok + Instagram and quietly omit Xiaohongshu. Test the endpoint you actually need.

  • Profiles, posts, and comments? Comments are where most competitor/audience analysis happens, and they're the first thing cheap APIs drop.

  • Output format. You want flat, predictable JSON — not a raw HTML dump you have to parse again.
  • Pricing model. Per-request beats per-compute-second for predictable cost.

We build SpiderHubs partly to fill the RedNote gap — one API across TikTok, Instagram, YouTube, Douyin and Xiaohongshu, returning profiles, posts and comments as clean JSON, positioned as an affordable Apify / Bright Data alternative. (Disclosure: I work on it.) But the checklist above applies to whatever you pick.

A no-code workflow if you just need the data once

You don't always need an API. If it's a one-off pull:

  1. Find the creator/topic feed you care about.
  2. Use a managed scraper or no-code monitoring tool to pull the latest posts + engagement into a sheet/JSON.
  3. Set it to re-run daily if you're tracking competitors over time — the daily delta is usually what you actually want, not a one-time dump.

That last point is the real reason most people scrape Xiaohongshu: tracking competitors and trending content over time, not a single snapshot. Whatever route you pick, design for the recurring pull, not the first request.
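
A minimal Python sketch of the "daily delta" idea, assuming each day's pull is stored as a mapping from post ID to like count. That snapshot shape is an assumption for illustration; a real pull would carry more fields.

```python
def daily_delta(yesterday: dict, today: dict) -> dict:
    """Compare two snapshots that map a post ID to its like count (hypothetical shape)."""
    new_posts = sorted(set(today) - set(yesterday))
    removed = sorted(set(yesterday) - set(today))
    # Only posts present on both days can have an engagement change.
    like_change = {
        post_id: today[post_id] - yesterday[post_id]
        for post_id in today.keys() & yesterday.keys()
        if today[post_id] != yesterday[post_id]
    }
    return {"new": new_posts, "removed": removed, "like_change": like_change}


print(daily_delta({"a": 10, "b": 5}, {"a": 15, "c": 1}))
```

Storing each day's snapshot and diffing it is usually cheaper than trying to make a single pull answer "what changed".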

TL;DR

  • RedNote is hard because of signed headers (x-s/x-t), aggressive anti-bot, and no official API.
  • DIY = full control + permanent maintenance.
  • Apify/Bright Data = less maintenance, but cost + thin Xiaohongshu coverage.

  • Managed API = no code; just verify it actually covers Xiaohongshu (profiles + posts + comments) and returns clean JSON.

  • Whatever you choose, build for the daily recurring pull, not the one-time request.

What's your current setup for Xiaohongshu data — DIY signer, Apify, or something else? Curious what's holding up best after their recent updates.


r/thewebscrapingclub 1d ago

VectorTrace Update - A local first scraper extension.

1 Upvotes

A few days ago I posted here about building VectorTrace. You guys gave me real feedback❤️

It's a Chrome extension that scrapes like a point-and-click tool, but uses on-device AI to recover when site layouts change.

The problem it solves

Every scraper breaks when a site redesigns. CSS selectors are positional: they don't understand what they're pointing at. When the DOM changes, they silently return null or grab the wrong element.

The standard response is "go fix your selectors manually." That's fine once. It's not fine when you're monitoring 10 sites and they all change.

The extension now has 7 distinct statuses:

  • OK — extracted, matches original
  • HEALED — was broken, auto-repaired
  • SELECTOR_BROKEN — element completely gone
  • ⚠️ TEXT_CONTENT_CHANGED — selector works but grabbed the wrong element (phantom swap)
  • 🔀 TAG_CHANGED — <h1> became a <p>, structural drift
  • 👁️ ELEMENT_HIDDEN — display:none, visibility:hidden
  • 📄 EMPTY_PAGE — page has no meaningful content (bot block, loading error, etc.)

What it doesn't do (yet)

Being upfront: no pagination, no scheduled extraction, no multi-page crawl. It's a single-page point-and-click scraper. Those are on the roadmap.

GitHub: https://github.com/SathiyaSenpai/VectorTrace


r/thewebscrapingclub 2d ago

Config-over-code for brittle data ingestion

0 Upvotes

I’ve been thinking about how brittle data ingestion gets when upstream sources constantly drift.

The annoying part usually isn’t getting data once. It’s keeping integrations alive when fields move, names change, sessions behave differently, or payloads get new edge cases.

I started moving more of this into a config-over-code approach where external sources are described instead of hardcoded. The surprising part is that I ended up writing less code overall, because a lot of the repeated scraper/ETL logic became source definitions instead of one-off implementation details.

Curious if other data engineering / scraping folks have run into this same pain at scale.


r/thewebscrapingclub 4d ago

Browser fingerprinting & anti-bot benchmark - update

2 Upvotes

r/thewebscrapingclub 4d ago

At What Point Does a Scraping Stack Stop Being a Moat and Become Technical Debt?

0 Upvotes

I spent years in the CMS industry, and the current "build vs buy" debate in web scraping feels eerily familiar.

Back in the early 2000s, agencies built custom CMS platforms because it seemed strategically smart.

The arguments were always the same:

• We need control.
• We need flexibility.
• Commercial solutions can't handle our requirements.
• This is part of our competitive advantage.

Then requirements exploded: Security, workflows, integrations, scalability, personalization, governance, analytics, multilingual support, etc.

Eventually many agencies realized they weren't building client solutions anymore. They were maintaining CMS products.

Today I hear very similar arguments around scraping infrastructure.

For companies whose moat lives in proprietary data products, trading signals, AI systems, enrichment models, or highly specialized extraction logic, owning parts of the stack absolutely makes sense.

But for everyone else, I wonder:

If your team spends most of its time dealing with proxies, anti-bot systems, browser breakage, rendering issues, parser maintenance, and infrastructure reliability...

• Are you building a competitive advantage?

• Or are you maintaining plumbing that specialized vendors can spread across thousands of customers?

Genuine question for the engineers here:

➤ What specific characteristics make you believe your scraping infrastructure is part of your moat rather than a necessary utility?

➤ Where do you draw that line?


r/thewebscrapingclub 5d ago

Need HELP on reverse-engineering a mobile app’s internal API endpoints (ethical, for personal use) – Certificate pinning, token extraction, and request replay

2 Upvotes

I’m trying to legally access data from a mobile app’s internal API for a personal project. The app fetches data from a third-party service, but this feature is only available on mobile; there’s no web equivalent or public API documentation. My goal is to reverse-engineer the app’s network calls and replicate its requests programmatically (e.g., via Python). If anyone knows how to do this, please help. DM me if you can and I'll share the full context.


r/thewebscrapingclub 5d ago

We launched ScrapeOps AI Scraper Generator today, built for production workflows, not demo videos

0 Upvotes

r/thewebscrapingclub 6d ago

Google scraping api worth paying for?

7 Upvotes

Hit a wall with Google scraping: blocks, captchas, inconsistent results. Looked into Google scraping API options, but I'm not sure if they actually solve the problem or just hide it. Is it worth it long term, or is it better to keep control and deal with it?


r/thewebscrapingclub 6d ago

Building a local-first scraper extension that uses on-device ML to fix broken selectors.

2 Upvotes

I am tired of maintaining web scrapers that break silently the moment a website updates its layout or changes its CSS classes.

To solve this, I started building VectorTrace. It is an open-source browser extension that lets you point and click on any webpage to define scraping fields. It uses local machine learning to detect and recover from layout changes automatically.

The Technical Mechanics

When you click an element to define a field, the extension generates a 384-dimensional semantic embedding of that element's text using the all-MiniLM-L6-v2 model. This runs entirely in your browser via Transformers.js.

The embedding vectors are stored directly in IndexedDB. This bypasses the strict 10MB chrome.storage.local limit. When you run the scraper after a site redesign and the primary CSS selector fails, the extension pulls visible text elements from the current page, creates temporary embeddings, and runs a cosine similarity calculation against your stored target vector. It then ranks the replacement candidates with High, Medium, or Low confidence labels.
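
The ranking step can be sketched in a few lines of Python. This is not VectorTrace's code (the extension runs in the browser via Transformers.js); the tiny 2-dimensional vectors stand in for the 384-dimensional embeddings, and the High/Medium thresholds are assumptions, since the post doesn't state the real cut-offs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def confidence_label(score, high=0.85, medium=0.6):
    # Thresholds are hypothetical; tune them against real recoveries.
    if score >= high:
        return "High"
    if score >= medium:
        return "Medium"
    return "Low"


def rank_candidates(target, candidates):
    """candidates: list of (text, vector). Returns [(text, score, label)], best match first."""
    scored = [(text, cosine_similarity(target, vec)) for text, vec in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(text, score, confidence_label(score)) for text, score in scored]


for row in rank_candidates([1.0, 0.0], [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]):
    print(row)
```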

No cloud processing. No API keys.

Defeating Manifest V3 Service Worker Suspension

Chrome extensions built on MV3 terminate service workers after short periods of inactivity, which breaks long-running WASM execution pipelines. To get around this restriction, VectorTrace runs the ONNX runtime inside a Chrome Offscreen Document instead of the main service worker. This keeps the execution environment stable.

Current Architecture (Day 1 Status)

I used WXT, React 19, and TypeScript to scaffold the project. Here is what is working in the repository right now:

  • Storage Layer: An IndexedDB persistence system for vectors alongside a chrome.storage wrapper that automatically strips embeddings before saving layout schemas.
  • Selector Engineering: A fallback generator that prioritizes IDs, data attributes, and nth-of-type patterns, capped with a 500 character limit guardrail for XPaths.
  • Analysis Engine: Full Offscreen Document routing, complete with cosine similarity scoring and candidate matching logic.
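
A rough Python sketch of the selector priority order described in the Selector Engineering item (ID first, then a data attribute, then nth-of-type). It is an illustration only, not the extension's TypeScript: the function signature is made up, and it applies the 500-character cap to CSS selectors for simplicity, whereas the post mentions the cap for XPaths.

```python
from typing import Optional

MAX_SELECTOR_LENGTH = 500  # cap taken from the post; applied here to CSS selectors as a simplification


def build_selector(tag: str,
                   element_id: Optional[str] = None,
                   data_attrs: Optional[dict] = None,
                   nth_of_type: Optional[int] = None,
                   parent_selector: str = "") -> Optional[str]:
    """Pick the most stable selector available; return None if it exceeds the length cap."""
    if element_id:
        # Note: IDs containing special characters would need CSS escaping in real use.
        selector = f"#{element_id}"
    elif data_attrs:
        name, value = next(iter(data_attrs.items()))
        selector = f'{tag}[{name}="{value}"]'
    elif nth_of_type is not None:
        prefix = f"{parent_selector} > " if parent_selector else ""
        selector = f"{prefix}{tag}:nth-of-type({nth_of_type})"
    else:
        selector = tag
    return selector if len(selector) <= MAX_SELECTOR_LENGTH else None


print(build_selector("li", nth_of_type=3, parent_selector="ul.items"))
```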

Looking for Technical Feedback

I want to make sure this utility addresses actual scraping pain points before building out the end-to-end automation engine.

  1. Does a text-embedding approach actually help with your workflows, or does it create new failure modes on highly volatile data fields (like stock prices or changing inventory counts)?
  2. Large web pages frequently contain over 3,000 distinct text nodes. What specific frontend filtering strategies would you use to prune the DOM tree before passing strings to the ONNX model?
  3. What capabilities do you want from a free, local-first scraping utility that paid cloud alternatives like Browse AI or Kadoa cannot provide due to their infrastructure limitations?

r/thewebscrapingclub 8d ago

Linkedin profile data costs $99/month apparently. Or $1 per 1000 if you scrape it.

17 Upvotes

So i was helping a friend set up lead gen for his startup last month and he was about to pay $99/month for sales navigator just to get basic profile info like name, job title, past experience, skills. that's literally it. thats all he needed.

Took me a second to realise that most of that stuff is just... publicly available on linkedin anyway? like you dont even need an account to see it.

so i ended up just building a scraper for him. took a while to get it working without getting blocked but eventually figured it out. no cookies, no login, nothing. just paste a linkedin url and get clean json back.

Before that i looked into the pricing of the available actors on apify. They were all priced pretty high.

So i ran some numbers on my scraper and it works out to about $1 per 1000 profiles on apify.

His sales navigator subscription wouldve been $99/month. for that same $99 he can now scrape like 99,000 profiles lol.

Obviously sales navigator does other stuff too like inmails, search filters, crm sync etc. Not saying its useless. But if ur main use case is just getting profile data at scale, feels kinda insane to pay $99/month for it

Anyway published the actor on apify if anyone wants to try it. Still pretty new so would genuinely appreciate feedback if anyone uses it.

What are you guys using for linkedin data rn? curious if theres better approaches im missing.


r/thewebscrapingclub 8d ago

Built a selector agent into our scraping IDE so you don't have to touch DevTools

5 Upvotes

https://reddit.com/link/1tmlrsp/video/nhj2m7hcz43h1/player

Been building Intuned — a Playwright-based browser automation platform. One thing that kept coming up was the annoying back-and-forth of inspecting elements just to get a selector.

Built /selector-agent into the IDE — you describe the element, it gives you the selector, ready to use in your script.

Video shows the before/after. Would love feedback from people who actually write scrapers.


r/thewebscrapingclub 10d ago

I got tired of my scraper wasting requests on burned proxies, so I made one that self-heals. 36% → 76% success on 550k real requests

3 Upvotes

If you've run scrapers across a pool of proxies, you know the pain: some proxies are fast, some are flaky, some are straight-up banned or dead — and it changes by the hour. Most rotation is just round-robin or random, which means your scraper happily keeps sending requests through proxies that got blocked 10 minutes ago. You end up babysitting it: checking logs, manually disabling bad IPs, tweaking lists.

So I built a proxy manager that does the obvious thing the rotator should've been doing all along: it watches how each proxy is actually performing and stops sending traffic to the ones that are failing right now — automatically, no manual list-pruning.

How it works, in plain terms:

  • It tracks success/failure per proxy, per target site (a proxy banned on site A might be fine on site B).
  • Recent results matter more than old ones, so a proxy that started failing 2 minutes ago gets avoided immediately — but it isn't blacklisted forever; the system keeps lightly testing it and brings it back the moment it recovers.
  • It still occasionally tries "worse" proxies on purpose, so it notices when things change instead of tunnel-visioning on a few favorites.

I didn't want to just claim this works, so I ran it for real: 549,114 requests over 7 days, 10 scrapers (e-commerce, news, public data), residential proxies. Success rates:

  • Smart selection: 76.0%
  • Round-robin: 36.3%
  • Random: 31.5%

Same proxies, same targets, and round-robin landed fewer than half as many successful requests as smart selection (36.3% vs 76.0%). In a nastier test where a third of the proxies were permanently dead, the smart one made 17 failed requests in 24h vs 10,663 for random, because the dumb rotators never stop knocking on dead doors.

(If you want the academic name, it's Thompson Sampling / a multi-armed bandit — it came out of my master's thesis. But you don't need to know any of that to use it. There's also a classic "exponential backoff" mode that's actually better if your main problem is rate-limiting rather than bans.)
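
For readers who do want to see the idea, here is a minimal Thompson Sampling sketch in Python. It is not ProxyOps's code: the class name, the decay factor, and the way recency is handled (multiplying old counts by a decay on each new report) are assumptions chosen to mirror the behaviour described above, with stats kept per site and per proxy.

```python
import random
from collections import defaultdict


class ThompsonProxySelector:
    def __init__(self, proxies, decay=0.98, seed=None):
        self.proxies = list(proxies)
        self.decay = decay  # hypothetical value; closer to 1.0 means a longer memory
        self.rng = random.Random(seed)
        # (site, proxy) -> [decayed successes, decayed failures]
        self.counts = defaultdict(lambda: [0.0, 0.0])

    def choose(self, site):
        """Draw from each proxy's Beta posterior for this site and pick the highest draw."""
        def draw(proxy):
            wins, losses = self.counts[(site, proxy)]
            return self.rng.betavariate(wins + 1.0, losses + 1.0)
        return max(self.proxies, key=draw)

    def report(self, site, proxy, ok):
        """Fade this pair's older evidence, then record the newest result."""
        counts = self.counts[(site, proxy)]
        counts[0] *= self.decay
        counts[1] *= self.decay
        counts[0 if ok else 1] += 1.0


selector = ThompsonProxySelector(["good", "dead"], seed=0)
picks = {"good": 0, "dead": 0}
for _ in range(500):
    proxy = selector.choose("example.com")
    picks[proxy] += 1
    selector.report("example.com", proxy, ok=(proxy == "good"))
print(picks)
```

Because each choice is a random draw rather than a fixed ranking, a failing proxy is still sampled now and then, which is what lets it come back once it recovers.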

It's a full tool, not just a script — ProxyOps, open source (MIT):

  • Add proxies from multiple providers, with expiry dates
  • Your scraper just calls POST /acquire to get a proxy and POST /release to report how it went — that's the whole integration
  • Group proxies and assign them to specific bots
  • Dashboards showing success rate per bot / provider / site, status codes, etc.
  • FastAPI + Vue + PostgreSQL, all Dockerized — docker compose up and it's running
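
The acquire/release integration could look roughly like the Python sketch below. Only the two endpoint paths come from the post; the JSON payload fields (`target`, `lease`, `success`), the response shape, and the function names are assumptions, so check the repo for the real contract before using anything like this.

```python
import json
from urllib.request import Request, urlopen


def post_json(url, payload, timeout=10.0):
    """POST a JSON body and decode the JSON response."""
    req = Request(url, data=json.dumps(payload).encode("utf-8"),
                  headers={"Content-Type": "application/json"}, method="POST")
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def scrape_with_proxy(base_url, target, do_request, post=post_json):
    """Acquire a proxy, run one request through it, and always report the outcome."""
    # Payload shapes are hypothetical; only /acquire and /release are named in the post.
    lease = post(f"{base_url}/acquire", {"target": target})
    ok = False
    try:
        ok = bool(do_request(lease))
    finally:
        # Report even when do_request raises, so failures still count against the proxy.
        post(f"{base_url}/release", {"lease": lease, "success": ok})
    return ok
```

The `finally` block is the design point: a crash inside the scraper is still reported as a failure, so the selector keeps learning.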

Repo: https://github.com/Paulo-H/proxyops


r/thewebscrapingclub 10d ago

b2b scraper ice vendors

4 Upvotes

Hey everyone! 👋

Building a B2B scraper to find all **mobile ice cream trucks & catering vendors** across all **401 districts in Germany**.

Need to extract **Emails**, **Phones**, and **Owner Names**. Facing 3 classic scraping bottlenecks:

1️⃣ **Discovery (Google Maps at scale)**: Need to query 401 regional keywords. To avoid CAPTCHAs/IP bans, should I use a cheap SERP API (like Serper.dev, SerpApi, or Apify) to pull Maps JSON directly, or is custom Playwright + residential proxies better?

2️⃣ **Social-Media-Only Leads (FB/IG Walls)**: Many small vendors only have Facebook/Instagram. Direct scraping is fragile. Anyone tried using Google Dorking via API (e.g. `site:facebook.com/handle "email"`) to extract emails from Google's index instead?

3️⃣ **Lead Qualification (Filtering static shops)**: Search queries return mostly static ice cream shops (Eisdielen) that don't do catering. I plan to use a fast LLM (Gemini 2.5 Flash) on scraped website HTML. Any smart local regex/heuristics to pre-filter before hitting the LLM?

Would love any tool recommendations or architectural advice! My biggest worry is not getting a useful number of contacts. Thanks! 🍦🚀


r/thewebscrapingclub 13d ago

Best Anti-Captcha Browser

github.com
76 Upvotes

r/thewebscrapingclub 13d ago

Tiktok is cooked

1 Upvotes

https://reddit.com/link/1thuncx/video/cu3t7o52u42h1/player

Have you ever bypassed TikTok that fast?
DM for more info....


r/thewebscrapingclub 14d ago

If you've ever cried at 2am because Cloudflare ate your scraper, this post is for you

9 Upvotes

Hey r/thewebscrapingclub,

I'm a solutions engineer at Intuned. We build a platform for running browser automations and scrapers in production — Playwright-based, with the infra stuff (proxies, captcha handling, retries, scheduling, storage) handled for you so you can focus on the actual scraping logic.

We're opening up free access and I'd genuinely like feedback from people who do this work day-to-day. Specifically curious what you think about:

- The dev experience vs. rolling your own Playwright + proxy stack

- How it compares to Apify / Browserless / Browse AI for your use cases

- What's missing that would make you actually switch

Not looking for fake praise — if it sucks for your workflow, I want to know why. I spend my days helping customers scrape stuff like government procurement portals, so I've seen what breaks in the real world.

Link in comments to avoid the spam filter. Happy to answer questions about the internals (anti-bot stuff, captcha pipelines, fingerprinting) — that's the part I find most interesting anyway.

Happy to chat in DMs too.


r/thewebscrapingclub 15d ago

I open sourced Cull: an image & prompt web scraping pipeline with local / cloud classification

6 Upvotes

I open-sourced a tool I built and maintain called Cull.
It’s a machine curation engine for AI image datasets, the kind of work that eats hours every time you want to train a LoRA, build a reference library, or just sort an archive so it isn’t a 100,000-file mess.

What it does, end to end

  • Scrapes from Civitai (.com and .red), X/Twitter, Reddit, Discord, plus any URL gallery-dl supports (Pixiv, DeviantArt, the booru family, ArtStation, Tumblr, FurAffinity / e621, Imgur, Flickr, and ~340 others).
  • Drops every image plus its source-side prompt into a local queue. Per-source dedup, no database.
  • Classifies each image with a vision-language model (multiple LM Studio instances for local, Groq for cloud, or anything OpenAI-compatible) using a strict 17-field JSON schema, so you don’t get free-text replies you have to regex into shape.
  • Sorts the keepers into category folders next to their .txt prompt and a .vision.json audit record. Two score gates (overall quality + topic relevance) you tune in the UI.
  • Surfaces everything through a Flask + Alpine dashboard: start/stop, source toggles, gallery, prompt editor, ZIP export, per-source stats.
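
The two score gates and the category sort can be pictured with the short Python sketch below. It is not Cull's code: the record keys (`quality`, `relevance`, `category`, `filename`), the thresholds, and the folder layout are hypothetical stand-ins for whatever the real 17-field schema contains.

```python
from pathlib import PurePosixPath


def passes_gates(record, min_quality=0.7, min_relevance=0.6):
    """A keeper must clear both gates: overall quality and topic relevance."""
    return record["quality"] >= min_quality and record["relevance"] >= min_relevance


def destination(record, root="sorted"):
    """Keepers go to root/<category>/<filename>; rejected images return None."""
    if not passes_gates(record):
        return None
    return str(PurePosixPath(root) / record["category"] / record["filename"])


keeper = {"quality": 0.9, "relevance": 0.8, "category": "portrait", "filename": "a.png"}
print(destination(keeper))
```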

Two example use cases I actually used it for:

  • LoRA (300 images) & Finetune (100,000 images) dataset prep.
    • Give it a topic such as Female Influencer or {artist} style art
    • Set AUTO_CAPTION_ENABLED=true if you want it to caption images, or false if you only want it to scrape images (it still stores any prompts found in the posts it scraped), and set whatever style prompting you want.
    • Walk away.
    • Come back to a folder of triaged images split by quality and category, each with a generated SD-prompt .txt next to it.
    • ZIP-export the filtered view straight into your trainer.
  • Ingesting a prompt-less archive. Point LOCAL_IMPORT_DIR at a folder of bare JPEGs (or paste a gallery-dl URL list).
    • Toggle off the prompt requirement, turn on auto-captioning.
    • Every image is classified, sorted, and given an SD-prompt / booru-tags / natural-language caption written by the same vision call that classifies it.
    • So you can train on a years-old archive without curating prompts by hand.

Links

Repo: https://github.com/tlennon-ie/cull
Screenshots: https://imgur.com/a/kSvsAW9

The roadmap will keep evolving around what people actually use it for. On my list:
- More vision-worker backends
- An improved requeue UI
- A small headless CLI
- Video scraping, classification, etc.


r/thewebscrapingclub 18d ago

Just open-sourced my personal scraping engine: tiny self-contained binary with Lua scripting

23 Upvotes

I originally built it for myself because I wanted something extremely lightweight that runs in the background like it never existed. It's called SpyWeb.

It's designed to be "set and forget." I've had it running for months on my PC tracking job boards without a single crash or memory leak.

Specific features:

  • Zero Runtime: Self-contained ~7MB binary. No Python, Node, or Docker needed.
  • Low Footprint: Uses <5MB RAM at idle.
  • Lua Scripting: Use Lua to handle complex logic like custom headers, JS rendering, advanced monitoring, etc.
  • Hot Reloading: Change a config or Lua script and the job respawns instantly, no restarts.
  • Web Dashboard: Simple local UI to monitor scrape data in real-time.
  • Desktop Alerts: Built-in support for system notifications and webhooks.
  • Embedded DB: Built-in KV store so you don't need a separate database.
  • CDP Support: Controls any Chromium or CDP-compatible browser via Lua for JS-heavy sites.
  • Dual Mode: CLI for servers and a System Tray version for silent background runs.
  • Deduplication: Internal database ensures you never see the same result twice.
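
For a sense of how "never see the same result twice" can work, here is a minimal Python sketch that hashes each result and records the hash in a small embedded database. SpyWeb's internals aren't described in the post, so this is a generic illustration using SQLite, not its actual KV store; the class and table names are made up.

```python
import hashlib
import json
import sqlite3


class SeenStore:
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (digest TEXT PRIMARY KEY)")

    def is_new(self, result: dict) -> bool:
        """Record the result's hash; return True only the first time it is seen."""
        # Sorting keys makes the hash independent of field order.
        canonical = json.dumps(result, sort_keys=True, ensure_ascii=False)
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        cursor = self.db.execute(
            "INSERT OR IGNORE INTO seen (digest) VALUES (?)", (digest,))
        self.db.commit()
        # rowcount is 1 when the row was inserted and 0 when it already existed.
        return cursor.rowcount == 1


store = SeenStore()
job = {"title": "Backend Engineer", "url": "https://example.com/jobs/1"}
print(store.is_new(job), store.is_new(job))
```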

I just released the beta with CDP integration. If you need something that just sits in the background and sips resources while actually being maintainable, check it out.

Setup is easy and straightforward: for server-side rendered pages, it's just a few lines of config (URL, selectors, fields). For JS-heavy sites, you can write a little Lua to launch a browser and drive the workflow.

You can check it out here: https://github.com/spyweb-app/spyweb


r/thewebscrapingclub 18d ago

I built a Web-Scraper API that is 6-7x more efficient than current ones

5 Upvotes

Runo is a web-scraping API that returns typed, structured JSON. You define a schema (field name, type, example value), and Runo fetches the page and returns the data. No HTML, no parsers, no post-processing.

Over the past few weeks, I have been building this nonstop. Currently, every scraper API out there solves the site-fetching problem but leaves the extraction of the actual data entirely to users. Runo makes that problem disappear.

For Runo, I added JS rendering, stealth mode, and full LLM extraction so that it can scrape most, if not all, sites.

Also, another major problem with current web scrapers is that they charge per feature or bundle them into expensive credit tiers. A single large or JS rendered request can cost 5-75 credits, which means you essentially get nothing out of their plans. Runo is flat per request, no matter the site. At the Scale tier, Runo works out to $0.90 per 1,000 effective requests vs. around $6 for the nearest Firecrawl equivalent. My jaw dropped when I was testing Runo and came across these numbers.

You can check it out here. I created a free tier with 500 requests/month, no credit card required. Take it for a spin and let me know what can be improved. I would love feedback.


r/thewebscrapingclub 18d ago

What is your opinion on AI agents for web scraping?

3 Upvotes