DeepakNess

Best web scraping workflow using Codex

I needed to scrape tens of thousands of rows of data from a website with strict rate limits. Earlier, I would have used a tool like Octoparse for this, but this time I asked OpenAI's Codex to do it.

I asked the agent to scrape the website with the required data points by sending it a simple one-line prompt:

scrape all pages from XYZ website with all important data points, start this as a /goal and don't stop until all required and important information is scraped.

Web scraping using Codex

Codex started the process as a goal that ran for more than 13 hours and 46 minutes and used only 757k tokens. Token usage stayed that low because the AI itself wasn't scraping the website. Instead, Codex generated and executed the scraper, monitored failures, patched the code, handled 404/410 cases, adjusted for rate limits, added fallback logic, exported the dataset, and ran strict validation. The loop looked like this:
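The generated scraper itself isn't shown here, so the following is only a minimal sketch of how handling 404/410 pages and backing off under rate limits might look. The function name `fetch_with_backoff`, the status groups, and the retry limits are all assumptions for illustration, not what Codex actually wrote.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical status groups; the real scraper's rules are not shown in the post.
GONE = {404, 410}                      # page no longer exists: skip it
RETRYABLE = {429, 500, 502, 503, 504}  # rate limited or temporary failure


def fetch_with_backoff(
    url: str,
    get: Callable[[str], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the page body, or None if the page is gone (404/410).

    `get` is any function returning (status_code, body). It is passed in
    so the retry logic can be exercised without touching the network.
    """
    for attempt in range(max_attempts):
        status, body = get(url)
        if status == 200:
            return body
        if status in GONE:
            return None
        if status in RETRYABLE:
            # Exponential backoff: 1s, 2s, 4s, ... between attempts.
            sleep(base_delay * 2 ** attempt)
            continue
        raise RuntimeError(f"unexpected status {status} for {url}")
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")
```

In a real run, `get` would wrap whatever HTTP client the scraper uses. Returning None for 404/410 lets the caller record the page as missing instead of retrying it forever.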

write code → run it → inspect errors → patch code → resume → validate → repeat
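The "resume" step in that loop only works if the scraper remembers what it has already done. Here is a rough sketch of one way to do that, assuming results are appended to a JSON Lines file. The names `scrape_resumable` and `scrape_one` and the file layout are hypothetical and not taken from the actual script.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def scrape_resumable(
    urls: Iterable[str],
    scrape_one: Callable[[str], dict],
    out_path: Path,
) -> int:
    """Append one JSON line per URL, skipping URLs already in out_path.

    Returns the number of rows written during this run.
    """
    done = set()
    if out_path.exists():
        with out_path.open(encoding="utf-8") as f:
            done = {json.loads(line)["url"] for line in f if line.strip()}

    written = 0
    with out_path.open("a", encoding="utf-8") as f:
        for url in urls:
            if url in done:
                continue  # finished in an earlier run
            row = {"url": url, **scrape_one(url)}
            f.write(json.dumps(row) + "\n")
            f.flush()  # push each row out so a crash loses little progress
            written += 1
    return written
```

With this shape, a crash after thousands of pages costs only the page that was in flight, and rerunning the same command picks up where the last run stopped.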

Next time I need to refresh the scraped data from the website, I only have to run python scraper.py, and it should run without errors because Codex has already debugged the script.
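Even with a working script, it's worth checking a refreshed dataset before trusting it. What the strict validation actually checked isn't listed here, so this is just a small sketch under assumed rules: every row needs non-empty required fields, and no URL may appear twice. `REQUIRED_FIELDS` and `validate_rows` are made-up names for illustration.

```python
from typing import Iterable, List

REQUIRED_FIELDS = ("url", "title")  # hypothetical; use the real data points


def validate_rows(rows: Iterable[dict]) -> List[str]:
    """Return a list of readable problems; an empty list means the data passed."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        for field in REQUIRED_FIELDS:
            # Treat missing keys, None, and whitespace-only values as missing.
            if not str(row.get(field) or "").strip():
                problems.append(f"row {i}: missing {field}")
        url = row.get("url")
        if url:
            if url in seen:
                problems.append(f"row {i}: duplicate url {url}")
            seen.add(url)
    return problems
```

Returning a list of problems instead of raising on the first one makes it easy to see everything that went wrong in a single pass.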

Cool, right?
