B Corp Directory: searching for good merchants with AI
I've always been drawn to the problem of product discovery. Over the years I've taken several swings at it: a tool called Essentials that let you photograph your EDC and tag the products, a ChatGPT skill that searched Wirecutter for recommendations, and PingRoll at AdRoll, which surfaced suggestions from Shopify feeds.
None of them cracked it. But each attempt sharpened the question I was actually trying to answer: how do I find good merchants?
That's the angle I took with B Corp Directory. Every certified B Corp has a charter for good — a verified commitment to social and environmental standards. What if I could crawl every single one, figure out which ones sell products, and then let people search across all of them using natural language?
That's what I built.
How it works
The project has four stages.
First, a Scrapy spider with Playwright for JS rendering visits every company profile on the official B Corp site — roughly 10,000 companies. It extracts name, B Impact Score, location, industry, certification date, website, and description. With concurrent requests capped at 4 and a 1-second delay for polite crawling, the full scrape takes about two hours.
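The politeness constraints above map directly onto Scrapy settings. A sketch of what that configuration might look like with the scrapy-playwright plugin (the handler and reactor paths come from that plugin's documentation; treating this as the project's actual config would be an assumption):

```python
# Hypothetical Scrapy settings for a polite, JS-rendering crawl.
# scrapy-playwright routes downloads through Playwright via custom handlers.
SETTINGS = {
    "CONCURRENT_REQUESTS": 4,      # cap in-flight requests, as described above
    "DOWNLOAD_DELAY": 1.0,         # 1-second delay between requests
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    # scrapy-playwright requires the asyncio-based Twisted reactor.
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}
```

With 4 concurrent requests and a 1-second delay, ~10,000 JS-rendered pages landing in about two hours is roughly what you'd expect.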
Second, a script reads the scraped JSON, deduplicates by URL, and batch-inserts everything into a PostgreSQL companies table.
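The dedupe-then-insert step can be sketched in a few lines. The field and column names here are assumptions about the schema, not the project's actual code; the `ON CONFLICT` clause is one common way to make the batch insert idempotent:

```python
def dedupe_by_url(records):
    """Keep the first record seen for each website URL; drop records without one."""
    seen, out = set(), []
    for rec in records:
        url = rec.get("website")
        if url and url not in seen:
            seen.add(url)
            out.append(rec)
    return out

# Hypothetical insert statement; column names are assumptions.
# A unique constraint on website makes re-running the loader safe.
INSERT_SQL = """
INSERT INTO companies
    (name, b_impact_score, location, industry, certified_on, website, description)
VALUES (%s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (website) DO NOTHING
"""
```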
Third, each company's website gets checked for a /products.json endpoint — a simple, reliable way to detect Shopify stores. For every store found, the crawler paginates through all products (up to 5,000 per store), pulling titles, descriptions, images, prices, and metadata. This surfaced around 85,000 products from the B Corps running Shopify storefronts.
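The detection-and-pagination loop is simple enough to sketch in full. Shopify's public /products.json endpoint accepts `limit` (max 250) and `page` parameters; the 5,000-per-store cap is from the description above, and everything else is an illustrative assumption:

```python
import json
import urllib.request

PAGE_SIZE = 250       # Shopify's maximum page size for /products.json
MAX_PRODUCTS = 5000   # per-store cap, as described above

def products_url(base, page):
    """Build the paginated /products.json URL for a store."""
    return f"{base.rstrip('/')}/products.json?limit={PAGE_SIZE}&page={page}"

def fetch_products(base):
    """Yield products until an empty page, an error, or the cap is reached.

    A non-JSON or failing response on page 1 doubles as the Shopify check:
    non-Shopify sites simply yield nothing.
    """
    fetched, page = 0, 1
    while fetched < MAX_PRODUCTS:
        try:
            with urllib.request.urlopen(products_url(base, page), timeout=10) as resp:
                batch = json.load(resp).get("products", [])
        except (OSError, ValueError):
            break
        if not batch:
            break
        for product in batch:
            yield product
            fetched += 1
            if fetched >= MAX_PRODUCTS:
                break
        page += 1
```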
Finally, every product and company gets a 384-dimensional vector embedding generated locally using all-MiniLM-L6-v2 via transformers.js. These vectors are stored in PostgreSQL using the pgvector extension, which enables semantic search with cosine similarity.
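The embedding generation itself runs in Node via transformers.js, but the storage side is plain SQL. A sketch of the pgvector column and the input-literal format it expects — table and column names are assumptions about the schema:

```python
# Assumed schema: a 384-dim vector column matching all-MiniLM-L6-v2's output.
DDL = "ALTER TABLE products ADD COLUMN embedding vector(384);"

# Parameterised update; the literal is cast to the vector type server-side.
UPDATE_SQL = "UPDATE products SET embedding = %s::vector WHERE id = %s"

def to_pgvector(vec):
    """Format a float list as a pgvector input literal, e.g. '[0.1,0.2,...]'."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"
```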
The result: you can search for something like "coffee beans from sustainable companies" and get meaningful results ranked by relevance, with each company's B Impact Score visible alongside.
The web app
The frontend is deliberately simple — a single-file Express server with embedded HTML templates. Three pages: a homepage with AI-powered semantic search, example queries, and a grid of random featured products; a directory page with a paginated table of all B Corps; and a company detail page with metadata and a product grid.
Search works in two layers: semantic search using pgvector cosine similarity as the primary path, with PostgreSQL full-text search (tsvector ranking) as a fallback.
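The two layers can be sketched as a pair of queries. pgvector's `<=>` operator is cosine distance, so ordering by it ascending ranks by similarity; the tsvector query is standard PostgreSQL full-text search. Table, column, and join names are assumptions:

```python
# Primary path: semantic search over product embeddings (cosine distance).
SEMANTIC_SQL = """
SELECT p.id, p.title, c.name, c.b_impact_score
FROM products p
JOIN companies c ON c.id = p.company_id
ORDER BY p.embedding <=> %s::vector
LIMIT 20
"""

# Fallback: PostgreSQL full-text search ranked with ts_rank.
FALLBACK_SQL = """
SELECT p.id, p.title, c.name, c.b_impact_score,
       ts_rank(to_tsvector('english', p.title || ' ' || p.description),
               plainto_tsquery('english', %s)) AS rank
FROM products p
JOIN companies c ON c.id = p.company_id
WHERE to_tsvector('english', p.title || ' ' || p.description)
      @@ plainto_tsquery('english', %s)
ORDER BY rank DESC
LIMIT 20
"""
```

The app would embed the query string first and try `SEMANTIC_SQL`, falling back to `FALLBACK_SQL` when no embedding is available.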
Tech decisions
A few choices worth calling out.
Embeddings are computed locally with transformers.js — no external AI APIs. The model (~90MB) is pre-downloaded at Docker build time so there's no cold-start penalty and no ongoing API costs. Total AI spend: zero.
All routes, templates, and database logic live in one index.js. No framework overhead, easy to reason about. It's about 1,000 lines — large for a single file, but small for an entire application.
Limiting product discovery to Shopify stores (via the /products.json trick) was a pragmatic constraint for the MVP. It won't catch every B Corp that sells things, but it catches enough to be useful and keeps the crawler straightforward.
PostgreSQL handles everything: relational data, full-text search, and vector search all in one database. pgvector made this possible without bolting on a separate vector DB.
The stack
| Layer | Technology |
|---|---|
| Backend | Node.js, Express |
| Database | PostgreSQL 16, pgvector |
| Embeddings | transformers.js, all-MiniLM-L6-v2 |
| Scraper | Python, Scrapy, Playwright |
| Infrastructure | Docker Compose (local), Fly.io (production) |
What I learnt
I doubt this project will become a product, but that wasn't really the point. I got to build another crawler (Scrapy remains a favourite), understand embeddings and vector databases in proper detail — not just conceptually, but the mechanics of generating, storing, and querying them — use my starter repo in anger again, and ship something end-to-end from scraper to deployment.
One thing I didn't expect: because every B Corp has a score from the certifying authority, you can search for a product category and immediately see which companies are the most committed to their social charter. That feels like it could be useful to someone, even if that someone is just me.