Scraping a school portal with no API: dropping the browser for raw HTTP
My first version used a real browser. A headless Chromium, driven by Playwright, logging in and clicking through the school portal exactly like a human would. It worked, it was about two hundred megabytes of dependency, and it took the better part of a minute to answer a question I needed answered every morning from a cron job. I deleted it within a week.
This is the second part of a series about building a small school-assignment assistant for one child. The portal in question is a school ERP with no public API: the only way in is the same login form a parent uses. The job of this layer is unglamorous and load-bearing. Get the classwork out of a website that was never meant to be read by anything except a browser, reliably enough to run unattended before breakfast. This part is about why the obvious tool was the wrong one, and what replaced it.
The browser is a liability when nobody is watching
A headless browser is the right call when a site fights back: heavy JavaScript rendering, dynamic content, bot defenses that key on whether you execute their scripts. It is the wrong call when none of that is true and the thing has to run on a schedule on a small box. Playwright wants a real Chromium on disk. It wants memory. It is slow to start, and when it fails it fails in ways that are annoying to debug at 6am from a log line. For a cron job whose entire purpose is to be boring and dependable, every one of those is a cost with no matching benefit.
The portal, it turned out, did not fight back at all. It was a server-rendered site from an older school of web design. Log in, get a session cookie, request a page, get HTML with the data already in it. No JavaScript required to see the content. Once I confirmed that with a couple of curl calls, the browser had nothing left to justify it.
What replaced it is almost embarrassingly plain
The whole fetcher is built on Node's standard https module and a cookie jar that is just an object. Logging in is a form POST; I read the Set-Cookie headers off the response and keep the session token.
const querystring = require("node:querystring");

async function login() {
  // Form-encode the credentials the same way the browser's login form does.
  const body = querystring.stringify({ username: USERNAME, password: [REDACTED:secret] });
  const res = await httpRequest({
    hostname: PORTAL_HOST,
    path: "/login",
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
  }, body);
  const jar = {};
  for (const header of setCookiesFrom(res)) {
    // Keep only the leading name=value pair; attributes (Path, HttpOnly, ...) follow the first ";".
    const pair = header.split(";")[0];
    // Split on the first "=" only, so values that themselves contain "=" survive intact.
    const eq = pair.indexOf("=");
    if (eq < 1) continue;
    jar[pair.slice(0, eq).trim()] = pair.slice(eq + 1).trim();
  }
  if (!jar["session"]) throw new Error(`Login failed (status ${res.statusCode})`);
  return jar;
}
The one piece of real machinery is following redirects by hand, because the portal bounces you through a couple of 302s after login and the standard library will not chase them for you. So httpGet loops up to a fixed number of hops, accumulating cookies as it goes, until it lands on a non-redirect response. That is the entire networking story. There is no framework. There is no retry-with-backoff cleverness. If a request fails, the run fails, and the next morning's run tries again, which for a once-a-day job is a perfectly good recovery strategy.
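For readers who want to see the shape of that loop, here is a minimal sketch. It is an illustration under assumptions, not the fetcher's actual code: the httpRequest wrapper, the absorbCookies and cookieHeader helpers, the maxHops default of 5, and the injectable request option (there only so the loop can be exercised without a network) are all hypothetical names and choices.

```javascript
const https = require("node:https");

// Hypothetical promise wrapper around https.request; buffers the whole body.
function httpRequest(options, body) {
  return new Promise((resolve, reject) => {
    const req = https.request(options, (res) => {
      const chunks = [];
      res.on("data", (chunk) => chunks.push(chunk));
      res.on("end", () => resolve({
        statusCode: res.statusCode,
        headers: res.headers,
        body: Buffer.concat(chunks).toString("utf8"),
      }));
    });
    req.on("error", reject);
    if (body) req.write(body);
    req.end();
  });
}

// Fold a response's Set-Cookie headers into the jar (a plain object).
function absorbCookies(jar, res) {
  for (const header of res.headers["set-cookie"] || []) {
    const pair = header.split(";")[0];
    const eq = pair.indexOf("=");
    if (eq > 0) jar[pair.slice(0, eq).trim()] = pair.slice(eq + 1).trim();
  }
  return jar;
}

// Serialize the jar into a Cookie request header value.
function cookieHeader(jar) {
  return Object.entries(jar).map(([name, value]) => `${name}=${value}`).join("; ");
}

// GET a URL, following redirects by hand and collecting cookies on every hop.
async function httpGet(startUrl, jar, { maxHops = 5, request = httpRequest } = {}) {
  let url = new URL(startUrl);
  for (let hop = 0; hop <= maxHops; hop++) {
    const cookies = cookieHeader(jar);
    const res = await request({
      hostname: url.hostname,
      port: url.port || undefined, // empty string means the default port
      path: url.pathname + url.search,
      method: "GET",
      headers: cookies ? { Cookie: cookies } : {},
    });
    absorbCookies(jar, res);
    const isRedirect = [301, 302, 303, 307, 308].includes(res.statusCode);
    if (!isRedirect || !res.headers.location) return res;
    // Location may be relative; resolve it against the URL just requested.
    url = new URL(res.headers.location, url);
  }
  throw new Error(`Too many redirects (more than ${maxHops})`);
}
```

The hop limit is what keeps a misconfigured redirect loop from hanging the cron job: the run fails with an error instead, which fits the fail-today, retry-tomorrow recovery strategy.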
Parsing with regex, on purpose
Here is the choice that will make some readers wince: I parse the HTML with regular expressions. Not a DOM library, not a real parser. A handful of patterns that pull table rows, then cells, then strip tags out of the cell text.
const rowRe = /<tr[^>]*class="[^"]*assignment-row[^"]*"[^>]*>([\s\S]*?)<\/tr>/gi;
The standard advice is correct in general and wrong here. You should not parse arbitrary HTML with regex, because HTML is not a regular language and the web is adversarially messy. But I am not parsing arbitrary HTML. I am parsing one table, on one page, on one portal, whose markup has a stable shape I can read with my own eyes. A real parser would add a dependency and an abstraction to handle a generality I do not have. The regex is shorter, has no dependencies, and when the portal changes its markup, the regex breaks loudly and I fix it in five minutes. A DOM parser would not have saved me from a markup change anyway; it would just have failed differently.
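To make the rows-then-cells-then-strip-tags pipeline concrete, here is a minimal sketch built around the row pattern above. Everything other than rowRe is an assumption for illustration: the cellRe pattern, the cellText and parseAssignments names, the two HTML entities handled, and the decision to return each row as a bare array of cell strings rather than guess at the portal's column names.

```javascript
// Row pattern from the article; the cell pattern is an assumed companion.
const rowRe = /<tr[^>]*class="[^"]*assignment-row[^"]*"[^>]*>([\s\S]*?)<\/tr>/gi;
const cellRe = /<td[^>]*>([\s\S]*?)<\/td>/gi;

// Strip tags from a cell, decode the two entities this sketch assumes, tidy whitespace.
function cellText(html) {
  return html
    .replace(/<[^>]+>/g, " ")   // tags become spaces so words on either side stay apart
    .replace(/&nbsp;/g, " ")
    .replace(/&amp;/g, "&")
    .replace(/\s+/g, " ")
    .trim();
}

// Return one array of cell strings per assignment row found in the page.
function parseAssignments(html) {
  const rows = [];
  for (const row of html.matchAll(rowRe)) {
    const cells = [...row[1].matchAll(cellRe)].map((m) => cellText(m[1]));
    if (cells.length > 0) rows.push(cells);
  }
  return rows;
}
```

Using matchAll rather than repeated exec calls sidesteps the shared lastIndex state that global regexes carry, so the same patterns can be reused across pages without resetting anything.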
The honest cost is that this is coupled to the portal's current HTML, tightly. I accepted that in Part 1 as the whole bargain of building for one user. This is where the bill comes due, and it is a small bill.
The date hack I am not proud of but would write again
The portal renders due dates as human strings like "Thursday, 5 March 2026". I need to match a target date against that. The correct approach is to parse the string into a real date and compare. The approach I shipped checks whether the day number, the month name, and the year all appear in the string:
function isDateMatch(dateString, target) {
  const d = new Date(target);
  return new RegExp(`\\b${d.getDate()}\\b`).test(dateString) &&
    dateString.includes(d.toLocaleString("default", { month: "long" })) &&
    dateString.includes(String(d.getFullYear()));
}
This is not date parsing. It is three substring checks wearing a trench coat. It would happily mismatch in a locale with a different month name, and it leans on a word boundary so that "5" does not match "15". But it handles every format the portal actually emits, it never throws on a string shape I did not anticipate, and a real date parser would have given me new failure modes (timezone drift, ambiguous formats) in exchange for correctness I do not need. For one portal in one locale, the dumb check is the robust one.
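A short demonstration of how the check behaves at its edges. This is a sketch with two assumptions that are not in the shipped version: the locale is pinned to "en-GB" through a hypothetical third parameter, so the month name does not depend on the machine's settings, and the target is built with the local-time Date constructor so the day number cannot shift across a timezone boundary.

```javascript
// Same three checks as above, with the locale made explicit (an assumption of this sketch).
function isDateMatch(dateString, target, locale = "en-GB") {
  const d = new Date(target);
  const month = d.toLocaleString(locale, { month: "long" });
  return new RegExp(`\\b${d.getDate()}\\b`).test(dateString) &&
    dateString.includes(month) &&
    dateString.includes(String(d.getFullYear()));
}

// Local-time 5 March 2026 (months are zero-based in the Date constructor).
const target = new Date(2026, 2, 5);

console.log(isDateMatch("Thursday, 5 March 2026", target)); // true
console.log(isDateMatch("Sunday, 15 March 2026", target));  // false: \b5\b does not match inside "15"
console.log(isDateMatch("Thursday, 5 April 2026", target)); // false: month name differs
console.log(isDateMatch("Due 5th March 2026", target));     // false: no word boundary between "5" and "th"
```

The last case is the kind of markup change that would break the check: an ordinal suffix defeats the word boundary, and the function quietly returns false instead of throwing.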
What I would tell my past self
The instinct to reach for the powerful tool, the headless browser, the proper parser, the date library, is the instinct to handle problems you do not have yet. Sometimes that is foresight. For a tool with one user and a known, narrow target, it is mostly weight. The version that survived is the one that does the least: a standard-library HTTP client, an object for cookies, a few regexes, and three substring checks for dates. It is brittle in exactly the places I can see, and it runs in about a second from a cron job that I have not had to think about in months.
The next part is where the data stops being the hard part and the judgment starts: handing these scraped notes to a language model and asking it to throw almost all of them away.
