Scraping a school portal with no API: dropping the browser for raw HTTP

June 20, 2026 · 7 min read

My first version took about a minute to run and weighed two hundred megabytes. A headless Chromium, driven by Playwright, logging into the school portal and clicking through pages exactly like a human would. It worked. I deleted it within a week.

This is the second part of a series about building a small school-assignment assistant for one child. The portal in question is a school ERP with no public API: the only way in is the same login form a parent uses. The job of this layer is unglamorous and load-bearing. Get the classwork out of a website that was never meant to be read by anything except a browser, reliably enough to run unattended before breakfast. Scraping a school portal over raw HTTP turned out to be the right approach once I confirmed the site was server-rendered. This part is about why the obvious tool was the wrong one, and what replaced it.

TL;DR

Scraping a school portal with no API is straightforward once you confirm the site is server-rendered: a headless browser is unnecessary overhead. This post replaces a Playwright/Chromium setup with Node's built-in https module, a plain cookie jar object, manual redirect-following, and regex-based HTML parsing. The result runs in about one second from a cron job, has no third-party dependencies, and has required no maintenance for months. For a narrow, single-user target, the dumb tool is the robust one.

The browser is a liability when nobody is watching

A headless browser is the right call when a site fights back: heavy JavaScript rendering, dynamic content, bot defenses that key on whether you execute their scripts. It is the wrong call when none of that is true and the thing has to run on a schedule on a small box. Playwright wants a real Chromium on disk. It wants memory. It is slow to start, and when it fails it fails in ways that are annoying to debug at 6am from a log line. For a cron job whose entire purpose is to be boring and dependable, every one of those is a cost with no matching benefit.

The portal, it turned out, did not fight back at all. It was a server-rendered site from an older school of web design. Log in, get a session cookie, request a page, get HTML with the data already in it. No JavaScript required to see the content. Once I confirmed that with a couple of curl calls, the browser had nothing left to justify it.
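The same check can be scripted once the raw response is in hand. A minimal sketch, assuming the page body has been saved to a string and you know a piece of text that should appear on the rendered page; `looksServerRendered`, `html`, `needle`, and the sample markup are illustrative names and data, not anything from the portal:

```javascript
// Sketch: is the data already in the raw HTML, with no JavaScript executed?
// `html` is a response body (from curl or https.get); `needle` is text you
// expect to see on the rendered page. Both are illustrative inputs.
function looksServerRendered(html, needle) {
  // Drop script blocks so data that only lives inside JavaScript does not count.
  const withoutScripts = html.replace(/<script\b[\s\S]*?<\/script>/gi, "");
  return withoutScripts.includes(needle);
}

const sample = "<html><body><table><tr><td>Maths worksheet</td></tr></table></body></html>";
console.log(looksServerRendered(sample, "Maths worksheet")); // true
```

If this returns false for text you can see in a real browser, the page is probably assembling its content client-side and raw HTTP will not be enough on its own.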

What replaced it: scraping a school portal over raw HTTP

The whole fetcher is built on Node's standard https module and a cookie jar that is just an object. Logging in is a form POST; I read the Set-Cookie headers off the response and keep the session token.

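const querystring = require("node:querystring");

// Log in with a form POST and return the cookie jar as a plain { name: value } object.
// httpRequest and setCookiesFrom are helper functions from the same fetcher.
async function login() {
  const body = querystring.stringify({ username: USERNAME, password: PASSWORD });
  const res = await httpRequest({
    hostname: PORTAL_HOST,
    path: "/login",
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
  }, body);

  const jar = {};
  for (const header of setCookiesFrom(res)) {
    // Keep only the leading "name=value" pair and split on the first "=",
    // so values that themselves contain "=" (base64, for example) survive intact.
    const pair = header.split(";")[0];
    const eq = pair.indexOf("=");
    if (eq < 1) continue; // skip malformed headers with no name
    jar[pair.slice(0, eq).trim()] = pair.slice(eq + 1).trim();
  }
  if (!jar["session"]) throw new Error(`Login failed (status ${res.statusCode})`);
  return jar;
}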
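The two helpers that login() leans on, httpRequest and setCookiesFrom, could look roughly like this. It is a sketch under assumptions: the resolved shape `{ statusCode, headers, body }` is a guess that fits how login() uses the result, and the real helpers may differ.

```javascript
const https = require("node:https");

// Promise wrapper around https.request. Resolves with the status code,
// the response headers, and the body decoded as a UTF-8 string.
function httpRequest(options, body) {
  return new Promise((resolve, reject) => {
    const req = https.request(options, (res) => {
      const chunks = [];
      res.on("data", (chunk) => chunks.push(chunk));
      res.on("end", () => resolve({
        statusCode: res.statusCode,
        headers: res.headers,
        body: Buffer.concat(chunks).toString("utf8"),
      }));
      res.on("error", reject);
    });
    req.on("error", reject);
    if (body) req.write(body); // form body for POST requests
    req.end();
  });
}

// Node exposes Set-Cookie as an array of strings, or leaves it undefined
// when the response sets no cookies.
function setCookiesFrom(res) {
  return res.headers["set-cookie"] || [];
}
```

Nothing here retries or times out, which matches the "if a request fails, the run fails" stance of the rest of the fetcher.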

The one piece of real machinery is following redirects by hand, because the portal bounces you through a couple of 302s after login and the standard library will not chase them for you. So httpGet loops up to a fixed number of hops, accumulating cookies as it goes, until it lands on a non-redirect response. That is the entire networking story. There is no framework. There is no retry-with-backoff cleverness. If a request fails, the run fails, and the next morning's run tries again, which for a once-a-day job is a perfectly good recovery strategy.
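The redirect loop described above can be sketched as follows. This is an illustration, not the fetcher's actual code: `cookieHeader`, `absorbCookies`, the `requestOnce(url, headers)` parameter, and the default of five hops are all assumptions. `requestOnce` stands in for a single-request function that resolves to `{ statusCode, headers, body }`, which also makes the loop easy to exercise with a fake.

```javascript
// Serialize the jar object into a Cookie request header value.
function cookieHeader(jar) {
  return Object.entries(jar).map(([name, value]) => `${name}=${value}`).join("; ");
}

// Merge any Set-Cookie headers from a response into the jar.
function absorbCookies(jar, res) {
  for (const header of res.headers["set-cookie"] || []) {
    const pair = header.split(";")[0];
    const eq = pair.indexOf("=");
    if (eq > 0) jar[pair.slice(0, eq).trim()] = pair.slice(eq + 1).trim();
  }
}

// Follow redirects by hand, sending and accumulating cookies on every hop.
// Makes at most maxHops + 1 requests, then gives up.
async function httpGet(startUrl, jar, requestOnce, maxHops = 5) {
  let url = new URL(startUrl);
  for (let hop = 0; hop <= maxHops; hop++) {
    const res = await requestOnce(url, { Cookie: cookieHeader(jar) });
    absorbCookies(jar, res);
    const isRedirect =
      res.statusCode >= 300 && res.statusCode < 400 && res.headers.location;
    if (!isRedirect) return res;
    // Location may be relative, so resolve it against the current URL.
    url = new URL(res.headers.location, url);
  }
  throw new Error(`Too many redirects (more than ${maxHops})`);
}
```

In real use, `requestOnce` would be a thin wrapper over https.request; the hop limit is what turns a redirect loop on the portal's side into a loud failure instead of a hung cron job.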

Parsing with regex, on purpose

Here is the choice that will make some readers wince: I parse the HTML with regular expressions. Not a DOM library, not a real parser. A handful of patterns that pull table rows, then cells, then strip tags out of the cell text.

const rowRe = /<tr[^>]*class="[^"]*assignment-row[^"]*"[^>]*>([\s\S]*?)<\/tr>/gi;

The standard advice is correct in general and wrong here. You should not parse arbitrary HTML with regex, because HTML is not a regular language and the web is adversarially messy. But I am not parsing arbitrary HTML. I am parsing one table, on one page, on one portal, whose markup has a stable shape I can read with my own eyes. A real parser would add a dependency and an abstraction to handle a generality I do not have. The regex is shorter, has no dependencies, and when the portal changes its markup, the regex breaks loudly and I fix it in five minutes. A DOM parser would not have saved me from a markup change anyway; it would just have failed differently.
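To make the rows-then-cells-then-strip pipeline concrete, here is a small sketch. Only the row pattern comes from the fetcher; the cell pattern, `stripTags`, `parseAssignments`, and the sample table are made up for illustration and will not match the portal's real markup.

```javascript
// Row pattern as used in the fetcher; the cell pattern is an illustrative guess.
const rowRe = /<tr[^>]*class="[^"]*assignment-row[^"]*"[^>]*>([\s\S]*?)<\/tr>/gi;
const cellRe = /<td[^>]*>([\s\S]*?)<\/td>/gi;

// Remove tags, decode the couple of entities worth caring about, collapse whitespace.
function stripTags(html) {
  return html
    .replace(/<[^>]+>/g, " ")
    .replace(/&nbsp;/g, " ")
    .replace(/&amp;/g, "&")
    .replace(/\s+/g, " ")
    .trim();
}

// Returns one array of cell strings per matching row.
function parseAssignments(html) {
  return [...html.matchAll(rowRe)].map(([, rowHtml]) =>
    [...rowHtml.matchAll(cellRe)].map(([, cellHtml]) => stripTags(cellHtml))
  );
}

// Made-up sample markup, only here to exercise the patterns.
const sampleHtml = `
<table>
  <tr class="header"><th>Subject</th><th>Task</th></tr>
  <tr class="assignment-row odd"><td>Maths</td><td><b>Worksheet</b> 4 &amp; 5</td></tr>
</table>`;
console.log(parseAssignments(sampleHtml));
// [ [ 'Maths', 'Worksheet 4 & 5' ] ]
```

If the portal renames the class or restructures the table, `parseAssignments` returns an empty array, so checking for zero rows is a cheap way to make the breakage loud.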

The honest cost is that this is coupled to the portal's current HTML, tightly. I accepted that in Part 1 as the whole bargain of building for one user. This is where the bill comes due, and it is a small bill.

The date hack I am not proud of but would write again

The portal renders due dates as human strings like "Tuesday, 5 March 2026". I need to match a target date against that. The correct approach is to parse the string into a real date and compare. The approach I shipped checks whether the day number, the month name, and the year all appear in the string:

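// Loose match: true when the target's day number, month name, and year all
// appear in dateString. `target` is a Date or anything new Date() accepts;
// note that a date-only ISO string such as "2026-03-05" is parsed as UTC.
function isDateMatch(dateString, target) {
  const d = new Date(target);
  // \b keeps "5" from matching inside "15".
  return new RegExp(`\\b${d.getDate()}\\b`).test(dateString) &&
    // "default" resolves to the runtime's locale, which must match the portal's language.
    dateString.includes(d.toLocaleString("default", { month: "long" })) &&
    dateString.includes(String(d.getFullYear()));
}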

This is not date parsing. It is three substring checks. It would happily mismatch in a locale with a different month name, and it leans on a word boundary so that "5" does not match "15". But it handles every format the portal actually emits, it never throws on a string shape I did not anticipate, and a real date parser would have given me new failure modes (timezone drift, ambiguous formats) in exchange for correctness I do not need. For one portal in one locale, the dumb check is the dependable one.
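A quick way to see the word boundary and the month check doing their jobs, as a sketch. This copy pins the month name to en-GB and builds the target from numeric parts so the result does not depend on the machine's locale or timezone; both choices are assumptions for the demo, not what the shipped version does.

```javascript
// Demo copy of the check with the locale pinned (an assumption for repeatability).
function isDateMatch(dateString, target) {
  const d = new Date(target);
  return new RegExp(`\\b${d.getDate()}\\b`).test(dateString) &&
    dateString.includes(d.toLocaleString("en-GB", { month: "long" })) &&
    dateString.includes(String(d.getFullYear()));
}

// 5 March 2026 in local time; months are 0-based in the Date constructor.
const target = new Date(2026, 2, 5);

console.log(isDateMatch("Tuesday, 5 March 2026", target)); // true
console.log(isDateMatch("Sunday, 15 March 2026", target)); // false: \b5\b does not match "15"
console.log(isDateMatch("Sunday, 5 April 2026", target));  // false: month name differs
```

The weekday in the string is never inspected, which is part of why the check tolerates string shapes it was not written for.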

What I would tell my past self

The instinct to reach for the powerful tool (the headless browser, the proper parser, the date library) is the instinct to handle problems you do not have yet. Sometimes that is foresight. For a tool with one user and a known, narrow target, it is mostly weight. The version that survived is the one that does the least: a standard-library HTTP client, an object for cookies, a few regexes, and three substring checks for dates. It is brittle in exactly the places I can see, and it runs in about a second from a cron job that I have not had to think about in months.

The next part is where the data stops being the hard part and the judgment starts: handing these scraped notes to a language model and asking it to throw almost all of them away.

FAQ

Do I need a headless browser to scrape a school portal?

Not if the portal is server-rendered. If a couple of curl calls return the content you need in the raw HTML, a form POST and the session cookie it sets are enough. A headless browser adds hundreds of megabytes of dependencies, slow startup, and fragile failure modes for no benefit.

How do I handle cookies when scraping with Node's built-in https module?

Read the Set-Cookie headers off the login response, parse the leading name=value pair out of each one, and store them in a plain object. On subsequent requests, serialize that object into a single Cookie header: name=value pairs joined with "; ". No library required.

How do I follow redirects manually in Node https?

Loop your request function up to a fixed number of hops, checking the status code each time. On a 3xx response, read the Location header (resolving it against the current URL, since it may be relative) and repeat, accumulating any new cookies along the way, until you land on a non-redirect response. If the hop limit runs out first, fail the run.

Is it okay to parse HTML with regex instead of a DOM parser?

For a single, stable page on one portal whose markup you can read yourself, regex is a reasonable choice. It adds no dependencies, fails loudly when the markup changes, and a DOM parser would not protect you from markup changes anyway. The standard advice against regex HTML parsing applies to arbitrary, adversarially messy HTML.

How do I match a due date string like 'Tuesday, 5 March 2026' in JavaScript without a date library?

Check that the day number (as a word-boundary match), the full month name, and the year all appear in the string. It is three substring checks rather than true date parsing, and it only works while the month names you generate are in the same language the portal uses. In exchange, it handles every format a single portal emits without taking on the timezone-drift and ambiguous-format failure modes of a real date parser.