I Built a Study-Notes Summarizer in the Browser — No AI API, Just Word-Frequency Scoring

Finals season is brutal. You have 40 pages of lecture notes, 3 hours left, and a brain that stopped cooperating at page 12.

What if you could paste all those notes and instantly get the 5 sentences that actually matter?

That's what I built today — a fully browser-side study-notes summarizer. No OpenAI key. No backend. No data leaving your machine. Just pure vanilla JavaScript and a technique called extractive summarization.

Here's how it works, explained like you're a person and not a research paper.

The core idea: not all sentences are equal

Think about how you'd summarize something manually. You'd read through, notice which topics keep popping up, and highlight the sentences that mention those topics most.

That's literally what the algorithm does. It counts words.

The insight: sentences that use the most-discussed words are probably the most important sentences.

Step 1: Split the text into sentences

Before scoring anything, we need individual sentences to score.

function splitSentences(text) {
  return text
    .replace(/([.!?])\s+/g, "$1\n")
    .split("\n")
    .map(s => s.trim())
    .filter(s => s.length > 20);
}

We split after every ., !, or ? that is followed by whitespace (existing line breaks count as boundaries too), then throw away anything that is 20 characters or shorter (those are usually headers or artifacts, not real sentences).
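To see what the splitter actually keeps and drops, here is a small run on made-up notes. The sample text is my own, not from the demo, and the function is copied from above so the snippet runs on its own.

```javascript
function splitSentences(text) {
  return text
    .replace(/([.!?])\s+/g, "$1\n") // put a line break after sentence-ending punctuation
    .split("\n")
    .map(s => s.trim())
    .filter(s => s.length > 20);    // drop short fragments
}

// Made-up sample notes for illustration.
const sample =
  "Overfitting happens when a model memorizes noise. " +
  "Use regularization, e.g. L2 weight decay, to reduce it! " +
  "Ok. " +
  "Does a validation set help detect this problem?";

const parts = splitSentences(sample);
console.log(parts);
// [
//   "Overfitting happens when a model memorizes noise.",
//   "Use regularization, e.g.",        // cut short at the abbreviation
//   "L2 weight decay, to reduce it!",
//   "Does a validation set help detect this problem?"
// ]
// "Ok." is dropped by the length filter.
```

Notice the weak spot: the regex can't tell "e.g." from the end of a sentence. For typical lecture notes that is probably an acceptable trade-off, but notes full of abbreviations would need a smarter splitter.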

Step 2: Build a word-frequency table — and kill the stopwords

Now we count how often each word appears across the whole text. But here's the trick: we drop stopwords first.

Stopwords are the glue words — "the", "a", "is", "in", "and", "or". They appear everywhere, so if we counted them, every sentence would look equally important. We want signal, not noise.

const STOPWORDS = new Set([
  "the", "a", "an", "is", "it", "in", "on", "at", "to",
  "and", "or", "but", "of", "for", "with", "this", "that",
  "are", "was", "were", "be", "been", "has", "have", "had"
  // ... about 50 more
]);

function wordFreq(sentences) {
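  // Prototype-free object, so a word like "constructor" starts at undefined, not an inherited function
  const freq = Object.create(null);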
  for (const s of sentences) {
    for (const w of tokenize(s)) {
      if (!STOPWORDS.has(w) && w.length > 2) {
        freq[w] = (freq[w] || 0) + 1;
      }
    }
  }
  return freq;
}

After this step, you have a dictionary like:

"learning" → 8
"data" → 6
"model" → 5
"overfitting" → 4

Those are your topics. Those are what the text is about.
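One gap: wordFreq calls a tokenize helper that isn't shown in this post. Here is a minimal sketch of what it could look like. This is my assumption, not necessarily what the live demo uses: lowercase everything, then split on anything that isn't an ASCII letter, a digit, or an apostrophe.

```javascript
// Hypothetical helper: the post calls tokenize() but does not show it.
function tokenize(sentence) {
  return sentence
    .toLowerCase()
    .split(/[^a-z0-9']+/) // break on anything that is not a letter, digit, or apostrophe
    .filter(Boolean);     // drop empty strings left at the edges
}

console.log(tokenize("Overfitting hurts the model's accuracy!"));
// ["overfitting", "hurts", "the", "model's", "accuracy"]
```

Lowercasing matters because "Model" and "model" should count as the same word. The ASCII-only pattern is a simplification: notes with accented or non-Latin characters would need a Unicode-aware pattern instead.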

Step 3: Score each sentence

Now score every sentence based on how many high-frequency words it contains: add up the frequency of each content word, then divide by the number of content words (stopwords excluded) so long, rambling sentences don't win just by being long.

function scoreSentence(sentence, freq) {
  const words = tokenize(sentence)
    .filter(w => !STOPWORDS.has(w) && w.length > 2);
  if (!words.length) return 0;
  const sum = words.reduce((acc, w) => acc + (freq[w] || 0), 0);
  return sum / words.length;
}

A sentence like "Machine learning models trained on large datasets achieve better generalization" will score high because learning, models, datasets, and generalization all appear frequently in your notes.

A sentence like "This is also true in many ways" will land near the bottom of the ranking. Its words are either stopwords or barely appear anywhere else in the notes. It's filler.
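Here is the arithmetic worked out on a toy frequency table. The table, the shortened stoplist, and the one-line tokenizer are all assumptions made for the example, not the demo's real data.

```javascript
// Shortened stoplist and an assumed tokenizer, just for this example.
const STOPWORDS = new Set(["the", "is", "on", "this", "in", "also"]);
const tokenize = s => s.toLowerCase().match(/[a-z0-9']+/g) || [];

function scoreSentence(sentence, freq) {
  const words = tokenize(sentence)
    .filter(w => !STOPWORDS.has(w) && w.length > 2);
  if (!words.length) return 0;
  const sum = words.reduce((acc, w) => acc + (freq[w] || 0), 0);
  return sum / words.length; // average frequency of the content words
}

// Toy frequency table (illustrative numbers).
const freq = { learning: 8, data: 6, model: 5, overfitting: 4, true: 1, many: 1, ways: 1 };

const topical = scoreSentence("The model is overfitting on the data", freq);
const filler = scoreSentence("This is also true in many ways", freq);

console.log(topical); // (5 + 4 + 6) / 3 = 5
console.log(filler);  // (1 + 1 + 1) / 3 = 1
```

A sentence's own words are part of the count, so any sentence with at least one content word scores 1 or more. Filler sits at that floor while topical sentences climb above it.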

Step 4: Pick the top N — but restore original order

Here's the part beginners miss. If you just return sentences in score order, the summary sounds like a jumbled mess because sentences reference things from earlier in the text.

Sort by score, take the top N, then re-sort by original position.

function pickTop(sentences, scores, n) {
  return scores
    .map((score, i) => ({ i, score }))
    .sort((a, b) => b.score - a.score)  // rank by importance
    .slice(0, n)
    .map(x => x.i)
    .sort((a, b) => a - b)              // restore reading order!
    .map(i => sentences[i]);
}

Now the summary reads like a coherent passage rather than a shuffle of random facts.
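To show how the four steps fit together, here is a compact end-to-end sketch. The summarize wrapper, the contentWords helper, the one-line tokenizer, and the sample notes are my own additions for illustration. It also uses a Map for the counts, which is a swap from the plain object used above.

```javascript
// The stopwords listed in the post; the real list is longer.
const STOPWORDS = new Set([
  "the", "a", "an", "is", "it", "in", "on", "at", "to",
  "and", "or", "but", "of", "for", "with", "this", "that",
  "are", "was", "were", "be", "been", "has", "have", "had"
]);

// Assumed tokenizer and content-word filter (hypothetical helpers).
const tokenize = s => s.toLowerCase().match(/[a-z0-9']+/g) || [];
const contentWords = s =>
  tokenize(s).filter(w => !STOPWORDS.has(w) && w.length > 2);

function splitSentences(text) {
  return text
    .replace(/([.!?])\s+/g, "$1\n")
    .split("\n")
    .map(s => s.trim())
    .filter(s => s.length > 20);
}

// Hypothetical wrapper around steps 1 to 4.
function summarize(text, n) {
  const sentences = splitSentences(text);

  // Step 2: count content words across all sentences.
  const freq = new Map();
  for (const s of sentences) {
    for (const w of contentWords(s)) freq.set(w, (freq.get(w) || 0) + 1);
  }

  // Step 3: average frequency of each sentence's content words.
  const scores = sentences.map(s => {
    const words = contentWords(s);
    if (!words.length) return 0;
    return words.reduce((acc, w) => acc + freq.get(w), 0) / words.length;
  });

  // Step 4: take the top n by score, then restore reading order.
  return scores
    .map((score, i) => ({ i, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, n)
    .map(x => x.i)
    .sort((a, b) => a - b)
    .map(i => sentences[i]);
}

// Made-up sample notes.
const notes =
  "Gradient descent updates the model weights step by step. " +
  "The weather was nice during the lecture today. " +
  "Overfitting means the model memorizes the training data. " +
  "Regularization keeps the model weights small during training. " +
  "Remember to bring a calculator on Friday.";

const summary = summarize(notes, 3);
console.log(summary);
// Keeps the gradient descent, overfitting, and regularization sentences,
// in their original order; the weather and calculator lines are dropped.
```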

Extractive vs abstractive — what's the difference?

Extractive summarization (what we just built) picks real sentences verbatim from the original text. It cannot make things up. It cannot hallucinate. It works offline. It respects your privacy completely.

Abstractive summarization (what ChatGPT does) generates brand new text. It can be more fluent and can combine ideas from multiple sentences. But it can also make stuff up, it needs an API call, and your notes leave your device.

For studying, extractive wins. You want the actual sentences from your notes — not a paraphrase that might subtly change the meaning of something.

What it looks like live

You paste 20 paragraphs of machine learning notes. You set the slider to 5 sentences. You click Summarize.

You get back:

5 of the most information-dense sentences, in reading order
A list of key concepts (the top keywords by frequency)
A compression ratio ("78% compressed — 20 sentences → 5")
A debug panel showing every sentence ranked by score, so you can see exactly why it picked what it picked
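The key-concept list and the compression figure both fall out of data you already have. Here is a rough sketch; the function names are hypothetical and the demo may compute these differently.

```javascript
// Hypothetical helper: the k most frequent words in a frequency table.
function topKeywords(freq, k) {
  return Object.entries(freq)
    .sort((a, b) => b[1] - a[1]) // highest count first
    .slice(0, k)
    .map(([word]) => word);
}

// Hypothetical helper: percentage of units removed, rounded to a whole number.
function compressionPercent(originalCount, summaryCount) {
  if (originalCount === 0) return 0;
  return Math.round((1 - summaryCount / originalCount) * 100);
}

const table = { learning: 8, data: 6, model: 5, overfitting: 4 };

console.log(topKeywords(table, 3));      // ["learning", "data", "model"]
console.log(compressionPercent(20, 5));  // 75
```

Counted by sentences, 20 down to 5 comes out at 75%. A figure like 78% would have to be measured in a different unit, such as words or characters, since the kept sentences aren't all the same length.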

Everything happens in your browser. The notes never leave your device.

Why this works better than you'd expect

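Word-frequency scoring is decades-old technology: Hans Peter Luhn described frequency-based automatic summarization back in 1958. The same counting idea sits underneath classic search-engine ranking, and simple variants of it are still a common way to build features like "related articles".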

For study notes specifically, it works really well because:

Lecture notes are repetitive by design. Teachers repeat important concepts. The algorithm rewards repeated words.
Your notes already strip fluff. You don't write "it is worth noting that" in your notes — you write the actual fact. Less stopword noise.
You wrote the notes yourself. The sentences already match your mental model of the topic.

The TF (term frequency) scoring you see here is also the first half of TF-IDF, which powers document search at scale. Same concept, extended to compare documents against a corpus.
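If you're curious what the IDF half adds, here is a tiny sketch of one common TF-IDF variant: raw count times the natural log of N over document frequency. The function and the toy corpus are illustrative only and are not part of the summarizer.

```javascript
// Hypothetical sketch: each document is already tokenized into an array of words.
function tfidf(term, doc, corpus) {
  const tf = doc.filter(w => w === term).length;          // raw count in this document
  const df = corpus.filter(d => d.includes(term)).length; // documents containing the term
  if (df === 0) return 0;
  return tf * Math.log(corpus.length / df);               // rarer across the corpus = heavier
}

// Toy corpus of three tokenized documents.
const corpus = [
  ["model", "overfitting", "model", "data"],
  ["model", "gradient", "descent"],
  ["model", "data", "split"],
];

console.log(tfidf("model", corpus[0], corpus));       // 0, because "model" is in every document
console.log(tfidf("overfitting", corpus[0], corpus)); // 1 * ln(3), roughly 1.0986
```

A word that shows up in every document carries no weight, however often it repeats. It's the same instinct as the stopword list, except learned from the corpus instead of hard-coded.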

Try it

The full thing is live at dev48v.infy.uk/solvefromzero.php — Day 6.

Load the sample notes (20 sentences of ML content), try different summary lengths, and flip open the debug panel to see every sentence's score. It's a pretty good intuition-builder for how extractive summarization actually works under the hood.

The source is about 120 lines of vanilla JS — no libraries, no build step, no dependencies.

This is Day 6 of SolveFromZero — I'm building 50 real hackathon problems from scratch, one per day. Each day has three tabs: see it working, understand the algorithm, and build it step by step.

Yesterday was a resume bullet improver (regex-based Verb·What·Impact rewriting). Tomorrow: sign-language to text via webcam and MediaPipe.

If you're studying for exams and want to try this on your actual notes — drop a comment with how well it worked. I'm curious what subject it struggles with most.
