Word Frequency Counter
Count word frequency, bigram & trigram phrases, and check keyword density with a Sparse/Optimal/Over-optimized verdict. Export to CSV, JSON, or TXT.
About the Word Frequency Counter Tool
The Word Frequency Counter is a text analysis tool that identifies the most frequently used words in any text. It suits writers, researchers, SEO specialists, and data analysts who need to analyze word patterns, identify overused words, or study vocabulary distribution. Filtering options include stop word removal, case sensitivity, punctuation handling, and a customizable minimum word length.
What's the difference between word count and character count for SEO?
Word count measures discrete linguistic units separated by whitespace; character count measures every character, including spaces, punctuation, and accented letters. For SEO, both matter but at different layers. Word count is at most a loose proxy for content depth: articles ranking on competitive queries often run 1,500-2,500 words because long-form tends to be more comprehensive, but length alone is not a ranking factor. Character count dominates SERP-display elements: title tags truncate around 60 characters and meta descriptions around 155-160. Twitter/X posts cap at 280 characters, and Open Graph descriptions display about 200. This tool counts words for content-depth analysis; for SERP-snippet limits, use a character counter. Pro tip: aim for the lowest word count that fully answers user intent, because content bloat hurts the engagement metrics that do affect rankings.
What are stop words, and should I always remove them from frequency analysis?
Stop words are the most common function words in a language — English a, an, the, is, of, to, in, that, it; Spanish el, la, de, en, que; French le, la, de, est, en; Portuguese o, a, de, em, que; Vietnamese là, của, và, một, các. They carry little topical meaning and would dominate any frequency list, drowning out the words that actually distinguish your text. For SEO keyword research, content theming, and topic modeling, remove them. But for stylometry (authorship attribution), translation analysis, or linguistic research, stop words are critical — they reveal syntactic patterns that vary by author and dialect. This tool's stop-word filter uses a default list per language; you can disable it when you need every token.
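As a rough illustration of the effect, the sketch below counts the same sentence with and without a stop-word filter. The tiny `STOP_WORDS` set and the `count_words` helper are assumptions made for this example; they are not this tool's actual list or implementation.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny English stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "in", "that", "it"}

def count_words(text, remove_stop_words=True):
    """Lowercase the text, extract runs of letters, and count them."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)

text = "The cat sat in the hat that the dog left in the hall."
with_stops = count_words(text, remove_stop_words=False)
without_stops = count_words(text)

print(with_stops.most_common(2))   # stop words lead the list
print(without_stops.most_common()) # only content words remain
```

With the filter off, "the" and "in" top the list; with it on, only the six content words are left.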
How do tokenizers split Vietnamese, Chinese, and Japanese text, where spaces do not mark word boundaries?
Whitespace tokenization works well for English, Spanish, French, and Portuguese where spaces separate words. But Vietnamese, despite using Latin script with spaces, often has compound words like "học sinh" (student) that span two whitespace-separated syllables — splitting on space produces "học" and "sinh" as separate tokens, distorting frequency. Chinese and Japanese have no inter-word spaces at all. Proper tokenization requires dictionary-based segmenters: pyvi or underthesea for Vietnamese, jieba for Chinese, MeCab for Japanese. This frequency counter uses whitespace tokenization, which is accurate for Western languages and approximate for Vietnamese (syllable-level). For Chinese or Japanese, preprocess with a segmenter and paste the space-separated result.
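The syllable-level distortion can be seen in a small sketch. It assumes a plain `str.split()` tokenizer and, for the second count, a hypothetical pre-segmented input in which a segmenter has already joined the compound with an underscore; the sample sentence is made up for illustration.

```python
from collections import Counter

def whitespace_tokens(text):
    """Split on whitespace only, as a whitespace tokenizer would."""
    return text.lower().split()

# Raw Vietnamese text: the compound "học sinh" is split into two syllables.
raw = "học sinh đi học"
raw_counts = Counter(whitespace_tokens(raw))

# Hypothetical output of a segmenter that joins compounds with "_".
segmented = "học_sinh đi học"
segmented_counts = Counter(whitespace_tokens(segmented))

print(raw_counts)        # "học" is counted twice, "sinh" once
print(segmented_counts)  # "học_sinh" is counted as one unit
```

After segmentation the compound is a single token, so its frequency is no longer merged with the standalone syllable.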
How do I find the most distinctive keywords using TF-IDF instead of raw frequency?
Raw frequency tells you which words appear most in one document, but the most frequent words are often stop words or generic terms. TF-IDF (Term Frequency-Inverse Document Frequency) weights each word by how distinctive it is across a corpus: words that appear frequently in this document but rarely in the broader corpus get the highest scores. The formula is TF × log(N / DF), where TF is the word's count in this document, N is the total number of documents, and DF is the number of documents containing the word. To use this tool for TF-IDF: run frequency on each document, then for each word multiply its count by log(N / DF). A word that appears in every document scores zero, because log(N / N) = 0. Words with high scores become candidate keywords for that specific document.
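A minimal sketch of the TF × log(N / DF) formula, assuming whitespace tokenization and a three-document toy corpus invented for this example (the `tf_idf` helper is hypothetical, not part of this tool):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document using TF * log(N / DF)."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    n_docs = len(docs)
    # DF: number of documents that contain each word at least once.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    return [
        {word: tf * math.log(n_docs / df[word]) for word, tf in c.items()}
        for c in counts
    ]

docs = [
    "the cat chased the mouse",
    "the dog chased the ball",
    "the bird sang",
]
scores = tf_idf(docs)

# Rank the first document's words from most to least distinctive.
ranked = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

In the first document "the" has the highest raw count but scores zero because it occurs in every document, while "cat" and "mouse", which occur in only one, rank highest.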
Should I normalize words (stemming, lemmatization) before counting frequency?
Counting raw word forms treats "run," "runs," "running," and "ran" as four separate tokens, which often misrepresents the topic. Normalization collapses them. Stemming (e.g., Porter, Snowball) chops suffixes mechanically: "running" → "run," but also "university" → "univers." Lemmatization (e.g., WordNet, spaCy) uses dictionaries to find canonical forms: "better" → "good," "running" → "run." Lemmatization is more accurate but slower. For SEO and content analysis, lemmatization gives a truer picture of topical coverage. For Spanish, Portuguese, and French, which are heavily inflected, normalization is essential or counts will be fragmented. This tool counts surface forms; preprocess with a lemmatizer (or a stemmer, if speed matters more than accuracy) if you need normalized counts.
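The effect of normalization on counts can be sketched without any NLP library. The `LEMMAS` table below is a hypothetical, hand-written stand-in for what a real lemmatizer would supply; it only covers the forms mentioned above.

```python
from collections import Counter

# Hypothetical hand-written lemma table; a real lemmatizer would supply this mapping.
LEMMAS = {"runs": "run", "running": "run", "ran": "run", "better": "good"}

def normalized_counts(text):
    """Map each token to its lemma (if known) before counting."""
    tokens = text.lower().split()
    return Counter(LEMMAS.get(t, t) for t in tokens)

text = "run runs running ran"
surface = Counter(text.lower().split())  # four separate surface forms
normalized = normalized_counts(text)     # one lemma counted four times

print(surface)
print(normalized)
```

Surface counting reports four words with one occurrence each; after normalization the same text reports a single term with four occurrences.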

What's a good word frequency distribution for natural-sounding content?
Natural language follows Zipf's law: the n-th most frequent word appears about 1/n times as often as the most frequent. Plotted on log-log axes, this is a straight line. Healthy content shows: top stopword around 5-7% of total tokens, top content word 0.5-2%, long tail of words appearing once (hapax legomena) making up 40-50% of unique vocabulary. Red flags: any single content word above 3% suggests keyword stuffing, which can trigger Google's spam filters. Repetitive AI-generated text often shows a flatter distribution and fewer hapax legomena than human writing. Use this tool to spot over-used keywords, and aim for keyword density in the 0.5-2% range for primary terms and 0.2-0.5% for secondary.
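These distribution checks can be approximated from a token list. The sketch below is an assumption-laden illustration: the `distribution_stats` and `zipf_expected` helpers are hypothetical, and the eight-word sample is far too short to show the percentages quoted above, which only emerge on longer texts.

```python
from collections import Counter

def distribution_stats(tokens):
    """Summarize the top word's share and the hapax legomena ratio."""
    counts = Counter(tokens)
    total = sum(counts.values())
    top_word, top_count = counts.most_common(1)[0]
    hapaxes = [w for w, c in counts.items() if c == 1]
    return {
        "top_word": top_word,
        "top_share": top_count / total,            # fraction of all tokens
        "hapax_ratio": len(hapaxes) / len(counts), # fraction of unique words
    }

def zipf_expected(top_count, rank):
    """Count Zipf's law predicts for the word at the given rank (1-based)."""
    return top_count / rank

tokens = "the cat and the dog and the bird".split()
stats = distribution_stats(tokens)
print(stats)
print(zipf_expected(3, 2))  # predicted count for the 2nd-ranked word
```

Comparing each word's actual count against `zipf_expected` for its rank gives a quick sense of whether a distribution is steeper or flatter than Zipf's law predicts.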
What are bigrams and trigrams, and why count phrases instead of single words?
An n-gram is a contiguous run of n words: a bigram is a 2-word phrase ("machine learning"), a trigram a 3-word phrase ("natural language processing"). Single-word frequency tells you which terms recur, but it scatters multi-word concepts — "learning" might rank high without revealing that "machine learning" is the actual theme. Use the Phrase Length (N-gram) selector in this tool to count bigrams and trigrams: it surfaces collocations, branded phrases, and long-tail keyword targets that single-word counts hide. Bigram/trigram analysis is the fastest way to extract candidate long-tail keywords for SEO, spot repetitive filler phrases in drafts, and check whether a target key-phrase actually appears at the density you intend. Note that the keyword-density verdict (Sparse/Optimal/Over-optimized) applies to single keywords; for phrases, read the raw count and percentage instead, since the 0.5-3% density thresholds are defined for single terms.
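N-gram counting can be sketched as a sliding window over the token list. This is a minimal illustration that assumes whitespace tokenization and ignores punctuation and sentence boundaries; the `ngram_counts` helper is hypothetical and may differ from how this tool builds phrases.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count contiguous runs of n whitespace-separated words."""
    tokens = text.lower().split()
    # zip over n staggered copies of the token list to form the windows.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

text = "machine learning models improve when machine learning data improves"
bigrams = ngram_counts(text, 2)
trigrams = ngram_counts(text, 3)

print(bigrams.most_common(1))  # the repeated two-word phrase
print(len(trigrams))           # number of distinct three-word phrases
```

The bigram count surfaces "machine learning" as the one repeated phrase, which a single-word count would only show as two unrelated words.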
How do I read the Sparse / Optimal / Over-optimized density verdict?
In single-word (unigram) mode this tool tags each term with a keyword-density verdict so you do not have to do the math by hand. The thresholds follow standard SEO guidance: a primary content keyword sitting at 0.5-2% (we allow up to 3%) reads as Optimal — frequent enough to signal topical focus, not so frequent it looks manipulated. Below 0.5% is Sparse: the term may be under-used relative to your target intent, so consider weaving it in more. Above 3% is flagged Over-optimized, the classic keyword-stuffing red flag that can trip Google's spam filters and hurt readability. The summary line under the table reports your highest-density keyword and raises an overall stuffing-risk warning when any content word crosses 3%. Treat it as a fast pass/fail check, then adjust copy and re-run. The verdict travels with your CSV, JSON, and TXT exports for compliance-style reporting.
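The thresholds described above can be expressed as a small classifier. This is a sketch under assumptions: the helper names are hypothetical, and treating exactly 0.5% and exactly 3% as Optimal is a guess about boundary handling, not confirmed behavior of this tool.

```python
from collections import Counter

def density_verdict(percent):
    """Classify a keyword density given as a percentage of total words."""
    if percent < 0.5:
        return "Sparse"
    if percent <= 3.0:  # boundary handling is an assumption
        return "Optimal"
    return "Over-optimized"

def keyword_density(tokens):
    """Return {word: (density_percent, verdict)} for a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    result = {}
    for word, count in counts.items():
        percent = 100 * count / total
        result[word] = (percent, density_verdict(percent))
    return result

# Synthetic 400-word sample: 16 x "seo", 4 x "audit", 380 one-off filler words.
tokens = ["seo"] * 16 + ["audit"] * 4 + [f"w{i}" for i in range(380)]
report = keyword_density(tokens)
print(report["seo"], report["audit"], report["w0"])
```

In the synthetic sample, "seo" sits at 4% and is flagged Over-optimized, "audit" at 1% is Optimal, and each filler word at 0.25% is Sparse.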
How does word frequency analysis compare to embedding-based topic modeling?
Word frequency is a bag-of-words approach — it ignores order, syntax, and semantic similarity. "Big dog bit man" and "Man bit big dog" have identical frequency profiles. Modern topic modeling uses word embeddings (Word2Vec, GloVe, sentence-BERT) that map words and sentences into vector spaces where semantically related items cluster. Embeddings can group "car," "auto," and "vehicle" as one concept, where frequency counts them as three. For deep semantic analysis, run sentence embeddings through k-means or HDBSCAN clustering. For quick lexical exploration, keyword research, and editorial review, raw word frequency remains the fastest, most interpretable signal. They complement each other rather than compete.
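The order-blindness of a bag-of-words profile is easy to check. The sketch below assumes lowercasing and whitespace tokenization; `bag_of_words` is a hypothetical helper for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Build an order-insensitive word count profile."""
    return Counter(text.lower().split())

a = bag_of_words("Big dog bit man")
b = bag_of_words("Man bit big dog")

# The two sentences mean different things but yield the same profile.
same_profile = a == b
print(same_profile)
```

Both sentences produce the same four counts, so any analysis built only on frequency cannot tell them apart.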
How does subword tokenization in LLMs (BPE, SentencePiece) affect frequency analysis for prompt design?
Large language models do not see whole words — they see subword tokens produced by Byte-Pair Encoding (BPE) or SentencePiece. "Tokenizers" might split as "Token," "izer," "s," while "colonoscopy" might be "colon," "oscopy." Common words become single tokens; rare or non-English words fragment into many. This matters for cost (APIs bill per token), context windows (a 4k-token limit fits only ~3,000 English words but as few as 1,500 Vietnamese words due to diacritic encoding), and frequency analysis on prompts. To estimate your prompt's true token count, use OpenAI's tiktoken library or Anthropic's tokenizer. This word counter is fine for content drafting; switch to a token counter when optimizing prompts for cost or context limits.
Example Word Frequency Analysis
| Input Text | Top 3 Words | Total Words | Unique Words |
|---|---|---|---|
| The quick brown fox jumps over the lazy dog | the (2), quick (1), brown (1) | 9 | 8 |
| Hello world! Hello everyone in this world. | hello (2), world (2), everyone (1) | 7 | 5 |
| Data analysis is important. Analysis helps. | analysis (2), data (1), is (1) | 6 | 5 |
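The Total Words and Unique Words columns can be reproduced with a few lines of Python. This is a sketch that assumes case-insensitive counting, punctuation stripping, and no stop-word filtering; the `analyze` helper is hypothetical and is not this tool's implementation.

```python
import re
from collections import Counter

def analyze(text, top_n=3):
    """Return top words, total word count, and unique word count."""
    # Lowercase, then keep runs of letters so punctuation is dropped.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    return {
        "top": counts.most_common(top_n),
        "total": len(tokens),
        "unique": len(counts),
    }

samples = [
    "The quick brown fox jumps over the lazy dog",
    "Hello world! Hello everyone in this world.",
    "Data analysis is important. Analysis helps.",
]
results = [analyze(s) for s in samples]
for sample, result in zip(samples, results):
    print(sample, result)
```

Words with equal counts are listed in the order they first appear, so the ranking among ties depends on word position rather than alphabetical order.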
