Remove Duplicate Lines

Deduplicate any text list, log, CSV, or email list in one click. Keeps first occurrence, optional sort, case-insensitive matching, and removed-lines view.


Remove Duplicate Lines - Text Deduplication Tool

Remove Duplicate Lines is a line-level deduplicator that behaves like the Unix `awk '!seen[$0]++'` idiom, but with a UI, optional case-folding, optional empty-line stripping, and a side panel showing exactly which duplicates were dropped. The algorithm builds a JavaScript Set of canonicalised line keys (lowercased when case sensitivity is switched off) and walks the input once in original order, keeping each line only the first time its key appears. This 'first-occurrence-wins' policy is important: unlike `sort -u`, which reorders its output, this tool preserves your input order unless you explicitly enable Sort. Typical use cases: dedupe email recipient lists harvested from multiple newsletters, clean exported customer CSV rows where the same record was logged twice, remove repeated error messages from log files to find unique fault patterns, consolidate hostname lists for Ansible inventories, and deduplicate translation strings before merging into a localization file.
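The single-pass, Set-based walk described above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the tool's actual source: the function name `dedupeLines` and the option names `caseSensitive` and `sort` are assumptions chosen for this example.

```typescript
// Hypothetical sketch of order-preserving deduplication (names are illustrative).
interface DedupeOptions {
  caseSensitive?: boolean; // assumed default: true ('Apple' and 'apple' stay distinct)
  sort?: boolean;          // assumed default: false (input order is preserved)
}

function dedupeLines(text: string, options: DedupeOptions = {}): string[] {
  const { caseSensitive = true, sort = false } = options;
  const seen = new Set<string>();
  const kept: string[] = [];

  for (const line of text.split(/\r?\n/)) {
    // The key is what gets compared; the original line is what gets kept.
    const key = caseSensitive ? line : line.toLowerCase();
    if (seen.has(key)) continue; // a later occurrence: skip it
    seen.add(key);
    kept.push(line);
  }

  // Sorting happens only on request, after deduplication.
  return sort ? kept.slice().sort() : kept;
}
```

With the input `"b\na\nb\nA"`, this sketch returns `["b", "a", "A"]` by default and `["b", "a"]` when `caseSensitive` is `false`; in both cases the surviving lines stay in their original order.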

What exactly counts as a 'duplicate line' — does whitespace and case matter?

By default, two lines are considered duplicates if their character sequences match exactly — leading and trailing spaces count, and case matters ('Apple' is different from 'apple'). Toggling the case option to UPPER or lower normalises both compared strings before matching, so 'APPLE', 'Apple', and 'apple' collapse to one line. To also ignore whitespace differences (' a' vs 'a '), pre-process with a Text Cleaner or trim each line first.
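To make the matching rules concrete, here is a small sketch of how a comparison key might be derived from a line. The helper `lineKey` and its two flags are hypothetical; in particular, edge-whitespace trimming is shown only to illustrate the pre-processing step suggested above and is not claimed to be an option of the tool.

```typescript
// Hypothetical key builder: two lines count as duplicates when their keys are equal.
function lineKey(
  line: string,
  ignoreCase: boolean,           // fold case before comparing
  ignoreEdgeWhitespace: boolean  // assumed pre-processing step, not a tool option
): string {
  let key = ignoreEdgeWhitespace ? line.trim() : line;
  if (ignoreCase) key = key.toLowerCase();
  return key;
}

// Exact matching: case and surrounding spaces both matter.
const exactSame = lineKey("Apple", false, false) === lineKey("apple", false, false); // false
// Case folding: 'APPLE' and 'Apple' share one key.
const foldedSame = lineKey("APPLE", true, false) === lineKey("Apple", true, false);  // true
// Trimming: ' a' and 'a ' share one key.
const trimmedSame = lineKey(" a", false, true) === lineKey("a ", false, true);       // true
```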

Which copy is kept when there are duplicates — first or last occurrence?

Always the first occurrence. The dedup walker uses a Set that records line content the first time it appears, and skips every subsequent identical line. This matters when your input has ordering significance — e.g., a CSV where row 1 is the canonical record and row 7 is a stale duplicate import. If you instead need the last occurrence kept (common in 'last-write-wins' merges), reverse the list first with the Reverse List tool, dedupe, then reverse back.
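The reverse, dedupe, reverse trick can be sketched as follows. Both function names are hypothetical and stand in for the tool's steps; this is an illustration under exact, case-sensitive matching, not the tool's code.

```typescript
// Hypothetical helpers illustrating first-wins versus last-wins deduplication.
function dedupeKeepFirst(lines: string[]): string[] {
  const seen = new Set<string>();
  return lines.filter((line) => {
    if (seen.has(line)) return false; // already kept an earlier copy
    seen.add(line);
    return true;
  });
}

function dedupeKeepLast(lines: string[]): string[] {
  // Reverse a copy, keep the first occurrence (which was the last), then restore order.
  return dedupeKeepFirst(lines.slice().reverse()).reverse();
}
```

For `["a", "b", "a", "c"]`, `dedupeKeepFirst` gives `["a", "b", "c"]` while `dedupeKeepLast` gives `["b", "a", "c"]`: the same set of lines, but each one sits at the position of its last appearance instead of its first.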

How does this differ from Unix `sort -u`, `awk`, or `uniq`?

`uniq` only collapses adjacent duplicates, so it removes every duplicate only when the input is already sorted; on unsorted input, non-adjacent duplicates survive. `sort -u` sorts and dedupes but destroys original order. `awk '!seen[$0]++'` is order-preserving deduplication and matches what this tool does, but requires a terminal. Excel's 'Remove Duplicates' works similarly but is limited by the worksheet row count and locks you to one platform. This tool gives the same result as the awk approach with a GUI, plus a removed-lines panel that the one-liners above do not show by default.
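The three behaviours can be compared side by side with small TypeScript stand-ins. These functions only mimic the semantics of the command-line tools for illustration; their names are made up for this example, and the sort uses JavaScript's default string ordering rather than any locale collation.

```typescript
// Like `uniq`: drops a line only when it equals the line directly before it.
function collapseAdjacent(lines: string[]): string[] {
  return lines.filter((line, i) => i === 0 || line !== lines[i - 1]);
}

// Like `sort -u`: unique lines, but in sorted order rather than input order.
function sortUnique(lines: string[]): string[] {
  return Array.from(new Set(lines)).sort();
}

// Like `awk '!seen[$0]++'`: unique lines in first-seen order.
function orderPreservingUnique(lines: string[]): string[] {
  const seen = new Set<string>();
  const kept: string[] = [];
  for (const line of lines) {
    if (!seen.has(line)) {
      seen.add(line);
      kept.push(line);
    }
  }
  return kept;
}
```

For the input `["b", "a", "a", "b"]`, `collapseAdjacent` returns `["b", "a", "b"]` (the second `b` is not adjacent to the first), `sortUnique` returns `["a", "b"]`, and `orderPreservingUnique` returns `["b", "a"]`.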


What does the 'Display removed' option actually show?

It shows a separate panel listing every duplicate line that was skipped, in the order they were encountered. Each entry shows the content and (where helpful) which input line number it sat on. This audit trail is useful for compliance scenarios, such as cleaning a customer database under GDPR, because it lets you show that a specific row was dropped as a duplicate rather than silently lost. It also helps debug case-sensitivity surprises, such as two email addresses that differ only in letter case.
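One way such a removed-lines list could be collected is to record each skipped line alongside its 1-based input position during the same pass. The sketch below is an assumption about the approach, not the tool's implementation; the `RemovedLine` shape and the function name `dedupeWithAudit` are hypothetical.

```typescript
// Hypothetical audit record for a dropped duplicate.
interface RemovedLine {
  lineNumber: number; // 1-based position in the input
  content: string;    // the duplicate line exactly as it appeared
}

function dedupeWithAudit(lines: string[]): { kept: string[]; removed: RemovedLine[] } {
  const seen = new Set<string>();
  const kept: string[] = [];
  const removed: RemovedLine[] = [];

  lines.forEach((content, index) => {
    if (seen.has(content)) {
      // Duplicates are recorded in the order they are encountered.
      removed.push({ lineNumber: index + 1, content });
    } else {
      seen.add(content);
      kept.push(content);
    }
  });

  return { kept, removed };
}
```

For `["x", "y", "x", "x"]`, the sketch keeps `["x", "y"]` and reports the duplicates found on input lines 3 and 4.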

Does it scale to large files like a 100,000-row CSV?

Yes. The Set-based deduplication is O(n) average time and O(n) memory, so a 100k-line list typically dedupes in under 100 ms on a modern laptop. The browser textarea is the bottleneck: beyond ~5 MB of pasted text the paste itself can lag, but the dedupe itself stays fast. For multi-million-row files use Unix: `awk '!seen[$0]++' input.txt > output.txt` reads the input as a stream instead of loading the whole file at once. Its `seen` array still holds every unique line in memory, so memory use grows with the number of distinct lines rather than with the total file size.

Will the 'Remove empty lines' option strip whitespace-only rows too?

Yes. When enabled, lines that are entirely empty or contain only whitespace characters (spaces, tabs, non-breaking spaces) are dropped before deduplication runs. This is useful because blank rows in CSV data often duplicate each other (every empty row looks identical), inflating your 'duplicates removed' count without removing real content. Disable this option if you want to keep blank separator lines between sections.
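A blank-line filter of this kind can be sketched with `String.prototype.trim()`, which strips spaces, tabs, and non-breaking spaces (U+00A0) among other whitespace. The function name `dropBlankLines` is hypothetical, and whether the tool uses `trim()` internally is an assumption.

```typescript
// Hypothetical pre-pass: drop lines that are empty or whitespace-only.
function dropBlankLines(lines: string[]): string[] {
  // trim() returns "" for lines made up solely of whitespace characters.
  return lines.filter((line) => line.trim() !== "");
}
```

Given `["a", "", "  ", "\t", "\u00a0", "b"]`, the sketch returns `["a", "b"]`. Running it before deduplication means blank rows never reach the duplicate counter.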

Is my data uploaded or stored anywhere?

No. The deduplication runs entirely in browser JavaScript on the textarea value — there is no fetch() to a backend, no analytics event with content, no localStorage write. You can verify in DevTools Network tab that clicking Remove makes zero outbound requests. This makes the tool safe for sensitive lists like employee emails, customer records, internal hostnames, or copyrighted content under NDA.