Text Extractor

Extract emails, URLs, phone numbers, IPs, dates, hex colors, MAC addresses, credit cards, hashtags and mentions from any text. Regex-based, browser-local, no upload.


About Text Extractor

The Text Extractor pulls structured data out of unstructured text using carefully tuned regular expressions. Paste an invoice, an email thread, a chat log, a server output, or a scraped web page and instantly isolate every email address, link, phone number, IPv4/IPv6 address, hashtag or @mention you need. Marketers use it for lead lists, developers for log triage, researchers for citation harvesting, and support teams for ticket parsing. Everything runs locally in JavaScript so sensitive contacts never leave your machine, and you can deduplicate, sort, and case-filter results before exporting.

What regex patterns do you use for email extraction, and how accurate are they?

We use a pragmatic subset of RFC 5322 that matches roughly 99% of real-world emails while rejecting most false positives. The pattern is /[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/g, which accepts dots, plus-aliases (such as name+tag@example.com), and TLDs of 2 characters (like .uk) or more. It does not match exotic forms like quoted local parts ("john doe"@example.com) or comments. Those represent under 0.01% of inboxes, and including them would explode the regex into something unreadable. For 100% RFC compliance you would need a proper parser, but for lead generation, log parsing, or contact harvesting this regex catches everything practical and runs in milliseconds even on megabyte-sized inputs.
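
A minimal sketch of how the pattern quoted above behaves, runnable in a browser console or Node.js. The helper name extractEmails and the sample text are illustrative assumptions, not the tool's actual source code.

```javascript
// Pattern quoted in the answer above. The `g` flag is required by matchAll.
const EMAIL_RE = /[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}/g;

// Hypothetical helper: returns every email-like substring in order of appearance.
function extractEmails(text) {
  return Array.from(text.matchAll(EMAIL_RE), (m) => m[0]);
}

const emailSample =
  'Contact jane.doe+news@example.co.uk or "john doe"@example.com today.';

// The plus-alias address is found; the quoted local part is not,
// because the character right before its "@" is a double quote.
console.log(extractEmails(emailSample));
// → [ 'jane.doe+news@example.co.uk' ]
```

Note that the quoted-local-part address is skipped entirely rather than partially matched, which is the trade-off described above.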

How do you detect international phone numbers — do you support E.164 format?

We match several formats heuristically. The primary pattern catches optional country code (+1 to +999), optional area code parentheses, and digit groups separated by spaces, dashes, dots, or nothing — covering US/Canada (123) 456-7890, European 020 7946 0958, and E.164 +442079460958. Pure E.164 is a strict ITU-T standard requiring + followed by up to 15 digits with no separators; we match it but also accept the common formatted variants people actually write in text. Be aware: this pattern will produce false positives on long numeric strings like order IDs or timestamps — always sanity-check extracted phone lists with a validator like libphonenumber if accuracy matters for billing or compliance.
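
The heuristic described above could be approximated as follows. This is a sketch only: the regex, the 7-to-15 digit filter, and the name extractPhones are assumptions made for illustration, and the tool's real pattern may differ. It uses a lookbehind, which requires a reasonably modern JavaScript engine.

```javascript
// Assumed pattern: optional +CC, optional (area), then 2-5 digit groups
// separated by a space, dot, dash, or nothing. The lookarounds stop the
// match from starting or ending in the middle of a longer digit run.
const PHONE_RE =
  /(?<!\d)(?:\+\d{1,3}[\s.\-]?)?(?:\(\d{1,4}\)[\s.\-]?)?\d{2,4}(?:[\s.\-]?\d{2,4}){1,4}(?!\d)/g;

// Hypothetical helper: keeps candidates with 7 to 15 digits.
// 15 is the E.164 maximum; the lower bound of 7 is an assumption.
function extractPhones(text) {
  return Array.from(text.matchAll(PHONE_RE), (m) => m[0]).filter((candidate) => {
    const digitCount = candidate.replace(/\D/g, "").length;
    return digitCount >= 7 && digitCount <= 15;
  });
}

const phoneSample =
  "Call (123) 456-7890 or 020 7946 0958, intl +442079460958. Order 1700000000 shipped.";

// The order number is also returned: a 10-digit run looks like a phone number.
console.log(extractPhones(phoneSample));
```

The last entry in the output illustrates the false-positive problem mentioned above, which is why a validator such as libphonenumber is still worth running on the result.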

Why are some URLs in my text not being extracted?

Our URL regex requires either an explicit scheme (http://, https://, ftp://) or a www. prefix. Bare domains like example.com mentioned in prose get skipped intentionally: distinguishing 'I visited example.com yesterday' (a URL) from 'check my email user@example.com' (just a domain) is impossible without context, so we err on the side of fewer false positives. Punycode IDN domains (xn--80akhbyknj4f) work. Internationalized domains in native script (例え.jp) are not matched in the current build because their detection requires a lookup table. URLs ending in punctuation (period, comma, parenthesis) have the trailing punctuation stripped automatically, since it almost always belongs to the surrounding sentence rather than the link.
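
The two rules above (require a scheme or www. prefix, then strip trailing punctuation) can be sketched like this. The regexes and the name extractUrls are illustrative assumptions rather than the tool's actual implementation.

```javascript
// Assumed pattern: an explicit scheme or a "www." prefix, then everything
// up to the next whitespace, angle bracket, or quote character.
const URL_RE = /\b(?:(?:https?|ftp):\/\/|www\.)[^\s<>"']+/gi;

// Trailing periods, commas, and closing parentheses are treated as
// sentence punctuation rather than part of the link.
const TRAILING_PUNCT_RE = /[.,)]+$/;

// Hypothetical helper: extract URLs and clean up their tails.
function extractUrls(text) {
  return Array.from(text.matchAll(URL_RE), (m) =>
    m[0].replace(TRAILING_PUNCT_RE, "")
  );
}

const urlSample =
  "See https://example.com/docs. Also (www.example.org/a), but not example.com or user@example.com.";

// Bare domains and email addresses are skipped; punctuation is stripped.
console.log(extractUrls(urlSample));
```

One consequence of this design is that a URL legitimately ending in a closing parenthesis loses that character, which is the price of keeping sentence punctuation out of the results.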


Is there a size limit on input text, and how fast is extraction?

The practical limit is about 10 MB of text. Beyond that, extraction runs long enough to occupy the UI thread and make the page feel unresponsive. On a typical laptop, extracting all entity types from 1 MB of mixed text takes 50-150 ms; from 10 MB it takes 1-3 seconds. The bottleneck is the browser's regex engine (V8 in Chrome), not memory. We run patterns sequentially rather than in parallel because Web Workers add overhead that exceeds the savings for inputs under 50 MB. If you need to extract from huge corpora (GB-scale), do it server-side with grep -oE or ripgrep rather than in a browser: those tools stream the data and avoid loading it all into memory at once.

Can I extract entities the tool does not natively support, like dates or product codes?

Not yet through the UI, but you can post-process the All Numbers output with a quick regex of your own in DevTools or a spreadsheet. Common requests: ISBN-13 (978-3-16-148410-0), credit card numbers (Luhn-validated), bitcoin addresses (base58 with leading 1 or 3), social security numbers (XXX-XX-XXXX), MAC addresses (00:1A:2B:3C:4D:5E), and IBANs. We deliberately skip credit cards and SSNs to avoid creating a PII harvesting tool. If you have a specific pattern you extract often, file a feature request — adding a regex takes minutes once we know the use case is broad enough to justify a UI checkbox.
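
As an example of the do-it-yourself post-processing suggested above, the following sketch could be pasted into DevTools to pull MAC addresses and checksum-validated ISBN-13s out of a block of text. The patterns and function names are assumptions written for this illustration; they are not part of the tool.

```javascript
// Assumed patterns for two of the commonly requested entity types.
const MAC_RE = /\b[0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5}\b/g;
const ISBN13_RE = /\b97[89](?:-?\d){10}\b/g;

// ISBN-13 check: weights alternate 1, 3, 1, 3, ... across the 13 digits
// and the weighted sum must be divisible by 10.
function isValidIsbn13(candidate) {
  const digits = candidate.replace(/-/g, "");
  if (!/^\d{13}$/.test(digits)) return false;
  let sum = 0;
  for (let i = 0; i < 13; i++) {
    sum += Number(digits[i]) * (i % 2 === 0 ? 1 : 3);
  }
  return sum % 10 === 0;
}

// Hypothetical helper: returns both entity lists for a block of text.
function extractCustom(text) {
  return {
    macs: text.match(MAC_RE) ?? [],
    isbns: (text.match(ISBN13_RE) ?? []).filter(isValidIsbn13),
  };
}

const customSample =
  "Device 00:1A:2B:3C:4D:5E ordered ISBN 978-3-16-148410-0 and 978-3-16-148410-1.";

// The second ISBN fails the checksum and is dropped.
console.log(extractCustom(customSample));
```

Pairing a loose regex with a checksum keeps the pattern readable while filtering out digit runs that merely look like an ISBN.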

How does case-sensitive matching affect duplicate detection?

When 'Remove Duplicates' is enabled, we hash each match into a Set. With case-sensitive OFF (default), we lowercase strings first, so 'John@Example.com' and 'john@example.com' collapse to one entry. That is usually what you want for emails and domains: domain names are case-insensitive, and although RFC 5321 technically allows case-sensitive local parts, virtually all mail providers ignore case. With case-sensitive ON, the original casing matters, which is correct for URLs (paths after the domain ARE case-sensitive on Unix servers) and for hashtags and mentions when you need to preserve the casing exactly as written (#Bitcoin vs #bitcoin). The toggle exists because there is no universally correct answer: emails behave one way, URL paths another, and you should match the convention of whatever system consumes your extracted list.
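
The Set-based deduplication described above can be sketched as follows. The function name dedupe and the choice to keep the first-seen casing in the output are assumptions made for this illustration; the tool itself may emit the lowercased form instead.

```javascript
// Hypothetical helper: removes duplicates, optionally ignoring case.
// The Set stores the comparison key; the result keeps the original string.
function dedupe(matches, caseSensitive = false) {
  const seen = new Set();
  const result = [];
  for (const match of matches) {
    const key = caseSensitive ? match : match.toLowerCase();
    if (!seen.has(key)) {
      seen.add(key);
      result.push(match); // assumption: preserve the first-seen casing
    }
  }
  return result;
}

const found = ["John@Example.com", "john@example.com", "#Bitcoin", "#bitcoin"];

console.log(dedupe(found));
// → [ 'John@Example.com', '#Bitcoin' ]

console.log(dedupe(found, true));
// → [ 'John@Example.com', 'john@example.com', '#Bitcoin', '#bitcoin' ]
```

A Set gives constant-time membership checks on average, so the pass stays linear in the number of matches.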

Why does extraction happen in the browser instead of on a server?

Three reasons. Privacy: emails, phone numbers, and IPs often qualify as personal data under GDPR Article 4. Keeping them client-side means we never store, log, or process your contacts on our servers, so there is nothing on our side to breach. Speed: round-tripping text to a server adds 50-300 ms of network latency that local regex avoids entirely; for batch workflows this compounds. Cost: client-side processing scales to millions of users at zero compute cost to us, letting the tool stay free forever. The trade-off is no server-side intelligence (no ML-based entity recognition, no spelling-corrected matching). For those use cases, paid services like Google Cloud Natural Language API or Amazon Comprehend are appropriate, but for regex-style extraction, the browser is faster, safer, and free.