The single biggest cost-control mistake we see in ADA Title II remediation projects is starting before you know your scope. The natural instinct is to ask a vendor for a quote, get a number, and react. But the vendor's number is sized to their pricing model, not your actual corpus, and the corpus size that pricing model assumes is usually wrong.
Before you ask for vendor quotes, get your own number. Here's how, and what to expect.
## Real numbers from California agencies
The page counts below are from corpus audits we've completed for California public entities. Specific names are anonymized, but the numbers are real.
| Entity Type | Population Served | Staff Estimate (pages) | Actual Pages | Multiplier |
|---|---|---|---|---|
| Special District (water) | n/a | ~5,000 | 34,200 | 6.8x |
| Small City | ~12,000 | ~10,000 | 61,500 | 6.2x |
| Mid-size City | ~85,000 | ~50,000 | 312,000 | 6.2x |
| Large City | ~280,000 | ~200,000 | 1,140,000 | 5.7x |
| County | ~700,000 | ~400,000 | 2,300,000 | 5.8x |
| Housing Authority | (serves ~30,000) | ~3,000 | 22,800 | 7.6x |
| School District (K-12) | (15,000 students) | ~25,000 | 180,000 | 7.2x |
The 5–8x multiplier is consistent. When agency staff estimate their corpus from memory, they remember the documents they actively work with: current agendas, recent staff reports, the active permit forms. They forget the seven-year archive of capital improvement plans, the fifteen-year archive of council resolutions, the thirty-year archive of public notices, and every PDF anyone ever attached to a public-records-request response.
## Why the underestimate happens
Three structural reasons:
- Documents accumulate silently. Every department has been adding PDFs to its section of the website for years. Nobody has the cross-departmental view.
- The CMS hides the count. Most agency websites don't surface a PDF inventory. Documents are linked from individual pages, attached to news items, embedded in event listings. There's no single list.
- The "page" unit isn't intuitive. Staff count documents, not pages. A 250-page board packet is one document — but for compliance and pricing, it's 250 units of work.
## Five methods to count, in order of accuracy

### Method 1: Web crawl (best)
Crawl every public URL on your website with a tool that detects PDF links. Download each PDF, extract the page count from PDF metadata, sum the totals. This produces the actual number, deduplicated by URL.
Tools: Screaming Frog (commercial), wget + pdfinfo (free, technical), or a vendor that runs the crawl for you (e.g., SentraCheck's free corpus audit). Expect 1–5 days to complete on a normal-sized agency site.
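The link-collection step of a crawl can be sketched with only the Python standard library. This is a minimal illustration, not a full crawler: it finds PDF links in one page's HTML and resolves them to absolute, deduplicated URLs. Downloading each file and summing page counts (with `pdfinfo` or a PDF library) would follow. The example URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collects absolute URLs of PDF links found in an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = set()  # set() deduplicates by URL

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Ignore query strings when testing the extension
        if href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_urls.add(urljoin(self.base_url, href))


def collect_pdf_links(html, base_url):
    """Return the sorted, deduplicated PDF URLs linked from one page."""
    parser = PdfLinkCollector(base_url)
    parser.feed(html)
    return sorted(parser.pdf_urls)
```

Run per page over every URL the crawler discovers; the union of the resulting sets is the corpus inventory, deduplicated by URL as described above.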
### Method 2: CMS export
If your website runs on Drupal, WordPress, Sitecore, or similar, query the database for all uploaded files of type PDF. This catches every document in the CMS but misses documents linked from external systems (e.g., agendas posted to a separate civic-engagement platform).
Caveat: CMS exports tend to overcount because they include unpublished drafts, files removed from pages but never purged from the media library, and old revisions of documents that were re-uploaded.
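The overcount caveat means the export needs a filtering pass. A sketch of that pass, assuming each exported record is a dict: the field names used here (`mime`, `status`, `is_revision`) are placeholders, since each CMS names these differently (WordPress, for instance, stores attachments in `wp_posts` with a `post_mime_type` column).

```python
def filter_cms_pdfs(records):
    """Keep only published, non-revision PDF records from a CMS export.

    Field names (mime, status, is_revision) are illustrative; map them
    to whatever your CMS export actually calls these columns.
    """
    return [
        r for r in records
        if r.get("mime") == "application/pdf"      # drop images, docs, etc.
        and r.get("status") == "published"          # drop unpublished drafts
        and not r.get("is_revision", False)         # drop old revisions
    ]
```

What survives the filter is your CMS-side count; remember it still misses documents hosted on external platforms, as noted above.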
### Method 3: Sitemap walk
Parse your sitemap.xml, follow every URL, count PDFs linked from each page. Faster than a full crawl. Misses anything not in the sitemap (which is often ~30% of the public-facing site).
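The first step of the sitemap walk, extracting page URLs from `sitemap.xml`, is a few lines with the standard library; each returned URL would then be fetched and scanned for PDF links. A minimal sketch:

```python
import xml.etree.ElementTree as ET

# Standard namespace from the sitemaps.org protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]
```

Note that a sitemap index file nests further sitemaps rather than pages, so a production version would recurse into those; and, per the caveat above, anything missing from the sitemap is missing from this count.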
### Method 4: File-system audit
If your IT department has direct access to the web server file system, walk the documents directory and sum page counts. Catches files even if they're not currently linked from any page (which means they may not be public-facing — verify).
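The directory walk itself is standard-library Python; summing page counts would then mean running a PDF tool such as `pdfinfo` (mentioned above) over each returned path. A sketch of the inventory step:

```python
import os


def find_pdfs(root_dir):
    """Walk a documents directory tree and return all PDF file paths."""
    pdfs = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.lower().endswith(".pdf"):  # case-insensitive match
                pdfs.append(os.path.join(dirpath, name))
    return sorted(pdfs)
```

Cross-reference this list against the crawl results: files found here but never linked from any page are the ones to verify before adding to scope.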
### Method 5: Department survey

Email each department head asking for their PDF count, then sum the results. This is the worst method: as the table above shows, department staff radically underestimate. Use it only as a sanity check against a count produced by one of the other methods.
## What to do once you have a number
With a real corpus number in hand, you can:
- Classify by archive eligibility. Typically 20–40% of pages can be moved to a properly labeled archive section, removing them from the remediation scope.
- Identify duplicates. SHA-256 deduplication routinely removes 30–40% of pages. The same financial report might be linked from five different program pages; you only need to remediate it once.
- Prioritize by traffic. Use Google Analytics to find your top 100 most-visited PDFs. These produce 80%+ of accessibility complaints; remediate them first.
- Get accurate quotes. Vendors quoting per-page can give you a real number. Vendors quoting flat rates can be evaluated against actual scope.
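The SHA-256 deduplication step above can be sketched in a few lines. For illustration the file contents are passed in as bytes; in practice you would read each file from disk. Every group of identical files needs only one remediation pass.

```python
import hashlib


def dedupe_by_content(files):
    """Group file paths by the SHA-256 digest of their bytes.

    `files` maps path -> file bytes. Paths in the same group are
    byte-identical copies: remediate one, reuse the result for the rest.
    """
    groups = {}
    for path, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        groups.setdefault(digest, []).append(path)
    return groups
```

The number of groups is the number of unique documents actually in scope; the gap between that and the raw file count is the 30–40% of pages the dedup pass removes.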
## Practical math example
A mid-size city with 312,000 actual pages typically breaks down like this after audit:
- Archive-eligible: ~75,000 pages (24%)
- Duplicates removed: ~95,000 pages (30%)
- Remaining for remediation: ~142,000 pages (46%)
At per-page manual remediation rates ($10/page average), the unaudited corpus would cost $3.1M to remediate. The audited corpus costs $1.4M. The audit is free. The savings exist because you stopped paying to remediate things that didn't need it.
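The arithmetic above, spelled out with the mid-size city's numbers (the dollar figures in the text are these values rounded):

```python
# Corpus breakdown from the audit example above
TOTAL_PAGES = 312_000
ARCHIVE_ELIGIBLE = 75_000   # 24%, moved to a labeled archive section
DUPLICATES = 95_000         # 30%, removed by SHA-256 dedup
RATE_PER_PAGE = 10          # dollars, average manual remediation rate

remaining = TOTAL_PAGES - ARCHIVE_ELIGIBLE - DUPLICATES  # 142,000 pages
unaudited_cost = TOTAL_PAGES * RATE_PER_PAGE             # $3,120,000 (~$3.1M)
audited_cost = remaining * RATE_PER_PAGE                 # $1,420,000 (~$1.4M)
savings = unaudited_cost - audited_cost                  # $1,700,000
```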
The audit-first approach is the single biggest cost lever in ADA Title II compliance. Agencies that audit first spend 50–70% less on remediation than agencies that start remediating before they know their scope. The audit takes a week; the money overspent by skipping it is gone for good.
## Why we offer this audit free
We offer a free corpus audit for California government agencies because the audit is the natural entry point to a working SentraCheck relationship. If after the audit you decide we're the right vendor, we already have your inventory. If you choose another vendor, you walk away with the data anyway. We'd rather you have an accurate scope — even with someone else — than an inflated one we benefit from.