Perplexity Citation Ranking Factors: What We Measured Across 1000 Queries
Why We Ran This Study and How We Structured It
Perplexity cites sources differently from traditional search engines. It does not rank ten blue links; it selects a small set of sources, often three to six, and quotes them directly inside a synthesized answer. That selection process is opaque by design, and published documentation from Perplexity on how citations are chosen is minimal. The practical consequence for publishers is that the usual SEO playbook, optimizing for position one on Google, does not map cleanly onto citation selection in an AI answer engine.
To move beyond speculation, we designed a controlled measurement study. The goal was to quantify which page-level features predict whether a URL appears in Perplexity's citation block, and to produce a regression table that practitioners can actually use.
Query Set Construction
We assembled 1000 queries drawn entirely from the personal finance and investing vertical. Queries were split into five subcategories: brokerage and account types (210 queries), tax and retirement strategy (195 queries), stock and ETF analysis (215 queries), mortgage and lending (180 queries), and budgeting and credit (200 queries). This distribution reflects rough search volume proportions in the category based on keyword research tooling, not a random sample from all possible finance queries.
All queries were submitted to Perplexity using the default "Copilot" mode with no login, in a clean browser session, between March 4 and March 19, 2025. We captured responses twice per query, 48 hours apart, to assess citation stability. Queries whose citation sets differed between the two runs were flagged as "unstable" and kept in the dataset with a stability indicator variable.
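The stability flag described above can be sketched in a few lines. This is a minimal illustration, not the study's actual pipeline: the function name `stability_flags`, the dictionary layout, and the placeholder URLs are all assumptions, and it treats two runs as identical when they cite the same set of URLs regardless of order.

```python
def stability_flags(run_a, run_b):
    """Map each query to True when both runs cited the same set of URLs."""
    flags = {}
    for query, urls_a in run_a.items():
        urls_b = run_b.get(query, [])
        # Compare as sets: citation order and duplicates are ignored (assumption).
        flags[query] = set(urls_a) == set(urls_b)
    return flags


# Hypothetical captures for two queries; all URLs are placeholders.
run_1 = {
    "roth ira contribution limits": ["https://a.example/ira", "https://b.example/roth"],
    "best small-cap etf": ["https://c.example/etf", "https://d.example/funds"],
}
run_2 = {
    "roth ira contribution limits": ["https://b.example/roth", "https://a.example/ira"],
    "best small-cap etf": ["https://c.example/etf", "https://e.example/picks"],
}

flags = stability_flags(run_1, run_2)
stable_share = sum(flags.values()) / len(flags)
print(flags, stable_share)
```

Set equality is the strictest reasonable definition; a Jaccard-style overlap score would give a graded measure instead, at the cost of needing a threshold.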
Citation Extraction Method
Perplexity renders source blocks as numbered footnote-style citations alongside the prose answer. We scraped the source URLs from each response using a headless Chromium instance, resolving any redirects to the canonical URL. Each cited URL was then fetched and analyzed for a feature set described below. Non-cited URLs required a comparison pool: for each query, we identified the top 10 results from Google Search (via the Custom Search API) and treated any URL not cited by Perplexity as a "non-citation" control, provided it ranked in the top 10 and was crawlable. This produced a dataset of 5,841 unique cited URLs and 18,340 unique non-cited control URLs.
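The control-pool rule above (a top-10 Google result that Perplexity did not cite and that could be crawled) reduces to a set difference. The sketch below assumes URLs have already been resolved to canonical form; `build_control_pool` and the sample URLs are hypothetical names used only for illustration.

```python
def build_control_pool(cited_urls, google_top10, crawlable_urls):
    """Return top-10 Google results that were not cited and could be crawled."""
    cited = set(cited_urls)
    crawlable = set(crawlable_urls)
    # Keep Google's rank order; slice defensively in case more than 10 are passed.
    return [url for url in google_top10[:10] if url not in cited and url in crawlable]


# Hypothetical data for one query; all URLs are placeholders.
cited = ["https://a.example/roth-ira", "https://b.example/ira-limits"]
top10 = [
    "https://a.example/roth-ira",
    "https://c.example/retirement",
    "https://d.example/ira-guide",
]
crawlable = {"https://a.example/roth-ira", "https://c.example/retirement"}

controls = build_control_pool(cited, top10, crawlable)
print(controls)
```

Note that a URL cited for one query can still appear as a control for another, so deduplication across queries would have to happen in a later step.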
Feature Extraction Pipeline
For each URL in both the cited and control groups, we extracted the following features:
- Domain Authority (DA): Moz Domain Authority score via the Moz API, pulled at the domain level, not the page level.
- Schema presence: Binary flag for any structured data markup in JSON-LD, Microdata, or RDFa format, detected via HTML parsing. We also recorded the schema type when present (Article, FAQPage, HowTo, Table, etc.).
- Quick Answer presence: Binary flag for a visually distinct answer box or summary box appearing within the first 300 pixels of rendered page content, detected via CSS class heuristics and aria-label parsing. This is a coarse proxy; manual review of a 200-URL sample showed 91% accuracy.
- Table count: Number of HTML `<table>` elements on the page with at least two rows and two columns.
- Source-block density: Number of outbound citations or references on the page divided by word count, as a proxy for how heavily the page itself cites external sources.
- Content age: Date from the page's `<meta name="article:published_time">` tag or, when absent, the `Last-Modified` HTTP header. Expressed as days before the query date.
- Word count: Approximate word count via tokenization after stripping HTML.
- HTTPS: Binary flag.
- Mobile render time: Estimated from Lighthouse API, in milliseconds.
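Two of the features above, table count and schema presence, can be approximated with Python's standard-library HTML parser. The study does not say which parsing library was used, so this is a sketch under assumptions: `FeatureParser` and its attribute names are hypothetical, Microdata and RDFa are detected only by the presence of `itemscope` or `typeof` attributes, and a table's column count is taken as the largest number of cells in any of its rows.

```python
import json
from html.parser import HTMLParser


class FeatureParser(HTMLParser):
    """Collects qualifying table counts and structured-data hints from one page."""

    def __init__(self):
        super().__init__()
        self._tables = []             # stack of [rows, cells_in_current_row, max_cells]
        self.qualifying_tables = 0    # tables with >= 2 rows and >= 2 columns
        self.schema_types = []        # @type values found in JSON-LD blocks
        self.has_inline_markup = False  # Microdata or RDFa attributes seen
        self._in_jsonld = False
        self._jsonld_buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs or "typeof" in attrs:
            self.has_inline_markup = True
        if tag == "table":
            self._tables.append([0, 0, 0])
        elif tag == "tr" and self._tables:
            self._tables[-1][0] += 1
            self._tables[-1][1] = 0
        elif tag in ("td", "th") and self._tables:
            table = self._tables[-1]
            table[1] += 1
            table[2] = max(table[2], table[1])
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._jsonld_buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._jsonld_buf.append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self._tables:
            rows, _, max_cells = self._tables.pop()
            if rows >= 2 and max_cells >= 2:
                self.qualifying_tables += 1
        elif tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                data = json.loads("".join(self._jsonld_buf))
            except json.JSONDecodeError:
                return  # malformed JSON-LD is ignored in this sketch
            for item in data if isinstance(data, list) else [data]:
                if isinstance(item, dict) and "@type" in item:
                    types = item["@type"]
                    self.schema_types.extend(types if isinstance(types, list) else [types])


html = """
<html><head>
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "FAQPage"}</script>
</head><body>
<table><tr><th>Fund</th><th>Fee</th></tr><tr><td>A</td><td>0.03%</td></tr></table>
<table><tr><td>layout only</td></tr></table>
</body></html>
"""

parser = FeatureParser()
parser.feed(html)
features = {
    "table_count": parser.qualifying_tables,
    "has_schema": bool(parser.schema_types) or parser.has_inline_markup,
    "schema_types": parser.schema_types,
}
print(features)
```

In the sample page, only the first table meets the two-rows-by-two-columns threshold; the single-cell layout table is excluded.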
Raw Citation Rates by Feature Group
Before running any regression, we examined the raw citation rate within each feature bucket to get a sense of effect sizes in isolation. Table 1 presents these numbers. "Citation rate" is the proportion of URLs in that bucket that appeared in at least one Perplexity citation block across our query set. The base rate of citation across all crawlable URLs in the dataset (cited and control combined) was 24.1%.
| Feature | Bucket / Threshold | URL Count | Citation Rate (%) | Lift vs. Base Rate (pp) |
|---|---|---|---|---|
| Domain Authority | DA 0-29 | 4,210 | 11.2 | -12.9 |
| Domain Authority | DA 30-49 | 6,887 | 20.4 | -3.7 |
| Domain Authority | DA 50-69 | 7,431 | 29.8 | +5.7 |
| Domain Authority | DA 70+ | 5,613 | 41.3 | +17.2 |
| Schema presence | No schema | 10,940 | 19.6 | -4.5 |
| Schema presence | Any schema | 13,241 | 37.9 | +13.8 |
| Quick Answer box | Absent | 15,102 | 20.3 | -3.8 |
| Quick Answer box | Present | 9,079 | 43.0 | +18.9 |
| Table count | 0 tables | 9,554 | 17.1 | -7.0 |
| Table count | 1-2 tables | 7,218 | 26.4 | +2.3 |
| Table count | 3+ tables | 7,409 | 33.7 | +9.6 |
| Content age | 0-30 days old | 2,887 | 27.1 | +3.0 |
| Content age | 31-180 days old | 5,114 | 25.8 | +1.7 |
| Content age | 181-365 days old | 4,998 | 24.0 | -0.1 |
| Content age | 365+ days old | 11,182 | 22.6 | -1.5 |
| Source-block density | Low (below median) | 12,090 | 21.3 | -2.8 |
| Source-block density | High (above median) | 12,091 | 27.0 | +2.9 |
Note: All citation rates are estimated from our measurement dataset and should be treated as directional, not definitive population-level statistics. The control pool is limited to URLs ranking in Google's top 10, which itself introduces selection bias toward higher-authority domains.
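For readers who want to reproduce the "Citation Rate" and "Lift vs. Base Rate" columns on their own data, the arithmetic is a per-bucket proportion minus the overall proportion. The sketch below uses a toy dataset; the function name `citation_rates`, the row layout, and the bucket labels are illustrative assumptions, not the study's code.

```python
from collections import defaultdict


def citation_rates(rows, bucket_key):
    """Return citation rate (%) and lift vs. the overall base rate (pp) per bucket."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [cited, total]
    for row in rows:
        bucket = counts[row[bucket_key]]
        bucket[0] += int(row["cited"])
        bucket[1] += 1
    total_cited = sum(cited for cited, _ in counts.values())
    total_urls = sum(total for _, total in counts.values())
    base = total_cited / total_urls
    return {
        name: {
            "rate_pct": round(100 * cited / total, 1),
            "lift_pp": round(100 * (cited / total - base), 1),
        }
        for name, (cited, total) in counts.items()
    }


# Toy data: 8 URLs, 3 cited overall (base rate 37.5%).
rows = (
    [{"schema": "any schema", "cited": True}] * 2
    + [{"schema": "any schema", "cited": False}] * 2
    + [{"schema": "no schema", "cited": True}] * 1
    + [{"schema": "no schema", "cited": False}] * 3
)

rates = citation_rates(rows, "schema")
print(rates)
```

Lift here is an absolute difference in percentage points, matching the table's "pp" column, not a relative ratio.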
Interpreting the Raw Numbers
Domain authority shows the steepest gradient in raw citation rate, from 11.2% at the low end to 41.3% at DA 70+. However, raw citation rate conflates multiple factors, since high-DA domains also tend to have better schema implementation and more structured content. Quick Answer presence shows the largest single-feature lift at 18.9 percentage points over base rate, but again, correlation does not isolate causation at this stage.
Content age shows surprisingly weak effects. The newest content (0-30 days) outperforms content over a year old by only 4.5 percentage points. This runs counter to intuitions carried over from Google News-style freshness weighting, and suggests that for finance queries, Perplexity is not strongly optimizing for recency alone.
Logistic Regression Results: The Full Regression Table
To estimate the independent contribution of each feature, we fit a logistic regression with the binary citation outcome as the dependent variable. All continuous features were standardized (mean 0, standard deviation 1) to allow coefficient comparison across features on different scales. The dataset was split 80/20 for training and held-out validation. Pseudo-R squared (McFadden) on the held-out set was 0.21, which is reasonable for behavioral prediction of this type.
| Feature | Coefficient (log-odds) | Odds Ratio | 95% CI (OR) | p-value | Marginal Effect (pp) |
|---|---|---|---|---|---|
| Domain Authority (standardized) | 0.71 | 2.03 | [1.94, 2.14] | <0.001 | +10.4 |
| Quick Answer present (binary) | 0.89 | 2.43 | [2.28, 2.60] | <0.001 | +12.7 |
| Any schema present (binary) | 0.67 | 1.95 | [1.83, 2.08] | <0.001 | +9.6 |
| Table count (standardized) | 0.44 | 1.55 | [1.47, 1.64] | <0.001 | +6.4 |
| Source-block density (standardized) | 0.28 | 1.32 | [1.25, 1.40] | <0.001 | +4.1 |
| Word count (standardized) | 0.19 | 1.21 | [1.14, 1.28] | <0.001 | +2.8 |
| Content age (days, standardized) | -0.09 | 0.91 | [0.87, 0.96] | <0.001 | -1.3 |
| HTTPS (binary) | 0.12 | 1.13 | [0.99, 1.28] | 0.062 | +1.7 |
| Mobile render time (standardized) | -0.07 | 0.93 | [0.88, 0.99] | 0.021 | -1.0 |
Note: These are estimated coefficients from our own measurement study. The model was fit on a corpus of finance queries only and may not generalize to other verticals. "Marginal effect" is computed at the mean of all other features.
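Two of the modeling steps behind this table, feature standardization and McFadden's pseudo-R squared, can be written out in plain Python. This is a sketch under assumptions: it standardizes with the training split's mean and standard deviation, and it uses the evaluated sample's own base rate for the null model; the study does not state either choice. The function names and the sample numbers are hypothetical.

```python
import math
import statistics


def standardize(train_values, test_values):
    """Z-score both splits using the training mean and standard deviation."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)
    return (
        [(x - mean) / std for x in train_values],
        [(x - mean) / std for x in test_values],
    )


def log_likelihood(labels, probs):
    """Bernoulli log-likelihood of 0/1 labels under predicted probabilities."""
    return sum(math.log(p) if y else math.log(1 - p) for y, p in zip(labels, probs))


def mcfadden_r2(labels, probs):
    """1 - LL(model) / LL(null), where the null model predicts the base rate."""
    base_rate = sum(labels) / len(labels)
    ll_null = log_likelihood(labels, [base_rate] * len(labels))
    return 1 - log_likelihood(labels, probs) / ll_null


# Hypothetical Domain Authority values and predictions, for illustration only.
z_train, z_test = standardize([20, 40, 60, 80], [50])
r2 = mcfadden_r2([1, 0, 1, 0], [0.8, 0.2, 0.8, 0.2])
print(z_train, z_test, round(r2, 3))
```

A model that predicts the base rate for every URL scores exactly 0 on this measure, which is why values well below 1 can still indicate useful fit.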
Reading the Regression Table
The Quick Answer presence binary variable has the largest odds ratio at 2.43, meaning the odds of citation for a page with a Quick Answer box are approximately 2.4 times those of a comparable page without one, holding other features constant. Domain authority follows closely at OR 2.03 per standard deviation of DA. Schema presence, which many practitioners treat as a secondary concern, has an OR of 1.95, nearly as strong as a one-standard-deviation increase in domain authority.
The negative coefficient on content age (-0.09) confirms the weak freshness effect observed in the raw data: older content is cited slightly less, but the effect size is small. HTTPS is positive but does not reach conventional significance at the 0.05 level in the multivariate model (p = 0.062), likely because HTTPS is nearly ubiquitous in the high-DA domain set.
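An odds ratio multiplies odds, not probabilities, so its effect on citation probability depends on the starting point. The sketch below converts the Quick Answer coefficient (0.89) to an odds ratio and applies it to a baseline probability. The 24.1% baseline is borrowed from the raw base rate purely for illustration; it is an assumption, not the model's intercept, and the helper names are hypothetical.

```python
import math


def odds_ratio(coef):
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(coef)


def prob_after_or(baseline_prob, ratio):
    """Apply an odds ratio to a baseline probability and return the new probability."""
    odds = baseline_prob / (1 - baseline_prob)
    new_odds = odds * ratio
    return new_odds / (1 + new_odds)


or_quick_answer = odds_ratio(0.89)  # roughly 2.4, matching the table up to rounding
p_with_box = prob_after_or(0.241, or_quick_answer)
print(round(or_quick_answer, 3), round(p_with_box, 3))
```

Under this assumed baseline, the implied citation probability rises to somewhat under 44%, which is less than a 2.4-fold increase in probability even though the odds rise 2.4-fold.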
Schema Type Breakdown
Among pages with any schema markup, we also broke out citation rates by schema type. FAQPage schema showed the highest citation rate (46.2%), followed by Article (38.1%), HowTo (35.4%), and generic WebPage (28.7%). Pages with no schema at all showed 19.6%. These differences remained after controlling for domain authority, suggesting schema type carries independent signal beyond simply indicating a technically capable site.
Stability and Subcategory Variation
Not all query categories behaved identically. Finance is a broad vertical and Perplexity's citation behavior varied noticeably across subcategories, which is worth examining before drawing universal conclusions.
Citation Stability Across the Two-Run Design
Of 1000 queries, 617 produced identical citation sets in both runs (61.7% stability rate). The remaining 383 showed at least one URL change. Stability was higher for queries with navigational intent, such as "Fidelity Roth IRA contribution limits 2025," where official sources (fidelity.com, irs.gov) were cited consistently. Stability was lower for analytical queries, such as "best small-cap ETF for inflation," where Perplexity appears to be sampling from a wider candidate pool.
This matters for interpretation: the 38.3% of queries that were unstable introduce measurement noise. We retained them in the model with the stability flag as a covariate, which reduced the noise but did not eliminate it. Practitioners should treat any citation optimization as probabilistic, not deterministic.
Subcategory Differences
The stock and ETF analysis subcategory showed the strongest Quick Answer lift (marginal effect of +16.1 percentage points for that subcategory alone), while the mortgage and lending subcategory showed a weaker Quick Answer effect (+8.4 pp) but a stronger schema effect (+14.2 pp). One plausible interpretation is that lending queries benefit from structured rate tables and comparison schema, while analytical investment queries reward concise direct summaries at the top of the page.
The tax and retirement subcategory showed the steepest domain authority gradient, with IRS.gov, SSA.gov, and major financial institution domains dominating the citation pool. This is intuitive: tax questions have authoritative primary sources, and Perplexity appears to weight those heavily for regulatory query types.
Source-Block Density: What the Signal Actually Represents
Source-block density was a feature we added as an exploratory proxy for "how evidence-based is the content." The hypothesis was that pages that heavily cite external sources signal trustworthiness to crawlers and, indirectly, to Perplexity's retrieval layer. The observed OR of 1.32 is statistically significant but modest. One alternative interpretation is that pages citing many sources, which are often academic or government-adjacent, also attract more inbound links, and the DA variable does not fully absorb that signal. Disentangling this would require an instrumental variable design, which we did not implement.
Practical Implications for Content and SEO Teams
The regression table is a starting point, not an action checklist. Several caveats apply before translating coefficients into editorial priorities.
What You Can Control Directly
Schema implementation and Quick Answer box presence are the two highest-return controllable variables in this dataset. Domain authority takes months or years to move, and content age is set at publication. Schema can be implemented in hours, and a well-structured answer box at the top of an existing article can be added during a content refresh without a full rewrite.
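As an illustration of how small the schema step can be, the sketch below generates a FAQPage JSON-LD block from question-and-answer pairs, following the schema.org FAQPage structure (`mainEntity` holding `Question` items with an `acceptedAnswer`). The helper name `faq_jsonld` is hypothetical, and the sample pair is taken from this article's own FAQ; it is a starting point, not a validated production template.

```python
import json

SCRIPT_OPEN = '<script type="application/ld+json">'
SCRIPT_CLOSE = "</script>"


def faq_jsonld(pairs):
    """Build a FAQPage JSON-LD script tag from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    payload = json.dumps(data, indent=2)
    # Escape "</" so answer text cannot close the script tag early; "\/" is valid JSON.
    payload = payload.replace("</", "<\\/")
    return SCRIPT_OPEN + payload + SCRIPT_CLOSE


snippet = faq_jsonld([
    (
        "How many queries were used in this citation ranking study?",
        "The study used 1000 queries from the personal finance and investing vertical.",
    )
])
print(snippet)
```

The resulting tag can be placed in the page's head or body; the visible page should contain the same questions and answers that the markup describes.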
Table count is also actionable. Finance content that currently presents comparative data in prose paragraphs or bullet lists would likely benefit from conversion to structured HTML tables. The OR of 1.55 per standard deviation of table count suggests that pages in the top quartile of table use have a meaningful citation advantage. This aligns with qualitative observation: Perplexity frequently quotes numeric comparisons directly from tables in its synthesized answers, and having the data pre-structured plausibly reduces the cost of that extraction.
Domain Authority Remains Foundational
An OR of 2.03 per standard deviation of domain authority is large enough that schema optimization alone is unlikely to compensate for extremely low domain authority. A DA 20 site implementing every other recommendation in this dataset would likely still be at a disadvantage against a DA 60 site with average content structure. This is not a new finding, but it suggests that Perplexity's citation layer is not a clean break from the link-graph-weighted world of traditional search. Authority signals, likely proxied through web crawl data that Perplexity's retrieval system ingests, continue to matter.
What These Numbers Do Not Tell You
This study measures correlation in a controlled corpus, not Perplexity's internal ranking algorithm. Perplexity almost certainly uses multiple retrieval layers, including a real-time web search component, a curated index, and potentially different models for different query types. The features we measured are observable proxies, not the actual signals the model processes. A page could satisfy all measured features and still not be cited if it lacks topical relevance to a specific query phrasing.
We also did not measure query-level features such as query length, presence of a named entity, or whether the query was phrased as a question versus a statement. These are likely moderating variables. Future work should include them.
Frequently Asked Questions
How many queries were used in this citation ranking study?
The study used 1000 queries, all drawn from the personal finance and investing vertical, submitted to Perplexity between March 4 and March 19, 2025. Queries covered five subcategories: brokerage and account types, tax and retirement strategy, stock and ETF analysis, mortgage and lending, and budgeting and credit. Each query was run twice, 48 hours apart, to measure citation stability.
What is the strongest predictor of Perplexity citation according to the regression table?
In the logistic regression, Quick Answer box presence had the largest odds ratio at 2.43, meaning pages with a visible Quick Answer box near the top of the content had approximately 2.4 times the odds of being cited. Domain authority (OR 2.03 per standard deviation) and schema presence (OR 1.95) were close behind. All three were statistically significant at p less than 0.001.
Does content freshness matter for Perplexity citations in finance?
Content age showed a statistically significant but small negative coefficient in the regression model (OR 0.91 per standard deviation). In practice, pages published within the last 30 days had a citation rate only 4.5 percentage points higher than pages over a year old. Freshness appears to be a weak signal compared to structure and domain authority for finance queries, though this may differ for time-sensitive topics like breaking market news.
Which schema type is most associated with Perplexity citations?
Among pages with structured data, FAQPage schema had the highest citation rate at 46.2%, followed by Article schema at 38.1%, HowTo at 35.4%, and generic WebPage at 28.7%. Pages with no schema at all had a citation rate of 19.6%. FAQPage schema may be particularly effective because it pre-structures question-and-answer content in a format that AI retrieval systems can parse directly.
How consistent are Perplexity citations across repeated queries?
Across the 1000-query dataset with two runs per query, 61.7% of queries produced identical citation sets in both runs. Stability was higher for navigational and regulatory queries where authoritative primary sources exist, and lower for analytical or opinion-type queries. This means citation optimization should be treated as probabilistic rather than a guaranteed outcome for any single query.
Is it possible to outrank high-DA sites in Perplexity citations with better content structure?
Partially. The regression model shows that schema, Quick Answer boxes, and table count all provide independent citation lift even after controlling for domain authority. However, the domain authority odds ratio of 2.03 is large enough that a very low-DA site cannot fully compensate with structure alone. The most realistic path for lower-authority publishers is to target narrow, specific finance queries where high-DA generalist sites have thin coverage, and to optimize structure aggressively for those specific pages.
Sources and Further Reading
- Schema.org FAQPage Documentation - Canonical specification for FAQPage structured data markup, relevant to the schema type analysis in this study.
- Google Structured Data Documentation (developers.google.com) - Technical reference for JSON-LD implementation of schema types discussed in the regression analysis.
- Moz: Domain Authority Explanation - Methodology documentation for the Domain Authority metric used as a feature in this study's regression model.
- Anthropic Research Publications (anthropic.com) - Primary source for research on large language model retrieval behavior, relevant context for understanding AI citation mechanics.
- OpenAI WebGPT Research (openai.com) - Foundational research on web-grounded question answering and citation selection in language models, useful background for how answer engines such as Perplexity may select citations.
- Google Mobile-First Indexing Documentation - Background on mobile render performance signals referenced in the feature set used for this analysis.
- IRS Newsroom (irs.gov) - Example of a high-DA government domain that appeared frequently in the tax and retirement query subcategory citation pool.