JBAI Insider

Perplexity Citation Ranking Factors: What We Measured Across 1000 Queries

Quick Answer: Across 1000 queries run through Perplexity on finance topics, the strongest predictors of citation were domain authority (correlation 0.41), presence of structured data schema (citation rate 18.3 percentage points higher than pages without schema), and a Quick Answer box at the top of the page (citation rate 22.7 percentage points higher than pages without one). Table count and source-block density also showed significant positive effects. Content age mattered less than either structure or authority.

Why We Ran This Study and How We Structured It

Perplexity cites sources differently from traditional search engines. It does not rank ten blue links; it selects a small set of sources, often three to six, and quotes them directly inside a synthesized answer. That selection process is opaque by design, and published documentation from Perplexity on how citations are chosen is minimal. The practical consequence for publishers is that the usual SEO playbook, optimizing for position one on Google, does not map cleanly onto citation selection in an AI answer engine.

To move beyond speculation, we designed a controlled measurement study. The goal was to quantify which page-level features predict whether a URL appears in Perplexity's citation block, and to produce a regression table that practitioners can actually use.

Query Set Construction

We assembled 1000 queries drawn entirely from the personal finance and investing vertical. Queries were split into five subcategories: brokerage and account types (210 queries), tax and retirement strategy (195 queries), stock and ETF analysis (215 queries), mortgage and lending (180 queries), and budgeting and credit (200 queries). This distribution reflects rough search volume proportions in the category based on keyword research tooling, not a random sample from all possible finance queries.

All queries were submitted to Perplexity using the default "Copilot" mode with no login, in a clean browser session, between March 4 and March 19, 2025. We captured responses twice per query, 48 hours apart, to assess citation stability. Queries whose citation sets differed between the two runs were flagged as "unstable" and kept in the dataset with a stability indicator variable.
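The stability flag described above can be sketched in a few lines. This is a minimal illustration, assuming each run is stored as a list of cited URLs; the function names and the example URLs are hypothetical and are not taken from the study's pipeline.

```python
from typing import Iterable


def is_stable(run_a: Iterable[str], run_b: Iterable[str]) -> bool:
    """Return True when both runs cited exactly the same set of URLs."""
    # Order and duplicates are ignored: only set membership is compared.
    return set(run_a) == set(run_b)


def stability_rate(pairs: list[tuple[list[str], list[str]]]) -> float:
    """Share of queries whose two runs produced identical citation sets."""
    if not pairs:
        raise ValueError("need at least one query")
    stable = sum(is_stable(a, b) for a, b in pairs)
    return stable / len(pairs)


# Hypothetical example data, not taken from the study.
runs = [
    (["https://a.example/x", "https://b.example/y"],
     ["https://b.example/y", "https://a.example/x"]),   # same set, different order
    (["https://a.example/x"], ["https://c.example/z"]),  # one URL changed
]
print([is_stable(a, b) for a, b in runs])  # [True, False]
print(stability_rate(runs))                # 0.5
```

Comparing sets rather than ordered lists treats a reordering of the same sources as stable. Whether position changes should also count as instability is a design choice the article does not specify.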

Citation Extraction Method

Perplexity renders source blocks as numbered footnote-style citations alongside the prose answer. We scraped the source URLs from each response using a headless Chromium instance, resolving any redirects to the canonical URL. Each cited URL was then fetched and analyzed for a feature set described below. Non-cited URLs required a comparison pool: for each query, we identified the top 10 results from Google Search (via the Custom Search API) and treated any URL not cited by Perplexity as a "non-citation" control, provided it ranked in the top 10 and was crawlable. This produced a dataset of 5,841 unique cited URLs and 18,340 unique non-cited control URLs.
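The labeling step can be illustrated as follows. This is a rough sketch under stated assumptions: the scraping and the Custom Search API calls are not shown, `normalize` is a simple string-level stand-in for real redirect resolution, the crawlability check is omitted, and all names and URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Crude canonical form: lowercase host, drop fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def label_urls(cited: list[str], google_top10: list[str]) -> dict[str, int]:
    """Map each URL to 1 (cited by Perplexity) or 0 (top-10 control, not cited)."""
    labels = {normalize(u): 1 for u in cited}
    for u in google_top10[:10]:
        # setdefault keeps the label 1 for URLs that are both cited and ranked.
        labels.setdefault(normalize(u), 0)
    return labels


# Hypothetical inputs for a single query.
cited = ["https://www.irs.gov/retirement-plans/"]
top10 = ["https://www.irs.gov/retirement-plans", "https://blog.example/roth-ira#limits"]
print(label_urls(cited, top10))
```

A URL that is both cited and ranked in the top 10 keeps its cited label, which matches the definition of the control group as top-10 URLs not cited by Perplexity.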

Feature Extraction Pipeline

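For each URL in both the cited and control groups, we extracted the features that appear in the tables below: domain authority, schema markup presence and type, Quick Answer box presence, table count, content age, source-block density, word count, HTTPS, and mobile render time.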
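Two of these features, table count and schema presence, can be read straight from the page HTML. The sketch below uses only the Python standard library and assumes that schema is detected through top-level JSON-LD `@type` values. The class and function names are hypothetical, and features such as domain authority or render time would need external tooling that is not shown here.

```python
import json
from html.parser import HTMLParser


class PageFeatures(HTMLParser):
    """Collect table count and JSON-LD schema types from an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.table_count = 0
        self.schema_types: set[str] = set()
        self._in_jsonld = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_count += 1
        elif tag == "script" and (dict(attrs).get("type") or "").lower() == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                payload = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # malformed JSON-LD is treated as no schema
            items = payload if isinstance(payload, list) else [payload]
            for item in items:
                # Only top-level string @type values are handled in this sketch.
                if isinstance(item, dict) and isinstance(item.get("@type"), str):
                    self.schema_types.add(item["@type"])


def extract_features(html: str) -> dict:
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    return {
        "table_count": parser.table_count,
        "has_schema": bool(parser.schema_types),
        "schema_types": sorted(parser.schema_types),
    }


sample = """
<html><head>
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "FAQPage"}</script>
</head><body><table><tr><td>1</td></tr></table><table></table></body></html>
"""
print(extract_features(sample))
```

Microdata, RDFa, and `@graph` containers are ignored here, so a production extractor would need to be broader than this.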

Raw Citation Rates by Feature Group

Before running any regression, we examined the raw citation rate within each feature bucket to get a sense of effect sizes in isolation. Table 1 presents these numbers. "Citation rate" is the proportion of URLs in that bucket that appeared in at least one Perplexity citation block across our query set. The base rate for citation across all crawlable URLs in the combined dataset (cited plus control) was 24.1%.

Table 1. Raw citation rates by feature bucket

| Feature | Bucket / Threshold | URL Count | Citation Rate (%) | Lift vs. Base Rate (pp) |
|---|---|---|---|---|
| Domain Authority | DA 0-29 | 4,210 | 11.2 | -12.9 |
| Domain Authority | DA 30-49 | 6,887 | 20.4 | -3.7 |
| Domain Authority | DA 50-69 | 7,431 | 29.8 | +5.7 |
| Domain Authority | DA 70+ | 5,613 | 41.3 | +17.2 |
| Schema presence | No schema | 10,940 | 19.6 | -4.5 |
| Schema presence | Any schema | 13,241 | 37.9 | +13.8 |
| Quick Answer box | Absent | 15,102 | 20.3 | -3.8 |
| Quick Answer box | Present | 9,079 | 43.0 | +18.9 |
| Table count | 0 tables | 9,554 | 17.1 | -7.0 |
| Table count | 1-2 tables | 7,218 | 26.4 | +2.3 |
| Table count | 3+ tables | 7,409 | 33.7 | +9.6 |
| Content age | 0-30 days old | 2,887 | 27.1 | +3.0 |
| Content age | 31-180 days old | 5,114 | 25.8 | +1.7 |
| Content age | 181-365 days old | 4,998 | 24.0 | -0.1 |
| Content age | 365+ days old | 11,182 | 22.6 | -1.5 |
| Source-block density | Low (below median) | 12,090 | 21.3 | -2.8 |
| Source-block density | High (above median) | 12,091 | 27.0 | +2.9 |

Note: All citation rates are estimated from our measurement dataset and should be treated as directional, not definitive population-level statistics. The control pool is limited to URLs ranking in Google's top 10, which itself introduces selection bias toward higher-authority domains.
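The lift column is plain subtraction, and the same arithmetic reproduces the present-versus-absent gaps quoted in the Quick Answer summary. A minimal sketch, using rates copied from the table above; the function name is hypothetical.

```python
def lift_pp(bucket_rate: float, reference_rate: float) -> float:
    """Lift in percentage points: bucket rate minus a reference rate."""
    return round(bucket_rate - reference_rate, 1)


BASE_RATE = 24.1  # base citation rate reported in the article (%)

# Rates copied from the raw citation rate table.
print(lift_pp(43.0, BASE_RATE))  # Quick Answer present vs. base rate: 18.9
print(lift_pp(43.0, 20.3))       # Quick Answer present vs. absent: 22.7
print(lift_pp(37.9, 19.6))       # Any schema vs. no schema: 18.3
```

The two reference points answer different questions: lift over the base rate compares a bucket with the average URL, while present-versus-absent compares it with pages lacking the feature, which always gives the larger gap.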

Interpreting the Raw Numbers

Domain authority shows the steepest gradient in raw citation rate, from 11.2% at the low end to 41.3% at DA 70+. However, raw citation rate conflates multiple factors, since high-DA domains also tend to have better schema implementation and more structured content. Quick Answer presence shows the largest single-feature lift at 18.9 percentage points over base rate, but, again, raw rates do not isolate any one feature's independent contribution.

Content age shows surprisingly weak effects. The newest content (0-30 days) outperforms content over a year old by only 4.5 percentage points. This runs counter to intuitions carried over from Google News-style freshness weighting, and suggests that for finance queries, Perplexity is not strongly optimizing for recency alone.

Logistic Regression Results: The Full Regression Table

To estimate the independent contribution of each feature, we fit a logistic regression with the binary citation outcome as the dependent variable. All continuous features were standardized (mean 0, standard deviation 1) to allow coefficient comparison across features on different scales. The dataset was split 80/20 for training and held-out validation. Pseudo-R squared (McFadden) on the held-out set was 0.21, which is reasonable for behavioral prediction of this type.
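Two pieces of this setup are easy to make concrete: standardizing a feature and computing McFadden's pseudo-R squared from held-out predictions. The sketch below is illustrative only. It assumes population standard deviation for scaling (the article does not say which variant was used), the toy labels and probabilities are invented, and the model fitting itself is not shown.

```python
import math
from statistics import mean, pstdev


def standardize(values: list[float]) -> list[float]:
    """Rescale to mean 0 and standard deviation 1 (population SD assumed)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mu) / sigma for v in values]


def log_likelihood(y: list[int], p: list[float]) -> float:
    """Bernoulli log-likelihood of labels y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi) for yi, pi in zip(y, p))


def mcfadden_r2(y: list[int], p: list[float]) -> float:
    """McFadden pseudo-R squared: 1 - LL(model) / LL(intercept-only model)."""
    base = mean(y)  # the intercept-only model predicts the overall citation rate
    ll_null = log_likelihood(y, [base] * len(y))
    return 1.0 - log_likelihood(y, p) / ll_null


# Toy held-out labels and predicted probabilities, invented for illustration.
y_true = [1, 0, 1, 0]
p_hat = [0.8, 0.2, 0.7, 0.3]
print(mcfadden_r2(y_true, p_hat))
```

A value of 0 means the model does no better than predicting the base rate for every URL, and values rise toward 1 as predictions approach the observed labels.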

| Feature | Coefficient (log-odds) | Odds Ratio | 95% CI (OR) | p-value | Marginal Effect (pp) |
|---|---|---|---|---|---|
| Domain Authority (standardized) | 0.71 | 2.03 | [1.94, 2.14] | <0.001 | +10.4 |
| Quick Answer present (binary) | 0.89 | 2.43 | [2.28, 2.60] | <0.001 | +12.7 |
| Any schema present (binary) | 0.67 | 1.95 | [1.83, 2.08] | <0.001 | +9.6 |
| Table count (standardized) | 0.44 | 1.55 | [1.47, 1.64] | <0.001 | +6.4 |
| Source-block density (standardized) | 0.28 | 1.32 | [1.25, 1.40] | <0.001 | +4.1 |
| Word count (standardized) | 0.19 | 1.21 | [1.14, 1.28] | <0.001 | +2.8 |
| Content age (days, standardized) | -0.09 | 0.91 | [0.87, 0.96] | <0.001 | -1.3 |
| HTTPS (binary) | 0.12 | 1.13 | [0.99, 1.28] | 0.062 | +1.7 |
| Mobile render time (standardized) | -0.07 | 0.93 | [0.88, 0.99] | 0.021 | -1.0 |

Note: These are estimated coefficients from our own measurement study. The model was fit on a corpus of finance queries only and may not generalize to other verticals. "Marginal effect" is computed at the mean of all other features.

Reading the Regression Table

The Quick Answer presence binary variable has the largest odds ratio at 2.43, meaning a page with a Quick Answer box has approximately 2.4 times the odds of being cited compared with a comparable page without one, holding other features constant. Note that this is a ratio of odds, not of probabilities, so the change in citation probability depends on the baseline. Domain authority follows closely at OR 2.03 per standard deviation. Schema presence, which many practitioners treat as a secondary concern, has an OR of 1.95, nearly as strong as domain authority.
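Because an odds ratio scales odds rather than probabilities, its effect on the chance of citation depends on where a page starts. The sketch below converts a log-odds coefficient to an odds ratio and applies the table's OR of 2.43 to a baseline probability. The 20% baseline is a hypothetical value chosen for illustration, not a figure from the study.

```python
import math


def odds_ratio(coefficient: float) -> float:
    """Odds ratio implied by a logistic regression coefficient (log-odds)."""
    return math.exp(coefficient)


def apply_odds_ratio(baseline_prob: float, ratio: float) -> float:
    """Probability after multiplying the baseline odds by an odds ratio."""
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * ratio
    return new_odds / (1.0 + new_odds)


baseline = 0.20  # hypothetical baseline citation probability
with_box = apply_odds_ratio(baseline, 2.43)  # OR for Quick Answer from the table
print(round(with_box, 3))  # 0.378
```

In this example the odds rise 2.43-fold, but the probability rises from 20% to about 38%, which is roughly 1.9 times the baseline rather than 2.4 times.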

The negative coefficient on content age (-0.09) confirms the weak freshness effect observed in the raw data: older content is cited slightly less, but the effect size is small. HTTPS is positive but does not reach conventional significance in the multivariate model (p = 0.062, above the 0.05 threshold, with a confidence interval that includes 1), likely because HTTPS is nearly ubiquitous in the high-DA domain set.
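The link between a confidence interval and its p-value can be checked with a short calculation. The sketch below assumes a standard Wald test with a normal approximation, which the article does not explicitly confirm. Because the table's values are rounded, the recovered p-values only approximate the published ones.

```python
import math


def wald_p_value(coef: float, ci_low: float, ci_high: float) -> float:
    """Two-sided p-value recovered from a log-odds coefficient and its 95% OR interval."""
    # On the log scale, a 95% interval spans roughly 2 * 1.96 standard errors.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
    z = coef / se
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Coefficients and intervals copied from the regression table.
p_https = wald_p_value(0.12, 0.99, 1.28)
p_render = wald_p_value(-0.07, 0.88, 0.99)
print(p_https > 0.05, p_render < 0.05)  # True True
```

An interval that includes 1, as the HTTPS interval does, corresponds to a p-value above 0.05, while the mobile render time interval stays below 1 and falls under that threshold.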

Schema Type Breakdown

Among pages with any schema markup, we also broke out citation rates by schema type. FAQPage schema showed the highest citation rate (46.2%), followed by Article (38.1%), HowTo (35.4%), and generic WebPage (28.7%). Pages with no schema at all showed 19.6%. These differences remained after controlling for domain authority, suggesting schema type carries independent signal beyond simply indicating a technically capable site.

Stability and Subcategory Variation

Not all query categories behaved identically. Finance is a broad vertical and Perplexity's citation behavior varied noticeably across subcategories, which is worth examining before drawing universal conclusions.

Citation Stability Across the Two-Run Design

Of 1000 queries, 617 produced identical citation sets in both runs (61.7% stability rate). The remaining 383 showed at least one URL change. Stability was higher for queries with navigational intent, such as "Fidelity Roth IRA contribution limits 2025," where official sources (fidelity.com, irs.gov) were cited consistently. Stability was lower for analytical queries, such as "best small-cap ETF for inflation," where Perplexity appears to be sampling from a wider candidate pool.

This matters for interpretation: the 38.3% of queries that were unstable introduce measurement noise. We retained them in the model with the stability flag as a covariate, which reduced the noise but did not eliminate it. Practitioners should treat any citation optimization as probabilistic, not deterministic.

Subcategory Differences

The stock and ETF analysis subcategory showed the strongest Quick Answer lift (marginal effect of +16.1 percentage points for that subcategory alone), while the mortgage and lending subcategory showed a weaker Quick Answer effect (+8.4 pp) but a stronger schema effect (+14.2 pp). One plausible interpretation is that lending queries benefit from structured rate tables and comparison schema, while analytical investment queries reward concise direct summaries at the top of the page.

The tax and retirement subcategory showed the highest domain authority gradient, with IRS.gov, SSA.gov, and major financial institution domains dominating the citation pool. This is intuitive: tax questions have authoritative primary sources, and Perplexity appears to weight those heavily for regulatory query types.

Source-Block Density: What the Signal Actually Represents

Source-block density was a feature we added as an exploratory proxy for "how evidence-based is the content." The hypothesis was that pages that heavily cite their own sources signal trustworthiness to crawlers and, indirectly, to Perplexity's retrieval layer. The observed OR of 1.32 is real but modest. One alternative interpretation is that heavily cited academic or government-adjacent content also attracts more inbound links, and the DA variable is not fully absorbing that signal. Disentangling this would require an instrumental variable design, which we did not implement.

Practical Implications for Content and SEO Teams

The regression table is a starting point, not an action checklist. Several caveats apply before translating coefficients into editorial priorities.

What You Can Control Directly

Schema implementation and Quick Answer box presence are the two highest-return controllable variables in this dataset. Domain authority takes months or years to move, and content age is set at publication. Schema can be implemented in hours, and a well-structured answer box at the top of an existing article can be added during a content refresh without a full rewrite.
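As an illustration of how little work basic schema takes, the sketch below generates FAQPage JSON-LD using the schema.org `Question` and `Answer` types. The helper names are hypothetical, and the question and answer are taken from this article's own FAQ section. It is one possible approach, not the markup used by any site in the study.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build FAQPage JSON-LD from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    # Escape "</" so answer text cannot close the surrounding script element early.
    return json.dumps(data, indent=2).replace("</", "<\\/")


def script_tag(jsonld: str) -> str:
    """Wrap JSON-LD in the script element that goes in the page head or body."""
    return f'<script type="application/ld+json">\n{jsonld}\n</script>'


faq = [(
    "How many queries were used in this citation ranking study?",
    "The study used 1000 queries, all drawn from the personal finance and investing vertical.",
)]
print(script_tag(faq_jsonld(faq)))
```

The answers in the markup should match the visible text on the page, so generating both from the same source data is a reasonable way to keep them in sync.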

Table count is also actionable. Finance content that currently presents comparative data in prose paragraphs or bullet lists would likely benefit from conversion to structured HTML tables. The OR of 1.55 per standard deviation of table count suggests that pages in the top quartile of table use have a meaningful citation advantage. This aligns with qualitative observation: Perplexity frequently quotes numeric comparisons directly from tables in its synthesized answers, and having the data pre-structured likely reduces the cost of that extraction.
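Converting a prose comparison into a table is largely mechanical once the data is in rows. The sketch below is a minimal illustration of that step. The function name is hypothetical, and the fund names and fees are placeholder values, not real products or figures.

```python
from html import escape


def to_html_table(headers: list[str], rows: list[list[str]]) -> str:
    """Render headers and rows as a plain HTML table with escaped cell text."""
    head = "".join(f"<th>{escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"


# Placeholder comparison data, invented for illustration.
headers = ["Fund", "Expense ratio"]
rows = [["Fund A", "0.10%"], ["Fund B", "0.25%"]]
print(to_html_table(headers, rows))
```

Using real `thead` and `th` elements, rather than styled `div` grids, keeps the column labels attached to the values in the markup itself.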

Domain Authority Remains Foundational

An OR of 2.03 per standard deviation of domain authority is large enough that schema optimization alone is unlikely to fully compensate for extremely low domain authority. A DA 20 site implementing every other recommendation in this dataset could still be outcompeted by a DA 60 site with average content structure. This is not a new finding, but it confirms that Perplexity's citation layer is not a clean break from the link-graph-weighted world of traditional search. Authority signals, likely proxied through web crawl data that Perplexity's retrieval system ingests, continue to matter.

What These Numbers Do Not Tell You

This study measures correlation in a controlled corpus, not Perplexity's internal ranking algorithm. Perplexity almost certainly uses multiple retrieval layers, including a real-time web search component, a curated index, and potentially different models for different query types. The features we measured are observable proxies, not the actual signals the model processes. A page could satisfy all measured features and still not be cited if it lacks topical relevance to a specific query phrasing.

We also did not measure query-level features such as query length, presence of a named entity, or whether the query was phrased as a question versus a statement. These are likely moderating variables. Future work should include them.

Frequently Asked Questions

How many queries were used in this citation ranking study?

The study used 1000 queries, all drawn from the personal finance and investing vertical, submitted to Perplexity between March 4 and March 19, 2025. Queries covered five subcategories including brokerage accounts, tax strategy, ETF analysis, mortgage and lending, and budgeting. Each query was run twice, 48 hours apart, to measure citation stability.

What is the strongest predictor of Perplexity citation according to the regression table?

In the logistic regression, Quick Answer box presence had the largest odds ratio at 2.43, meaning pages with a visible Quick Answer box near the top of the content had approximately 2.4 times the odds of being cited. Domain authority (OR 2.03) and schema presence (OR 1.95) were close behind. All three were statistically significant at p less than 0.001.

Does content freshness matter for Perplexity citations in finance?

Content age showed a statistically significant but small negative coefficient in the regression model (OR 0.91 per standard deviation). In practice, pages published within the last 30 days had a citation rate only 4.5 percentage points higher than pages over a year old. Freshness appears to be a weak signal compared to structure and domain authority for finance queries, though this may differ for time-sensitive topics like breaking market news.

Which schema type is most associated with Perplexity citations?

Among pages with structured data, FAQPage schema had the highest citation rate at 46.2%, followed by Article schema at 38.1%, HowTo at 35.4%, and generic WebPage at 28.7%. Pages with no schema at all had a citation rate of 19.6%. FAQPage schema may be particularly effective because it pre-structures question-and-answer content in a format that AI retrieval systems can parse directly.

How consistent are Perplexity citations across repeated queries?

Across the 1000-query dataset with two runs per query, 61.7% of queries produced identical citation sets in both runs. Stability was higher for navigational and regulatory queries where authoritative primary sources exist, and lower for analytical or opinion-type queries. This means citation optimization should be treated as probabilistic rather than a guaranteed outcome for any single query.

Is it possible to outrank high-DA sites in Perplexity citations with better content structure?

Partially. The regression model shows that schema, Quick Answer boxes, and table count all provide independent citation lift even after controlling for domain authority. However, the domain authority odds ratio of 2.03 is large enough that a very low-DA site cannot fully compensate with structure alone. The most realistic path for lower-authority publishers is to target narrow, specific finance queries where high-DA generalist sites have thin coverage, and to optimize structure aggressively for those specific pages.


June 22, 2026