Building the Stack: Turning RAG Pipelines into Enterprise-Grade Data Subscriptions

by William Hakim, April 16, 2026

Halcyon’s Antique Roadshow may have wrapped for 2026 (if you’re after a decommissioned substation for your home office warehouse, you’ll just have to wait for next year), but we wanted to take a moment to talk about how we built our first fully in-app data subscription (our Roadshow sale report), and what that portends for the future of Halcyon’s platform. To do that, we’ll peel back the layers of the Halcyon product stack, and show how our features build on each other to help energy professionals find signal amidst the regulatory information firehose.

Search and Filtering

Halcyon collects public data from over 71 U.S. state and federal agencies (more than 6.4 million documents as of writing) and a number of other authoritative sources. We’ve recently added Air Quality Permits from seven key markets (and counting!). With a corpus of this size, we’ve had to make decisions about what kind of search to build, and what kind of precision/recall tradeoffs to make. We believe that our users have a strong bias towards precision: they’d rather have the most up-to-date answer from exactly the right source the majority of the time than a mostly-correct answer from an outdated source all of the time. This belief led us to build a search experience with a heavy emphasis on metadata filtering, rather than the blank-text-box approach of consumer search engines like Google (which operate at a different point on the precision-recall curve).

In order to build powerful filters, we’ve needed to get up close and personal with each and every data source in our catalog and map out the right abstractions over their data. These abstractions range from the simple (knowing that a “case number” in the Ohio PUC’s system and a “proceeding id” in California’s system both correspond to a regulatory docket number) to the complex (training ML models that can recognize filing types like Integrated Resource Plans or Rate Cases from metadata alone, even across agencies that label things differently). They add up to a regulatory search engine that allows users to find documents with a high degree of precision by combining source, date, docket, keyword, filing type, and topic filters.
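To make the simpler end of that range concrete, here is a minimal Python sketch of mapping source-specific field names onto one shared schema and then combining filters. It is an illustration under stated assumptions, not our production code: the source keys (`ohio_puc`, `california_puc`), the `Document` fields, the `normalize` and `search` helpers, and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical mapping from each source's native field name to a shared one.
FIELD_ALIASES = {
    "ohio_puc": {"case number": "docket_number"},
    "california_puc": {"proceeding id": "docket_number"},
}


@dataclass(frozen=True)
class Document:
    source: str
    docket_number: str
    filing_type: str
    filed_on: date


def normalize(source, raw):
    """Rename source-specific fields so every document shares one schema."""
    aliases = FIELD_ALIASES.get(source, {})
    fields = {aliases.get(key, key): value for key, value in raw.items()}
    return Document(source=source, **fields)


def search(docs, source=None, filing_type=None, filed_on_or_after=None):
    """Keep documents that satisfy every supplied filter (logical AND)."""
    return [
        doc for doc in docs
        if (source is None or doc.source == source)
        and (filing_type is None or doc.filing_type == filing_type)
        and (filed_on_or_after is None or doc.filed_on >= filed_on_or_after)
    ]


# Made-up records, for illustration only.
catalog = [
    normalize("ohio_puc", {
        "case number": "OH-0001",
        "filing_type": "Rate Case",
        "filed_on": date(2026, 1, 5),
    }),
    normalize("california_puc", {
        "proceeding id": "CA-0002",
        "filing_type": "Rate Case",
        "filed_on": date(2026, 3, 2),
    }),
    normalize("california_puc", {
        "proceeding id": "CA-0003",
        "filing_type": "Integrated Resource Plan",
        "filed_on": date(2026, 3, 9),
    }),
]

recent_rate_cases = search(
    catalog, filing_type="Rate Case", filed_on_or_after=date(2026, 2, 1)
)
print([doc.docket_number for doc in recent_rate_cases])  # ['CA-0002']
```

Once both agencies' identifiers land in the same `docket_number` field, a single filter works across sources, which is the property the rest of the stack relies on.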

This is important because our approach to search is the foundation of everything else in the stack. All of the subsequent capabilities build atop the ability to reliably constrain the catalog with high precision.

[Figure: Roadshow diagram 1]

Queries

Halcyon’s query pipeline builds on our search tech. In some ways, our queries use a relatively traditional Retrieval-Augmented Generation (RAG) pipeline, and we face many traditional RAG problems (chunking, embedding, re-ranking, etc.). In other ways, the hard requirement of querying against complex metadata filters forced us to make significant investments in our technology. We’ve seen a number of cool projects that attempt to “solve” vector search at scale, but each comes with drawbacks. For example, many database products currently on the market struggle to filter on metadata efficiently while simultaneously searching across a corpus of billions of embeddings. There is often a tradeoff between recall, query speed, and the ability to compose complex filters over high-cardinality fields.

Additionally, continuously iterating on search-and-filtering requires the ability to quickly update document and embedding metadata. We can now re-index our entire catalog in minutes or hours, rather than days or weeks. Search and filtering also informs our query pipeline in two ways: through specialized re-rankers (e.g. one that uses a graph of extracted authorship metadata to compute the authoritativeness of individual documents), and by augmenting the raw data we pass to LLMs to rewrite queries and generate answers.

This all means that you can ask a natural-language question scoped to the jurisdictions and filing types you care about, and get a synthesized, cited answer drawn from the most authoritative sources in the catalog — not just a list of links.
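The interplay between metadata filters and vector search can be illustrated with a toy, brute-force example: apply the filters first, then rank only the surviving chunks by embedding similarity. This is a sketch for intuition, not a description of our index. A system at the scale of billions of embeddings needs a purpose-built index rather than a linear scan, and the chunk layout, the two-dimensional vectors, and the `filtered_search` helper are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(chunks, query_vec, filters, top_k=3):
    """Apply metadata filters first, then rank the survivors by similarity."""
    candidates = [
        chunk for chunk in chunks
        if all(chunk["meta"].get(field) == value for field, value in filters.items())
    ]
    candidates.sort(key=lambda chunk: cosine(chunk["vec"], query_vec), reverse=True)
    return candidates[:top_k]


# Tiny made-up corpus: 2-D vectors stand in for real embeddings.
chunks = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"state": "CA", "filing_type": "IRP"}},
    {"id": "b", "vec": [0.8, 0.6], "meta": {"state": "OH", "filing_type": "IRP"}},
    {"id": "c", "vec": [0.6, 0.8], "meta": {"state": "OH", "filing_type": "IRP"}},
    {"id": "d", "vec": [0.0, 1.0], "meta": {"state": "OH", "filing_type": "Rate Case"}},
]

hits = filtered_search(
    chunks, [1.0, 0.0], {"state": "OH", "filing_type": "IRP"}, top_k=2
)
print([chunk["id"] for chunk in hits])  # ['b', 'c']
```

Chunk `a` is the closest match to the query vector overall, but it never appears in the results because it fails the jurisdiction filter. That is the precision-first behavior described in the search section: the filter decides what is eligible, and similarity only orders what remains.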

[Figure: Roadshow diagram 2]

Alerts

Alerts consist of three pieces: a windowing specification, a set of search filters, and a prototype query. The windowing spec aggregates the firehose of documents we ingest into groups; for example, a “weekly” spec generates a set of documents per week. (This specification also supports SQL-esque GROUP BY operators, allowing you to group documents across a search filter dimension such as “Docket ID,” resulting in one window of documents per unique docket per week.) The search filters are then applied to each window, further narrowing each slice down to a digestible chunk; each chunk in turn becomes the filters for a query based on the prototype. “Consumers” listen for the results of these queries; for example, an “email consumer” notifies the user who set up the alert of a new query response via email.

Alerts provide a powerful mechanism for staying up to date on developments you care about; rather than a single point-in-time query, Halcyon continually slices, filters, and queries the firehose of data to bring you only relevant updates in near real time. In the future, we’ll support other mechanisms of alert consumption.
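The three pieces of an alert, plus consumers, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than our implementation: windows are keyed by ISO week, the prototype query is a stand-in function instead of a real RAG call, the "email consumer" just appends to a list, and every name and record below is hypothetical.

```python
from collections import defaultdict
from datetime import date


def weekly_windows(docs, group_by=None):
    """Bucket documents by ISO (year, week), optionally also by one field."""
    windows = defaultdict(list)
    for doc in docs:
        iso = doc["filed_on"].isocalendar()
        key = (iso[0], iso[1])  # (ISO year, ISO week number)
        if group_by is not None:
            key += (doc[group_by],)  # GROUP BY-style extra dimension
        windows[key].append(doc)
    return dict(windows)


def run_alert(docs, group_by, matches, prototype_query, consumers):
    """Window the documents, filter each window, query it, notify consumers."""
    for key, window in sorted(weekly_windows(docs, group_by).items()):
        chunk = [doc for doc in window if matches(doc)]
        if not chunk:
            continue  # nothing relevant in this window, so no notification
        response = prototype_query(chunk)
        for consume in consumers:
            consume(key, response)


# Made-up filings, for illustration only.
docs = [
    {"filed_on": date(2026, 4, 6), "docket_id": "D-1", "filing_type": "Rate Case"},
    {"filed_on": date(2026, 4, 8), "docket_id": "D-1", "filing_type": "Rate Case"},
    {"filed_on": date(2026, 4, 9), "docket_id": "D-2", "filing_type": "Rate Case"},
    {"filed_on": date(2026, 4, 14), "docket_id": "D-2", "filing_type": "Comment"},
    {"filed_on": date(2026, 4, 15), "docket_id": "D-1", "filing_type": "Rate Case"},
]

inbox = []  # stand-in for an email consumer
run_alert(
    docs,
    group_by="docket_id",
    matches=lambda doc: doc["filing_type"] == "Rate Case",
    prototype_query=lambda chunk: f"{len(chunk)} new rate case filing(s)",
    consumers=[lambda key, response: inbox.append((key, response))],
)
for key, response in inbox:
    print(key, response)
```

With these inputs there are four windows (two dockets across two weeks), but only three notifications: the second week of docket `D-2` contains only a comment, so the filter leaves it empty and nothing is sent.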

[Figure: Roadshow diagram 3]

Data Subscriptions

If you’ve used Halcyon’s Data Subscriptions before, you know them as .csv files, updated monthly, that our team compiles, QAs, and delivers as static downloads. They are useful, and more recent than anything else in the market, but they are snapshots: current as of the day we publish them, and already aging by the time you open the file.

Halcyon’s Antique Roadshow was our first data subscription built entirely on top of alerts, and it represents a fundamentally different model. Since alerts are composed of a search and a query, a grid with searches as rows and prototype queries as columns creates one alert per cell. (The windowing specification determines how frequently the cell is updated.) A special “Data Subscription Consumer” takes the output of each cell’s alert and uses it to update that cell. One additional wrinkle for data subscriptions involves formatting the outputs of each alert: instead of a blob of unstructured text, users often want a single number per column, or an enum (e.g. “True,” “False,” or “Not Specified”). To solve this problem, each column automatically infers a JSON schema from its prototype query (plus a bit of guidance about the expected output data type).
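A rough Python sketch of the grid idea follows. It makes several assumptions worth calling out: the schema here is derived only from an explicit output-type hint (the inference from the prototype query itself is not modeled), the schema shape with a single `value` property is our own choice for the example, and the row names, column names, queries, and `update_cell` helper are all hypothetical.

```python
import json
from itertools import product


def column_schema(output_type, choices=None):
    """Build a JSON Schema for one column from its output-type hint."""
    if output_type == "enum":
        value = {"type": "string", "enum": list(choices)}
    elif output_type == "number":
        value = {"type": "number"}
    else:
        value = {"type": "string"}
    return {
        "type": "object",
        "properties": {"value": value},
        "required": ["value"],
        "additionalProperties": False,
    }


def build_grid(rows, columns):
    """Create one alert definition per (row, column) cell."""
    return {
        (row["name"], col["name"]): {
            "filters": row["filters"],
            "prototype_query": col["query"],
            "schema": column_schema(col["output_type"], col.get("choices")),
        }
        for row, col in product(rows, columns)
    }


def update_cell(values, alerts, cell, raw_output):
    """Toy data subscription consumer: parse, check, and store one cell."""
    value = json.loads(raw_output)["value"]
    spec = alerts[cell]["schema"]["properties"]["value"]
    if "enum" in spec and value not in spec["enum"]:
        raise ValueError(f"{value!r} is not an allowed value for {cell}")
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if spec["type"] == "number" and not is_number:
        raise ValueError(f"{value!r} is not a number for {cell}")
    values[cell] = value


# Hypothetical rows (searches) and columns (prototype queries).
rows = [
    {"name": "Ohio", "filters": {"state": "OH"}},
    {"name": "California", "filters": {"state": "CA"}},
]
columns = [
    {"name": "sale_price_usd", "query": "What was the sale price in USD?",
     "output_type": "number"},
    {"name": "includes_land", "query": "Does the sale include land?",
     "output_type": "enum", "choices": ["True", "False", "Not Specified"]},
]

alerts = build_grid(rows, columns)  # 2 rows x 2 columns = 4 alerts
values = {}
update_cell(values, alerts, ("Ohio", "includes_land"), '{"value": "Not Specified"}')
update_cell(values, alerts, ("Ohio", "sale_price_usd"), '{"value": 1250000}')
print(values)
```

Constraining each column to a schema means a cell either receives a value of the expected shape or the update is rejected, which keeps free-form model output from leaking into a grid that users expect to sort and filter.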

The implication is significant: data subscriptions that once lived in spreadsheets and updated monthly can now live in the platform and update on a near-continuous basis as new filings land. The information reflects the latest regulatory activity without waiting for a human to manually re-pull and re-process it.

[Figure: Roadshow diagram 4]

What Comes Next

If you’ve been following along, you might recognize some of these ideas from the Crawl, Walk, Run roadmap we laid out alongside our Series A. In-app Data Subscriptions are the first real step into Walk territory: giving users structured, continuously updating answers to complex multi-dimensional questions, with the system doing the heavy lifting of decomposition, filtering, and synthesis.

Without giving the game away, we’re building towards a future in which end-users define the columns and rows using natural language and are able to ask their own questions of existing datasets or even build their own data subscriptions from scratch. Watch this space for more updates!

If you'd like to discuss our technical approach, I'd love to chat: william@halcyon.io
