Hi there, I'm Allen! I'm building intelligent systems to power a
more sustainable world at
Nectar. We apply advances in AI to disrupt the way
companies buy energy for their operations.
I grew up in a quiet town in Massachusetts, where I studied math and chemistry. At MIT, I discovered a passion for
building software. I owe many of my positive life experiences over the years to serendipitous friendships and whimsical
side-quests. I'm grateful to have the rare opportunity to work on hard problems with super talented people.
Nectar
Nectar is an AI climate-tech company that's building the
future of AI infrastructure sustainably. We believe that for AI to reach massive adoption across all industries
globally, the following must be accomplished:
1. Easy access to cheap and clean energy to power datacenters and emissive companies
2. Agent-friendly web infrastructure that allows web agents to iterate at scale
3. Feature extraction from unstructured PDFs, images, and HTML with 99.999% accuracy
At Nectar, we're building (1) externally and (2) + (3) internally. Our main product monitors companies' energy
usage and procures electricity at the best price for their needs. To solve this problem at scale, we built
scrapers to download terabytes of data. This data, like most data on the internet, is unstructured: think PDFs, websites, and emails. So we built data pipelines to pull it all into one organized schema. We aspire to one day open-source our internal products to help the community build more sustainable AI.
Examples of some things we've built:
- A text-forwarding service on a Raspberry Pi that lets our service maintain ongoing access to MFA/2FA-protected websites. We tried Google Voice, Twilio, and a few other solutions, and realized that to do it right, we had to do it ourselves.
- Self-hosted infrastructure to scrape websites in headed mode, building on top of Browserbase and its competitors. Our scrapers run from Mac minis in our office, proxied through residential IP addresses.
- 99% field-level accuracy on structured data extraction from unstructured documents, images, and HTML. We use Temporal for orchestration and workflow management and leverage Reducto for OCR.
However, we still have a long way to go. Most notably, 99% isn't good enough for our product: each document needs around 1,000 JSON fields extracted, so to parse a document perfectly 99% of the time (i.e. be correct on all 1,000 fields), we need 99.999% field-level accuracy (assuming a uniform distribution of errors). Unfortunately, existing AI tools often prioritize latency over accuracy, because highly valued AI applications like coding and customer-service agents can tolerate occasional hallucinations and need immediate responses. Our product requires more 9's in the SLA, and we can afford to wait on more LLM calls. So most existing frameworks don't work for us, and we end up building a lot in-house.
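To make the 99.999% target concrete, here is the back-of-the-envelope arithmetic (a minimal sketch in Python; the 1,000-field count is the rough figure from above):

```python
# Per-field accuracy needed so that a ~1,000-field document parses perfectly
# 99% of the time, assuming errors are independent and uniformly distributed.
fields_per_doc = 1_000
target_doc_accuracy = 0.99

required_field_accuracy = target_doc_accuracy ** (1 / fields_per_doc)
print(f"{required_field_accuracy:.6f}")  # ~0.999990, i.e. five 9's per field

# For contrast, 99% field-level accuracy almost never yields a perfect document:
print(f"{0.99 ** fields_per_doc:.6f}")   # ~0.000043
```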
Some of the problems we're working on right now:
- Prompt design abstractions: We're thinking about how to make prompt design more modular, prioritizing reusability and testability. Sometimes slightly different tasks require only small changes to the prompt, and we'd like to keep things DRY. Frameworks like LangChain end up abstracting away too much of the design process, making debugging and testing more difficult. Cursor's priompt is a strong move in the right direction. However, our use case differs from Cursor's because we're much more accuracy-sensitive: when the context window is the limiting factor, priompt truncates the input to fit into the 200k context window. Truncation risks losing critical context, so we have to either divide and conquer or compress the context losslessly. How can we iterate on priompt for 99.999%-SLA applications? (A sketch of the kind of composition we have in mind follows this list.)
- Long list extraction: Imagine trying to extract the list of unique people mentioned in a 200-page magazine. When the extracted list is long, LLMs become lazy or duplicate data they have already seen. There are tricky logical duplicates that we can't handle with simple deduplication (e.g. "Barack Obama's daughter's mother", "the First Lady", and "Mrs. Obama" are the same person, but if "Mrs. Obama" actually means Barack Obama's mother, then it's a different person). Even worse, when we divide list extraction into multiple steps, we often lose the thoughts and reasoning from the earlier steps and struggle to retain that context. What tools do we need to give LLMs to help with list extraction and deduplication? Will a scratchpad for writing thoughts suffice? (One possible shape of that scratchpad is sketched after this list.)
- Scalable scraper infrastructure: We have a large number of scrapers that we use to download data from the internet. We need to scale these scrapers to terabytes of data per day, avoid rate limiting, solve reCAPTCHA, and more. Reworkd, Browserbase, and Bright Data have made good progress, but we still need to build a lot of infrastructure to make sure we can scale to our needs. (A minimal headed-browser-plus-residential-proxy sketch follows this list.)
- Schema design: One unique challenge of our product is that we track relationships between objects that are time-dependent. E.g. a building in our database may be associated with customer number 123 in 2024 but become associated with customer 456 in 2025. How can we track and query these relationships with time as a new dimension? For utility companies in particular, there are sometimes multiple IDs associated with an account (your account number vs. your meter number, etc.), and we need to build in the flexibility to let our customers query by their ID of choice. How can we build a schema that scales and is flexible enough to serve our customers' needs while trying to follow principles like 3NF? (One possible time-bounded association schema is sketched after this list.)
- Testing infrastructure: Most ML tools, like Weights & Biases, are designed for model testing, where training costs dominate testing costs. For building AI applications, testing cost ends up dominating (between $0.20 and $2.00 per document). Caching helps a little, but we still can't run our data pipeline on thousands of documents for each PR just to test a new approach. At the same time, we need to make sure that when we improve the pipeline for one type of document, performance doesn't regress on other documents. One interesting proposal is to build a large test set of documents and tag which functions and prompts in the codebase are responsible for the pipeline's performance on each document; then, on each push, we sample from the documents that would be affected. Obviously, this still feels quite expensive to maintain, but it's a start. What tools should we build for efficient testing? (A toy version of that tagging-and-sampling idea is sketched after this list.)
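On the prompt-design question above: the shape we have in mind is roughly priompt-style composition of reusable pieces, but with an explicit policy when the pieces don't fit the budget (compress or split rather than silently truncate). This is a minimal sketch rather than our actual implementation; the PromptPiece and render_within_budget names and the token estimator are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PromptPiece:
    """A reusable prompt fragment with its rendered text and a priority."""
    name: str
    text: str
    priority: int  # higher = more important to keep verbatim

def token_estimate(text: str) -> int:
    # Crude stand-in for a real tokenizer.
    return max(1, len(text) // 4)

def render_within_budget(
    pieces: list[PromptPiece],
    budget: int,
    compress: Callable[[str], str],
) -> str:
    """Compose pieces into one prompt. If the budget is blown, compress the
    lowest-priority pieces first; if it is still blown, raise so the caller
    can divide and conquer instead of losing context to truncation."""
    def total() -> int:
        return sum(token_estimate(p.text) for p in pieces)

    if total() > budget:
        for piece in sorted(pieces, key=lambda p: p.priority):
            piece.text = compress(piece.text)
            if total() <= budget:
                break
    if total() > budget:
        raise ValueError("Prompt over budget; split the task rather than truncate.")
    return "\n\n".join(p.text for p in sorted(pieces, key=lambda p: -p.priority))
```

The point is only that the failure mode is explicit: nothing gets dropped without the pipeline knowing about it.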
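On long-list extraction: one concrete version of the scratchpad idea is to carry a running registry of canonical entities, along with short notes on each merge decision, across chunks, so later calls can see the reasoning from earlier ones. Everything below is hypothetical: call_llm is a placeholder for whatever model client is in use, and the prompt is only schematic.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; assumed to return JSON text."""
    raise NotImplementedError

def extract_people(chunks: list[str]) -> list[dict]:
    """Chunked extraction with a running scratchpad of canonical entities.

    The scratchpad carries both the deduplicated list so far and one-line notes
    on merge decisions (e.g. why "the First Lady" was or wasn't merged with
    "Mrs. Obama"), so earlier reasoning isn't lost in later steps."""
    scratchpad: dict = {"people": [], "notes": []}
    for chunk in chunks:
        prompt = (
            "You are maintaining a deduplicated list of unique people.\n"
            f"Current list and merge notes:\n{json.dumps(scratchpad, indent=2)}\n\n"
            f"New text:\n{chunk}\n\n"
            "Return JSON with keys 'people' (the updated deduplicated list) and "
            "'notes' (one line per merge or non-merge decision, with reasoning)."
        )
        scratchpad = json.loads(call_llm(prompt))
    return scratchpad["people"]
```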
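On scraper infrastructure: the basic building block is just a headed browser routed through a residential proxy. A minimal Playwright sketch with a placeholder proxy endpoint and credentials; the hard parts (scheduling, rate limiting, CAPTCHA handling) live outside this snippet.

```python
from playwright.sync_api import sync_playwright

RESIDENTIAL_PROXY = {  # placeholder endpoint and credentials
    "server": "http://proxy.example.net:8000",
    "username": "user",
    "password": "pass",
}

with sync_playwright() as p:
    # Headed mode: some sites behave differently (or block outright) when headless.
    browser = p.chromium.launch(headless=False, proxy=RESIDENTIAL_PROXY)
    page = browser.new_page()
    page.goto("https://example.com")
    html = page.content()  # hand off to the extraction pipeline from here
    browser.close()
```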
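On schema design: one way to keep the time dimension in a roughly 3NF shape is to push it into the association table as a validity range, and to keep external identifiers (account number, meter number, and so on) in their own table so customers can query by whichever ID they hold. A minimal SQLAlchemy sketch; the table and column names are illustrative, not our production schema.

```python
from datetime import date
from sqlalchemy import Column, Date, ForeignKey, Integer, String, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Building(Base):
    __tablename__ = "building"
    id = Column(Integer, primary_key=True)

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)

class BuildingCustomer(Base):
    """Time-bounded association: which customer a building belongs to, and when."""
    __tablename__ = "building_customer"
    id = Column(Integer, primary_key=True)
    building_id = Column(Integer, ForeignKey("building.id"), nullable=False)
    customer_id = Column(Integer, ForeignKey("customer.id"), nullable=False)
    valid_from = Column(Date, nullable=False)
    valid_to = Column(Date)  # NULL means the association is still current

class CustomerIdentifier(Base):
    """External IDs (account number, meter number, ...) that map to a customer."""
    __tablename__ = "customer_identifier"
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey("customer.id"), nullable=False)
    id_type = Column(String, nullable=False)  # e.g. "account_number", "meter_number"
    value = Column(String, nullable=False)

def customer_for(session: Session, building_id: int, as_of: date) -> Customer | None:
    """Which customer was a building associated with on a given date?"""
    stmt = (
        select(Customer)
        .join(BuildingCustomer, BuildingCustomer.customer_id == Customer.id)
        .where(
            BuildingCustomer.building_id == building_id,
            BuildingCustomer.valid_from <= as_of,
            (BuildingCustomer.valid_to.is_(None)) | (BuildingCustomer.valid_to >= as_of),
        )
    )
    return session.execute(stmt).scalars().first()
```

Answering "customer 123 in 2024 vs. customer 456 in 2025" then becomes an as_of query rather than a schema change.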
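On testing infrastructure: the tag-and-sample proposal above can be prototyped with nothing more than a mapping from test documents to the prompts and functions they exercise, plus a bounded, deterministic sampler per PR. A toy sketch; the document and symbol names are made up.

```python
import random

# Hypothetical registry: which prompts/functions each test document exercises.
DOC_TAGS: dict[str, set[str]] = {
    "acme_utility_bill.pdf": {"extract_account_ids", "tariff_prompt_v3"},
    "gridco_invoice.pdf": {"tariff_prompt_v3", "bill_total_prompt"},
    "meter_photo_042.jpg": {"extract_meter_reading"},
}

def affected_documents(changed_symbols: set[str]) -> list[str]:
    """Documents whose pipeline output could change when these symbols change."""
    return sorted(doc for doc, tags in DOC_TAGS.items() if tags & changed_symbols)

def sample_for_pr(changed_symbols: set[str], budget: int = 50, seed: int = 0) -> list[str]:
    """Bounded, deterministic sample of affected documents to re-run on a PR."""
    docs = affected_documents(changed_symbols)
    if len(docs) <= budget:
        return docs
    return random.Random(seed).sample(docs, budget)

# e.g. sample_for_pr({"tariff_prompt_v3"}) -> only the documents that use that prompt
```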
If you have interesting ideas for these problems, or think you'd have a blast working on them, you should
consider joining our team full-time. We're a talent-dense group of researchers, engineers, and hackers
based in SF. If you'd like to chat, please reach out to me at allen [at] nectarclimate [dot] com.
Thanks to Cathy Cai, Antonio Frigo, and Ethan Yeh for reading initial versions of the website.