CASE STUDY
9 min read
September 5, 2025

Building a Real Estate Analytics Engine from Scratch

How we built a property valuation and investment analytics platform that processes 12 million listings, integrates satellite imagery and census data, and produces valuations within 4.2% of appraised values across 30 metro areas.

Real estate data is not messy. It is adversarially messy, because every listing agent has an incentive to present incomplete or misleading information.

The Problem with Real Estate Data

A real estate investment firm managing $400M in residential properties across 30 US metro areas came to us with a deceptively simple request: build a system that could value any residential property in their target markets and identify properties that were undervalued relative to their characteristics and location.

Simple, right? Zillow has Zestimates. Redfin has estimates. CoreLogic has AVMs. The client had tried all of them and found them inadequate for investment decisions for three specific reasons.

First, public AVMs are optimized for accuracy across the entire market, not for accuracy on the tail of the distribution where investment opportunities exist. A model with a 3% average error can be off by 15% on exactly the properties investors care about most: distressed properties, value-add opportunities, and properties in rapidly gentrifying neighborhoods.

Second, public AVMs update slowly. Zillow updates monthly. Market conditions in hot neighborhoods change weekly. By the time a public AVM reflects a price movement, the investment window has closed.

Third, the client needed investment-specific analytics that no public AVM provides: renovation ROI estimation, rental yield prediction, neighborhood trajectory scoring, and cash flow modeling under different financing scenarios.

We needed to build the entire analytics engine from the ground up.

Data Acquisition and Integration

The first and most time-consuming challenge was assembling a comprehensive property data asset. No single data source provides everything needed for accurate property valuation.

MLS data (via RETS/RESO API feeds): Active listings, pending sales, and closed sales with property characteristics. We ingested feeds from 30 MLS systems, each with different schemas, field naming conventions, and data quality standards. Some MLS systems use "SqFt" for living area. Others use "LivingArea," "GLA," or "TotalSqFt." Normalizing these into a unified schema required mapping 847 unique field names across all 30 systems into our canonical 156-field property record.
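The normalization step reduces to maintaining an alias table and inverting it into a raw-name lookup. A minimal sketch, with hypothetical canonical field names and a hand-picked subset of aliases (the real table covers 847 raw names):

```python
# Alias table: each canonical field lists the raw names seen across MLS feeds.
# Field names here are illustrative, not the actual canonical schema.
FIELD_ALIASES = {
    "living_area_sqft": {"SqFt", "LivingArea", "GLA", "TotalSqFt"},
    "bedrooms": {"Beds", "BedroomsTotal", "BR"},
    "year_built": {"YearBuilt", "YrBuilt"},
}

# Invert once at startup into a raw-name -> canonical-name lookup.
RAW_TO_CANONICAL = {
    raw: canonical
    for canonical, aliases in FIELD_ALIASES.items()
    for raw in aliases
}

def normalize_record(raw_record: dict) -> dict:
    """Map a raw MLS record onto the canonical schema, dropping unmapped fields."""
    return {
        RAW_TO_CANONICAL[k]: v
        for k, v in raw_record.items()
        if k in RAW_TO_CANONICAL
    }

record = normalize_record({"GLA": 2000, "Beds": 3, "ListAgent": "Jane Doe"})
# record == {"living_area_sqft": 2000, "bedrooms": 3}
```

In practice the hard part is not the lookup but curating the alias table itself, since some MLS systems reuse the same raw name for different meanings.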

Public records (county assessor data): Tax assessments, ownership history, lot dimensions, building permits, and recorded sales. We sourced this from county APIs where available and from aggregated data providers (ATTOM, CoreLogic) elsewhere. The challenge here is that assessor data is often 6-18 months behind actual market conditions, and the assessment methodology varies by county.

Satellite and aerial imagery: We ingested high-resolution satellite imagery (30cm/pixel) from Maxar and aerial imagery from Nearmap for all target metro areas. This provides information that no structured data source captures: roof condition, yard maintenance, presence of a pool or accessory dwelling unit, proximity to visual nuisances (power lines, commercial properties, highways).

Census and demographic data: ACS 5-year estimates at the census tract level for income, education, age distribution, household composition, and commute patterns. These features are strong predictors of neighborhood-level price trends.

Walk Score, Transit Score, and Bike Score: API access for location-based livability metrics.

School ratings: GreatSchools API for K-12 school ratings and attendance zone boundaries. School quality is the single strongest location-based predictor of residential property values in suburban markets.

Crime data: FBI UCR data and local police department APIs where available. Crime rates at the census tract level correlate with price levels but must be used carefully due to reporting inconsistencies.

The total data asset after integration: 12.3 million property records with 247 features per record, 4.1 TB of satellite imagery, and 18 months of historical listing and sales data.

The Valuation Model

We built the valuation model in three stages, each addressing a different aspect of the problem.

Stage 1: Comparable Sales Model

The foundation of any property valuation is comparable sales. What did similar properties in the same area sell for recently? This is how human appraisers work, and it is the right starting point.

Our comp model uses a learned similarity metric to identify comparable properties. Rather than applying fixed rules (same zip code, same bedroom count, within 20% on square footage), we trained a metric learning model that produces an embedding for each property. Properties that transact at similar prices are close in embedding space; properties that transact at different prices are far apart.

The model is a gradient boosted tree that predicts sale price from property features and location, but we extract the leaf node assignments as an implicit embedding. Two properties that land in similar leaf nodes across the ensemble are "comparable" in the model's learned sense. This approach captures complex interactions that simple heuristic rules miss. A 2,000 sq ft ranch in a flood zone is not comparable to a 2,000 sq ft ranch on high ground, even if they are in the same zip code.

For each target property, we retrieve the 20 nearest neighbors in embedding space from the last 12 months of sales, weight them by recency and distance, and compute a weighted median sale price adjusted for property-level differences.
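The leaf-assignment trick can be sketched with scikit-learn's gradient boosting, whose `apply()` method returns the leaf index each sample lands in per tree. This is an illustrative stand-in for the production model (the data is synthetic and the feature set is a placeholder), but the similarity definition is the same: the fraction of trees in which two properties share a leaf.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data: 500 "properties" with 6 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = X[:, 0] * 100_000 + rng.normal(scale=10_000, size=500)

model = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X, y)

# apply() gives, for each sample, the leaf it falls into in every tree.
leaves = model.apply(X).reshape(len(X), -1)   # shape (n_samples, n_trees)

def comp_similarity(i: int, j: int) -> float:
    """Fraction of trees in which properties i and j land in the same leaf."""
    return float(np.mean(leaves[i] == leaves[j]))

# The 20 most comparable properties to property 0, by shared-leaf fraction.
sims = np.array([comp_similarity(0, j) for j in range(len(X))])
top_comps = np.argsort(-sims)[1:21]
```

From here, weighting the retrieved comps by recency and similarity and taking a weighted median is straightforward; the learned part is entirely in where the ensemble places its leaf boundaries.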

Stage 2: Hedonic Pricing Model

The comp model works well in dense markets with frequent transactions but fails in sparse markets where comparable sales are rare. The hedonic model fills this gap by learning a functional relationship between property characteristics and price.

We use a LightGBM model with 247 features. The key engineering decisions:

Spatial features: Rather than using zip code or census tract as categorical features (which creates thousands of sparse categories), we use spatial embeddings. Each property's latitude and longitude are passed through a trained spatial embedding network that produces a 32-dimensional location representation. This captures neighborhood effects at a much finer granularity than administrative boundaries.
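As a rough illustration of what a dense location representation buys you, here is a random-Fourier-feature stand-in for the trained embedding network: it maps (lat, lon) to a 32-dimensional vector in which nearby coordinates produce similar vectors and distant ones do not. The production version is learned end to end; this sketch only shows the shape of the idea, and the frequency scale is an arbitrary choice.

```python
import numpy as np

EMBED_DIM = 32
rng = np.random.default_rng(42)
# Random projection frequencies; scale controls the spatial resolution
# (chosen arbitrarily here for illustration).
W = rng.normal(scale=40.0, size=(2, EMBED_DIM // 2))

def spatial_embed(lat: float, lon: float) -> np.ndarray:
    """Dense 32-dim location vector: sin/cos of random projections of (lat, lon)."""
    z = np.array([lat, lon]) @ W
    return np.concatenate([np.sin(z), np.cos(z)])

base = spatial_embed(40.7128, -74.0060)        # lower Manhattan
near = spatial_embed(40.7129, -74.0060)        # one block away
far = spatial_embed(41.7128, -74.0060)         # ~110 km north
# Embeddings vary smoothly: near is much closer to base than far is.
```

A learned network replaces the random frequencies with ones tuned so that "close in embedding space" tracks "prices move together," which is finer-grained than any administrative boundary.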

Temporal features: We include month-of-sale and compute time-varying market indices at the zip code level to capture local price trends. A property's predicted value depends not just on its characteristics but on when it is being valued relative to local market cycles.

Interaction features: We engineered 40 interaction features based on domain knowledge. Square footage interacts with bedroom count (a 1,500 sq ft 5-bedroom home is very different from a 1,500 sq ft 2-bedroom home). Lot size interacts with metro area (a 0.25-acre lot is large in Manhattan and tiny in Houston). Pool presence interacts with climate zone (a pool adds value in Phoenix and subtracts value in Minneapolis due to maintenance costs).
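The three interactions above can be sketched as plain feature transforms. Field names and the specific formulations are hypothetical; the point is that each interaction encodes a piece of domain knowledge the tree model would otherwise have to discover on its own.

```python
def interaction_features(p: dict) -> dict:
    """Hypothetical interaction features mirroring the examples in the text."""
    feats = {}
    # Square footage per bedroom: separates a cramped 1,500 sq ft 5BR
    # from a roomy 1,500 sq ft 2BR.
    feats["sqft_per_bedroom"] = p["sqft"] / max(p["bedrooms"], 1)
    # Lot size relative to the metro's median lot: 0.25 acres is large in
    # Manhattan and tiny in Houston.
    feats["lot_vs_metro_median"] = p["lot_acres"] / p["metro_median_lot_acres"]
    # Pool value depends on climate: scale pool presence by cooling demand
    # so the model can learn a positive effect in Phoenix, negative in Minneapolis.
    feats["pool_x_cooling_days"] = p["has_pool"] * p["annual_cooling_degree_days"]
    return feats

feats = interaction_features({
    "sqft": 1500, "bedrooms": 5,
    "lot_acres": 0.25, "metro_median_lot_acres": 0.18,
    "has_pool": 1, "annual_cooling_degree_days": 4500,
})
# feats["sqft_per_bedroom"] == 300.0
```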

The hedonic model achieves a median absolute percentage error (MAPE) of 4.2% on held-out sales data across all 30 metro areas. This is competitive with Zillow's published accuracy (3.5% MAPE) while being optimized for our target property types.

Stage 3: Computer Vision Augmentation

The satellite imagery provides signal that structured data cannot capture. We built a property-level computer vision pipeline that extracts features from aerial imagery:

Roof condition classifier: A ResNet-50 model trained on 15,000 manually labeled roof images, classifying roof condition into four categories: excellent, good, fair, and poor. Roof condition is a proxy for overall property maintenance and correlates with deferred maintenance costs.

Property improvement detector: A multi-label classifier that detects the presence of pools, decks, accessory dwelling units, detached garages, and significant landscaping from aerial imagery. These features are often missing or inaccurate in MLS data.

Neighborhood quality scorer: A custom CNN that scores the visual quality of a property's surroundings within a 200m radius. It evaluates factors like vegetation density, road quality, neighboring property maintenance, and presence of commercial or industrial uses. This feature captures neighborhood effects that census data misses.

Adding computer vision features improved the hedonic model's MAPE from 4.2% to 3.8%, a meaningful improvement that justifies the complexity of the imagery pipeline.

Investment Analytics Layer

The valuation model is the foundation, but the client needed investment-specific analytics that go beyond "what is this property worth."

Undervaluation score: The ratio of current listing price (or estimated market value for off-market properties) to the model's predicted value. Properties with a score below 0.92 are flagged as potentially undervalued. We calibrate this threshold to produce approximately 200-300 candidates per metro area per month, which matches the client's deal team capacity.
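The score itself is a simple ratio with a calibrated cutoff; the dollar figures below are illustrative:

```python
def undervaluation_score(list_price: float, predicted_value: float) -> float:
    """Ratio of asking price to model value; below 1.0 means priced under the model."""
    return list_price / predicted_value

def is_flagged(list_price: float, predicted_value: float,
               threshold: float = 0.92) -> bool:
    """Flag as potentially undervalued when the score falls below the threshold."""
    return undervaluation_score(list_price, predicted_value) < threshold

# A $460k listing the model values at $520k scores ~0.885 and is flagged.
score = undervaluation_score(460_000, 520_000)
flagged = is_flagged(460_000, 520_000)
```

The interesting engineering is in the threshold, not the formula: it is tuned so the flag rate matches what the deal team can actually evaluate.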

Renovation ROI estimator: For properties flagged as value-add opportunities, we estimate the ROI of specific renovations. The model uses historical permit data and sales pairs (same property sold before and after renovation) to estimate the value uplift from kitchen remodels, bathroom additions, roof replacement, and cosmetic updates. This varies dramatically by market. A kitchen remodel returns 85% of cost in San Diego but 120% in Nashville, due to differences in labor costs and market price levels.
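The sales-pairs idea reduces to subtracting expected market appreciation from the observed price change between the two sales. A minimal sketch with illustrative numbers (the production model pools many such pairs per renovation type and market):

```python
def pair_uplift(price_before: float, price_after: float,
                market_index_before: float, market_index_after: float) -> float:
    """Value uplift from a renovation, net of local market appreciation
    between the two sales of the same property."""
    expected_after = price_before * (market_index_after / market_index_before)
    return price_after - expected_after

# Bought at $300k with the local index at 100; the index rises to 110 (a 10%
# market move), then the property resells at $380k after a kitchen remodel.
# Without the remodel we would expect ~$330k, so the uplift is ~$50k.
uplift = pair_uplift(300_000, 380_000, 100.0, 110.0)
roi = uplift / 42_000   # against a hypothetical $42k remodel cost
```

Averaging `roi` over many pairs in a market yields per-market return figures like the San Diego and Nashville numbers above.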

Rental yield predictor: For properties being evaluated as rental investments, we predict gross and net rental yield using a separate model trained on rental listing data from Zillow Rental Manager, Apartments.com, and local MLS rental listings. This model accounts for property characteristics, location, and local rental market conditions.
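For reference, the two yield definitions the predictor targets, with illustrative inputs:

```python
def gross_yield(annual_rent: float, purchase_price: float) -> float:
    """Gross rental yield: annual rent as a fraction of purchase price."""
    return annual_rent / purchase_price

def net_yield(annual_rent: float, purchase_price: float,
              annual_costs: float) -> float:
    """Net yield after taxes, insurance, maintenance, vacancy, and management."""
    return (annual_rent - annual_costs) / purchase_price

# $2,200/month rent on a $330k purchase with $9,000/year operating costs:
# gross yield 8.0%, net yield ~5.3%.
g = gross_yield(2_200 * 12, 330_000)
n = net_yield(2_200 * 12, 330_000, 9_000)
```

The model predicts the rent and cost inputs; the yields themselves are just arithmetic on top.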

Neighborhood trajectory score: A 5-year forward-looking score that predicts whether a census tract will appreciate faster or slower than its metro area average. Features include demographic trends, building permit activity, business formation rates, transit investment, and school quality trajectories. This is the most speculative of our models, but the client finds it valuable for identifying emerging neighborhoods before price appreciation reflects the fundamental improvements.

Deployment and Operations

The system runs on AWS with the following components:

  • Data ingestion: Airflow DAGs that pull from 30+ data sources on schedules ranging from hourly (MLS feeds) to monthly (census data). Total daily data volume: approximately 2GB of structured data plus 50GB of imagery.
  • Feature store: PostgreSQL with PostGIS for spatial queries, backed by Redis for hot features used in real-time valuation requests.
  • Model serving: FastAPI application serving the valuation model with sub-500ms response time per property. Batch valuation of the entire 12.3M property database runs nightly.
  • Imagery pipeline: GPU-accelerated batch processing of satellite imagery, running on spot instances. Full refresh of 12.3M property imagery features runs weekly.
  • Dashboard: React application with Mapbox GL for geographic visualization of opportunities.

Results and Ongoing Work

After 12 months of production operation:

  • The client evaluated 3,400 properties flagged by the system and acquired 127 properties.
  • Acquired properties' appraised values averaged 7.2% above purchase price, validating the undervaluation scoring.
  • The renovation ROI estimator was within 15% of actual renovation returns on 83% of completed projects.
  • The neighborhood trajectory score correctly identified 71% of outperforming census tracts in the first year of forward-looking predictions.

The system continues to improve as we accumulate more training data, particularly for the renovation ROI and neighborhood trajectory models where outcome data takes months to years to materialize.

The key lesson from this project: in real estate analytics, the model is the easy part. Data acquisition, normalization, and quality control across dozens of heterogeneous sources is where 70% of the engineering effort goes. Any team attempting to build a real estate analytics platform should budget accordingly.
