
Under the hood of Assessorly — TypeScript, Bash, and the math
We don’t guess—we measure. From the first click, we use maps and math to find truly comparable sales—nearby, recent, and similar in features. Each candidate is scored on distance, recency, and property differences, so you get defensible results you can see. We’re your advocate, not a black box.
Data layer: Amazon RDS for MariaDB 11.6 (InnoDB), B-trees, and spatial indexes (R-tree)
- Engine: Amazon RDS for MariaDB 11.6 (InnoDB)
- B-tree for equality/range filters: e.g., `(county_id)`, `(parcel_id, sale_date)`, `(year_built)`, `(gla)`
  - Point lookup: O(log_B N)
  - Range lookup (k rows): O(log_B N + k)
  - Space: O(N)
- Spatial index (R-tree) on `POINT` for bounding-box prefilter
  - Avg query: O(log_M N + c) (c = candidates)
  - Worst-case: O(N)
- Primary keys: `BINARY(16)` UUIDs
- Geo: `POINT` (SRID=4326) for units/parcels
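The bounding-box prefilter is what the R-tree can answer quickly; the exact distance pass happens afterwards. Below is a hypothetical TypeScript helper (not the production query code) that derives such a box from a center point and radius:

```typescript
// Hypothetical helper: derive a lat/lng bounding box for a search radius.
// The box feeds the spatial (R-tree) index; precise haversine distance is
// applied afterwards to the candidates it returns.
const EARTH_RADIUS_M = 6_371_000;

interface BBox {
  minLat: number;
  maxLat: number;
  minLng: number;
  maxLng: number;
}

function boundingBox(lat: number, lng: number, radiusM: number): BBox {
  // Degrees of latitude per meter are constant on the sphere.
  const dLat = (radiusM / EARTH_RADIUS_M) * (180 / Math.PI);
  // Longitude degrees shrink with latitude; guard against the poles.
  const dLng = dLat / Math.max(Math.cos((lat * Math.PI) / 180), 1e-12);
  return {
    minLat: lat - dLat,
    maxLat: lat + dLat,
    minLng: lng - dLng,
    maxLng: lng + dLng,
  };
}
```

The box deliberately over-selects (it circumscribes the search circle), which is exactly what a prefilter should do: no true comparable is excluded, and the O(c) precise-distance pass trims the excess.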
Comparable selection—algorithm & complexity
Staged pipeline: hard filters → spatial prefilter → precise distance → score/rank → top-k.
Scoring
\[ S(i) = w_d \cdot \frac{1}{1 + \operatorname{dist}(t,i)} + w_r \cdot e^{-\lambda \cdot \Delta\text{days}(i)} - w_g\left|\frac{GLA_i - GLA_t}{GLA_t}\right| - w_y \lvert year_i - year_t \rvert - w_a \lvert acres_i - acres_t \rvert \]

Weights are learned or hand‑tuned per market.
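Read literally, the score is a handful of arithmetic per candidate. A minimal TypeScript sketch, with illustrative (not production) weights and field names:

```typescript
// Sketch of the comp score S(i). Field names and weight values are
// illustrative assumptions, not the production schema.
interface Target {
  gla: number;       // gross living area, sq ft
  yearBuilt: number;
  acres: number;
}

interface Comp extends Target {
  distM: number;     // distance to the target, meters
  daysAgo: number;   // days since the comparable sale
}

interface Weights {
  wd: number; wr: number; wg: number; wy: number; wa: number;
  lambda: number;    // recency decay rate
}

function compScore(t: Target, c: Comp, w: Weights): number {
  return (
    w.wd / (1 + c.distM) +                              // proximity reward
    w.wr * Math.exp(-w.lambda * c.daysAgo) -            // recency decay
    w.wg * Math.abs((c.gla - t.gla) / t.gla) -          // GLA penalty
    w.wy * Math.abs(c.yearBuilt - t.yearBuilt) -        // age penalty
    w.wa * Math.abs(c.acres - t.acres)                  // lot-size penalty
  );
}
```

A same-day sale of an identical property next door scores the maximum, w_d + w_r; every difference only subtracts from there, which keeps the ranking easy to explain.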
Great-circle distance (haversine)
\[ d = 2R\,\arcsin\!\left(\sqrt{\sin^2\!\tfrac{\Delta\varphi}{2} + \cos\varphi_t\cos\varphi_i\,\sin^2\!\tfrac{\Delta\lambda}{2}}\right),\quad R \approx 6{,}371{,}000\,\text{m} \]

Complexity
- Hard SQL filters (B-tree): O(log N + k), returning k rows
- Spatial prefilter (R-tree bbox): O(log N + c) candidates
- Precise distance for c: O(c)
- Top-k selection (heap): O(c log k)
- Total: O(log N + c log k)
Here N = rows, c = candidates after bbox, k = results retained, p = features, n = samples.
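The haversine formula above transcribes directly to TypeScript; this sketch is only used on the c candidates that survive the bounding-box prefilter:

```typescript
// Great-circle (haversine) distance in meters. R is the mean Earth radius,
// matching the formula's R ≈ 6,371,000 m.
const R = 6_371_000;

function haversineM(lat1: number, lng1: number, lat2: number, lng2: number): number {
  const toRad = (deg: number) => (deg * Math.PI) / 180;
  const dPhi = toRad(lat2 - lat1);       // Δφ
  const dLambda = toRad(lng2 - lng1);    // Δλ
  const a =
    Math.sin(dPhi / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLambda / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}
```

At these radii the spherical approximation is well within appraisal tolerance; one degree of longitude at the equator comes out to about 111.2 km.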
Fuzzy address search (Levenshtein DP)
Fast, fuzzy lookups on full_address to tolerate typos, abbreviations, and partial inputs. Uses a two‑row Levenshtein DP with optional banding.
Two‑row DP (O(min(n,m)) space)
\[ D_{i,j} = \min\{ D_{i-1,j} + 1,\; D_{i,j-1} + 1,\; D_{i-1,j-1} + [a_i \neq b_j] \}\,;\quad D_{0,j}=j,\; D_{i,0}=i \]

Per string (banded): O(d·L), where d is the band width and L the string length. Top‑k selection with a heap is O(U log k) vs. a full sort's O(U log U), where U = candidates.
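The two-row DP with an optional band fits in one small function. This is a self-contained TypeScript sketch (names are illustrative); cells farther than the band from the diagonal are treated as unreachable:

```typescript
// Two-row Levenshtein DP with an optional band. Cells more than `band`
// off the diagonal are left at Infinity, giving the O(d·L) banded cost;
// with the default band this is the classic full DP.
function levenshtein(a: string, b: string, band = Infinity): number {
  if (b.length > a.length) [a, b] = [b, a]; // rows sized by the shorter string: O(min(n, m)) space
  const n = a.length;
  const m = b.length;
  if (n - m > band) return Infinity; // the band can never reach cell (n, m)
  let prev = Array.from({ length: m + 1 }, (_, j) => (j <= band ? j : Infinity));
  let curr = new Array<number>(m + 1);
  for (let i = 1; i <= n; i++) {
    curr.fill(Infinity);
    curr[0] = i <= band ? i : Infinity;
    const lo = Math.max(1, i - band);
    const hi = Math.min(m, i + band);
    for (let j = lo; j <= hi; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1; // [a_i ≠ b_j]
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    [prev, curr] = [curr, prev];
  }
  return prev[m];
}
```

Banding pays off because typo-tolerant address search rarely cares about distances above 2 or 3: anything further is a different address, not a misspelling.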
Valuation (AVM) — multiple linear regression
Interpretable MLR with ridge regularization and sanity constraints (e.g., more GLA should not reduce value).
Model
\[ y = \beta_0\,\mathbf{1} + X\,\boldsymbol{\beta} + \varepsilon \]

\[ \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \; \lVert y - X\boldsymbol{\beta} \rVert_2^2 + \lambda\, \lVert \Gamma\boldsymbol{\beta} \rVert_2^2 \]

Training complexity (QR/SVD): O(n p^2 + p^3). Prediction: O(p) per unit.
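For illustration only, the ridge estimate can be sketched by forming and solving the normal equations (XᵀX + λI)β = Xᵀy directly, taking Γ = I; a production path would use QR/SVD as noted above, and all names here are hypothetical:

```typescript
// Ridge regression via the normal equations with Γ = I, solved by Gaussian
// elimination with partial pivoting. A brevity sketch only; QR/SVD is the
// numerically stable choice at scale. Prepend a column of 1s to X for an
// intercept.
function ridgeFit(X: number[][], y: number[], lambda: number): number[] {
  const n = X.length;
  const p = X[0].length;
  // A = XᵀX + λI (p×p), b = Xᵀy (length p)
  const A = Array.from({ length: p }, (_, i) =>
    Array.from({ length: p }, (_, j) => {
      let s = i === j ? lambda : 0;
      for (let r = 0; r < n; r++) s += X[r][i] * X[r][j];
      return s;
    })
  );
  const b = Array.from({ length: p }, (_, i) => {
    let s = 0;
    for (let r = 0; r < n; r++) s += X[r][i] * y[r];
    return s;
  });
  // Forward elimination with partial pivoting.
  for (let col = 0; col < p; col++) {
    let piv = col;
    for (let r = col + 1; r < p; r++) {
      if (Math.abs(A[r][col]) > Math.abs(A[piv][col])) piv = r;
    }
    [A[col], A[piv]] = [A[piv], A[col]];
    [b[col], b[piv]] = [b[piv], b[col]];
    for (let r = col + 1; r < p; r++) {
      const f = A[r][col] / A[col][col];
      for (let c = col; c < p; c++) A[r][c] -= f * A[col][c];
      b[r] -= f * b[col];
    }
  }
  // Back substitution.
  const beta = new Array<number>(p).fill(0);
  for (let i = p - 1; i >= 0; i--) {
    let s = b[i];
    for (let j = i + 1; j < p; j++) s -= A[i][j] * beta[j];
    beta[i] = s / A[i][i];
  }
  return beta;
}
```

With λ = 0 this reduces to ordinary least squares; a nonzero λ shrinks coefficients toward zero, which is what makes small per-market fits stable.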
Empirical‑Bayes / fixed‑effects meta‑analysis of past reports, implemented with:
- Weighted Welford online stats to stream means/variances per feature (and intercept) globally and by region (running mean + M2 → variance = M2/weight).
- Precision‑weighted pooling to combine sources (global + region).
- Weights are the comparable count per report; 5‑minute cache on the snapshots.
This is inverse‑variance weighting (Normal–Normal conjugate update), not a random‑effects model.
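Both building blocks, the weighted Welford accumulator and the inverse-variance pool, fit in a few lines of TypeScript. This is a sketch of the technique, not the production code:

```typescript
// Weighted Welford: streams a weighted mean and M2 so that
// variance = M2 / totalWeight. Weights here stand in for the
// comparable count per report.
class WeightedWelford {
  private w = 0;    // total weight
  private mean = 0; // running weighted mean
  private m2 = 0;   // running weighted sum of squared deviations

  add(x: number, weight: number): void {
    this.w += weight;
    const delta = x - this.mean;
    this.mean += (weight / this.w) * delta;
    this.m2 += weight * delta * (x - this.mean);
  }

  get stats(): { mean: number; variance: number } {
    return { mean: this.mean, variance: this.w > 0 ? this.m2 / this.w : 0 };
  }
}

// Precision-weighted (inverse-variance) pooling of two estimates,
// e.g. the global and per-region means: the Normal–Normal conjugate update.
function pool(m1: number, v1: number, m2: number, v2: number) {
  const p1 = 1 / v1;
  const p2 = 1 / v2;
  return { mean: (p1 * m1 + p2 * m2) / (p1 + p2), variance: 1 / (p1 + p2) };
}
```

The pooled variance is always smaller than either input, which is the point: a thin regional sample borrows strength from the global prior instead of swinging on a handful of reports.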
ETL & geo ingest — streaming, concurrency, and costs
- Format: line‑delimited GeoJSON, TIGER/Line shapefiles, county JSON
- Pattern: staging → normalize → idempotent upsert (hash guard)
- Geometry: `ST_GeomFromWKB`/WKT, SRID=4326 enforced
- Concurrency: bounded workers (I/O vs DB write balance)
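The hash guard behind the idempotent upsert can be sketched as a stable digest over the normalized row; the canonicalization and digest choice (SHA-256) below are assumptions, not the production schema:

```typescript
import { createHash } from "node:crypto";

// Hash guard sketch for idempotent upserts: a stable digest of the
// normalized row. Keys are sorted so semantically identical rows always
// hash alike, letting a re-run skip unchanged rows (e.g. only upsert
// WHERE row_hash <> :hash). SHA-256 is an illustrative choice.
function rowHash(row: Record<string, unknown>): string {
  const canonical = JSON.stringify(
    Object.keys(row)
      .sort()
      .map((k) => [k, row[k]])
  );
  return createHash("sha256").update(canonical).digest("hex");
}
```

Because the digest only depends on row content, replaying a whole county file is safe: identical rows short-circuit, and only genuine changes touch the database.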
Observability: Grafana, Prometheus, GA4
- Metrics: Prometheus → Grafana (API p95, DB latency, queue depth, Longhorn volumes)
- Product analytics: Google Analytics (GA4)
Security & secrets: Vault
- Vault for scoped DB creds + API tokens, with rotation
- Ingress‑NGINX + ModSecurity; TLS via Let’s Encrypt or AWS ACM
- AWS SSM; least‑privilege IAM
AWS deployment — GitHub Actions
The AWS footprint (Route 53, ACM, S3, CI runners, etc.) is maintained in a public repo: MilesSystems/aws-deployment. Diagrams in that repo’s Diagrams/ folder are embedded below and auto‑refresh from GitHub.
Infrastructure: Harvester, Longhorn, R640/R510, AWS
- Harvester (KVM + RKE2) for local virtualization + Kubernetes
- Longhorn distributed block storage (snapshots, rebuilds, replicas)
- R640: compute / hot I/O (DB, API, AVM)
- R510: capacity / archive / bulk datasets
- AWS: Route 53, ACM, S3; self‑hosted GitHub Actions runners
Why this works (performance & explainability)
- Sub‑second search by keeping candidates c small and using heap top‑k
- Auditable regression with intervals + constraints
- Linear‑time streaming ETL
- Right‑sized hybrid infra (R640 hot‑path, R510 bulk; Longhorn resilience)
Quick reference (Big‑O)
- B‑tree point: O(log N)
- B‑tree range (k rows): O(log N + k)
- R‑tree bbox: O(log N + c) avg
- Distance calc (c): O(c)
- Top‑k heap: O(c log k)
- End‑to‑end comps: O(log N + c log k)
- MLR train (QR/SVD): O(n p^2 + p^3)
- MLR predict: O(p)
- ETL parse+write: O(N log N)
Forward paths
- Quantile regression for robust intervals
- Monotonic constraints across core features (GLA, lot size)
- Adaptive indexer driven by slow‑query logs
- Neighborhood embeddings to prune false comps
