Test residuals model based on building means #73

Open
dfsnow opened this issue Dec 19, 2024 · 2 comments

dfsnow (Member) commented Dec 19, 2024

Per discussion with @Douglasmsw, we may be able to simplify the condo model a bit by:

  1. Constructing a 5-year rolling weighted average sale price per building, weighted by recency and inclusive of the target property. Also try the leave-one-out mean and possibly a (separate) spatial lag.
  2. Calculating the residuals using the results from 1 as the initial prediction/baseline.
  3. Fitting a model on the residuals, such that the model is essentially finding the difference between the building mean and each unit. Unit-level features could include pct. ownership, sf, and # beds/baths.
  4. Finalizing the unit value by adding the model prediction to the building mean.

This doesn't account for buildings without sales/means, but we'll leave that for a separate issue.
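
For concreteness, here is a minimal sketch of steps 1-4, assuming a pandas DataFrame `sales` with hypothetical columns `building_id`, `sale_price`, `sale_date` (datetime), `pct_ownership`, `unit_sf`, `beds`, and `baths`. The Gaussian recency weights and the gradient-boosting residual model are illustrative stand-ins, not settled choices:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Step 1: recency-weighted mean sale price per building (Gaussian kernel).
def building_means(sales: pd.DataFrame, sigma_days: float = 365.0) -> pd.Series:
    age_days = (sales["sale_date"].max() - sales["sale_date"]).dt.days
    w = np.exp(-(age_days**2) / (2 * sigma_days**2))
    g = sales.assign(w=w, wp=w * sales["sale_price"]).groupby("building_id")
    return g["wp"].sum() / g["w"].sum()

# Restrict to the 5-year window, then attach each sale's building mean.
cutoff = sales["sale_date"].max() - pd.Timedelta(days=5 * 365)
window = sales[sales["sale_date"] >= cutoff].copy()
window["bldg_mean"] = window["building_id"].map(building_means(window))

# Step 2: residuals relative to the building-mean baseline.
window["residual"] = window["sale_price"] - window["bldg_mean"]

# Step 3: fit a model on the residuals using unit-level features only.
features = ["pct_ownership", "unit_sf", "beds", "baths"]
model = GradientBoostingRegressor().fit(window[features], window["residual"])

# Step 4: final unit value = building mean + predicted residual.
window["pred_value"] = window["bldg_mean"] + model.predict(window[features])
```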

dfsnow added the "method" (ML technique or method change) label Dec 19, 2024
ssaurbier commented Dec 20, 2024

I worry about a naive (e.g., time-decay), arbitrarily weighted rolling average. How do you plan on managing the weights? I would strongly encourage exponential smoothing, kernel-based weights, or something similar, so that we do not end up introducing extra error through an arbitrary weighting function.

The larger issue is that this assumes the building price (and other params) captures the variation, and that the residual variation is independent. I highly doubt that holds, given unit heterogeneity - which we know to be real, even though these models attempt to impose homogeneity. Building price looks like a confounding factor, i.e., overfitting on noise, and any building-level information should already be captured by other features. Perhaps a better approach would be a hierarchical model linking unit- and building-level features.

I also doubt we can assume linearity - this building-price-compression approach does not consider interactions between unit- and building-level params. E.g., price/sf may vary significantly between luxury and mid-tier buildings (demand elasticities vary both between and within defined "price bands").

It's worth mentioning that including the target property in its own building mean is almost certainly data leakage.

Missing data, e.g., for buildings without sales, can be responsibly imputed within best practices.

dfsnow (Member, Author) commented Dec 22, 2024

> I worry about a naive (e.g., time-decay), arbitrarily weighted rolling average. How do you plan on managing the weights? I would strongly encourage exponential smoothing, kernel-based weights, or something similar, so that we do not end up introducing extra error through an arbitrary weighting function.

We've used a Gaussian kernel in the past for time weighting and would likely do something similar here. If we want to parameterize it, we could use a first-pass CV loop that just determines some hyperparameter of the weighting function.
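
Roughly, the weighting would look like this (a sketch only; `sigma_days` is the bandwidth hyperparameter the CV loop would tune, and the CV helper name is hypothetical):

```python
import numpy as np

# Gaussian kernel over sale recency: w_i = exp(-dt_i^2 / (2 * sigma^2)).
# A larger sigma (bandwidth, in days) discounts older sales less aggressively.
def time_weights(days_before_ref: np.ndarray, sigma_days: float) -> np.ndarray:
    return np.exp(-(days_before_ref**2) / (2 * sigma_days**2))

# A first-pass CV loop could simply grid-search the bandwidth, e.g.:
# for sigma in (180, 365, 730):
#     score = evaluate_building_means(sigma)  # hypothetical CV helper
```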

> The larger issue is that this assumes the building price (and other params) captures the variation, and that the residual variation is independent. I highly doubt that holds, given unit heterogeneity - which we know to be real, even though these models attempt to impose homogeneity. Building price looks like a confounding factor, i.e., overfitting on noise, and any building-level information should already be captured by other features. Perhaps a better approach would be a hierarchical model linking unit- and building-level features.
>
> I also doubt we can assume linearity - this building-price-compression approach does not consider interactions between unit- and building-level params. E.g., price/sf may vary significantly between luxury and mid-tier buildings (demand elasticities vary both between and within defined "price bands").

Obviously we'd like a model that doesn't rely so much on building-level features/aggregation, but the truth is that the unit-level features are incomplete, new, noisy, and not very predictive. Up until this year our only complete unit-level feature was % of ownership, which is set via a declaration filed with the County and is often plainly wrong. We now have unit square footage and number of bedrooms, but that's it. We know nothing about the interior characteristics, quality, floor, direction, or amenities of condo units.

In the absence of good unit-level characteristics, using the weighted building mean is a decent first-pass approach. It's not all that different from a spatial lag model/feature, where you're exploiting the spatial structure of the observations (and their outcome) to learn something about the target outcome. It also (roughly) mirrors people's expectations, i.e., that units in the same building should have very similar prices. This approach obviously has trouble with very heterogeneous buildings, but so will basically any approach given the data that we have.

That said, I'd definitely be open to some sort of multi-level model and would love to test one out early next year. Open to suggestions re: the structure and specification of such a model.
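
For example, one possible starting specification, sketched with statsmodels and the hypothetical column names from the earlier sketch, is a random intercept per building:

```python
import statsmodels.formula.api as smf

# Random intercept per building; fixed effects for the unit-level features.
md = smf.mixedlm(
    "sale_price ~ pct_ownership + unit_sf + beds + baths",
    data=window,
    groups=window["building_id"],
)
fit = md.fit()
print(fit.summary())

# A random slope on unit_sf would let price/sf vary by building, which gets
# at the luxury vs. mid-tier interaction concern:
# smf.mixedlm("sale_price ~ unit_sf", window, groups=window["building_id"],
#             re_formula="~unit_sf")
```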

> It's worth mentioning that including the target property in its own building mean is almost certainly data leakage.

Using the leave-one-out mean would avoid data leakage, but it would result in no mean at all for a large number of smaller buildings. I'm open to suggestions here as well.
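
For illustration, a vectorized leave-one-out mean (reusing the hypothetical `window` frame from the sketch above; recency weighting omitted for brevity) makes the small-building problem explicit:

```python
import numpy as np

# Leave-one-out building mean: each sale's own price is excluded.
g = window.groupby("building_id")["sale_price"]
n = g.transform("count")
total = g.transform("sum")

# (sum - own price) / (n - 1); undefined when a building has a single sale.
window["loo_mean"] = np.where(
    n > 1, (total - window["sale_price"]) / (n - 1), np.nan
)
```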
