Test residuals model based on building means #73

Open
dfsnow opened this issue Dec 19, 2024 · 2 comments

dfsnow (Member) commented Dec 19, 2024

Per discussion with @Douglasmsw, we may be able to simplify the condo model a bit by:

  1. Constructing a 5-year rolling weighted average sale price per building, weighted by recency and inclusive of the target property. Also try the leave-one-out mean and possibly a (separate) spatial lag.
  2. Calculating the residuals using the results from 1 as the initial prediction/baseline.
  3. Fitting a model on the residuals, such that the model is essentially finding the difference between the building mean and each unit. Unit-level features could include pct. ownership, sf, and # beds/baths.
  4. Finalizing the unit value by adding the model prediction to the building mean.

This doesn't account for buildings without sales/means, but we'll leave that for a separate issue.
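
For concreteness, here is a minimal sketch of steps 1-4, assuming a pandas DataFrame `sales` with hypothetical columns `building_id`, `sale_price`, `sale_date` (datetime), `pct_ownership`, `unit_sf`, `beds`, and `baths`. The Gaussian recency weights and the gradient-boosting residual model are illustrative stand-ins, not settled choices:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Step 1: recency-weighted mean sale price per building (Gaussian kernel).
def building_means(sales: pd.DataFrame, sigma_days: float = 365.0) -> pd.Series:
    age_days = (sales["sale_date"].max() - sales["sale_date"]).dt.days
    w = np.exp(-(age_days**2) / (2 * sigma_days**2))
    g = sales.assign(w=w, wp=w * sales["sale_price"]).groupby("building_id")
    return g["wp"].sum() / g["w"].sum()

# Restrict to the 5-year window, then attach each sale's building mean.
cutoff = sales["sale_date"].max() - pd.Timedelta(days=5 * 365)
window = sales[sales["sale_date"] >= cutoff].copy()
window["bldg_mean"] = window["building_id"].map(building_means(window))

# Step 2: residuals relative to the building-mean baseline.
window["residual"] = window["sale_price"] - window["bldg_mean"]

# Step 3: fit a model on the residuals using unit-level features only.
features = ["pct_ownership", "unit_sf", "beds", "baths"]
model = GradientBoostingRegressor().fit(window[features], window["residual"])

# Step 4: final unit value = building mean + predicted residual.
window["pred_value"] = window["bldg_mean"] + model.predict(window[features])
```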

dfsnow added the "method" (ML technique or method change) label Dec 19, 2024
ssaurbier commented Dec 20, 2024

I worry about a naive (e.g., time-decay), arbitrarily weighted rolling average. How do you plan on managing the weights? I would strongly encourage exponential smoothing, kernel-based weights, or something similar, so that we do not end up introducing extra error through an arbitrary weighting function.

The larger issue is that this assumes the building price (and other params) captures the variation, and that the residual variation is independent. I highly doubt that holds, given unit heterogeneity - which we know to be real, even though these models attempt to impose homogeneity. Building price looks like a confounding factor, i.e., overfitting on noise, and any building-level information should already be captured by other features. Perhaps a better approach would be a hierarchical model linking unit- and building-level features.

I also doubt we can assume linearity - this building-price-compression approach does not consider interactions between unit- and building-level params. E.g., price/sf may vary significantly between luxury and mid-tier buildings (demand elasticities vary both between and within defined "price bands").

It's worth mentioning that including the target property in its own building mean is almost certainly data leakage.

Missing data, e.g., for buildings without sales, can be responsibly imputed within best practices.

dfsnow (Member, Author) commented Dec 22, 2024

> I worry about a naive (e.g., time-decay), arbitrarily weighted rolling average. How do you plan on managing the weights? I would strongly encourage exponential smoothing, kernel-based weights, or something similar, so that we do not end up introducing extra error through an arbitrary weighting function.

We've used a Gaussian kernel in the past for time weighting and would likely do something similar here. If we want to parameterize it, we could use a first-pass CV loop that just determines some hyperparameter of the weighting function.
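
Roughly, the weighting would look like this (a sketch only; `sigma_days` is the bandwidth hyperparameter the CV loop would tune, and the CV helper name is hypothetical):

```python
import numpy as np

# Gaussian kernel over sale recency: w_i = exp(-dt_i^2 / (2 * sigma^2)).
# A larger sigma (bandwidth, in days) discounts older sales less aggressively.
def time_weights(days_before_ref: np.ndarray, sigma_days: float) -> np.ndarray:
    return np.exp(-(days_before_ref**2) / (2 * sigma_days**2))

# A first-pass CV loop could simply grid-search the bandwidth, e.g.:
# for sigma in (180, 365, 730):
#     score = evaluate_building_means(sigma)  # hypothetical CV helper
```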

> The larger issue is that this assumes the building price (and other params) captures the variation, and that the residual variation is independent. I highly doubt that holds, given unit heterogeneity - which we know to be real, even though these models attempt to impose homogeneity. Building price looks like a confounding factor, i.e., overfitting on noise, and any building-level information should already be captured by other features. Perhaps a better approach would be a hierarchical model linking unit- and building-level features.
>
> I also doubt we can assume linearity - this building-price-compression approach does not consider interactions between unit- and building-level params. E.g., price/sf may vary significantly between luxury and mid-tier buildings (demand elasticities vary both between and within defined "price bands").

Obviously we'd like a model that doesn't rely so much on building-level features/aggregation, but the truth is that the unit-level features are incomplete, new, noisy, and not very predictive. Up until this year our only complete unit-level feature was % of ownership, which is set via a declaration filed with the County and is often plainly wrong. We now have unit square footage and number of bedrooms, but that's it. We know nothing about the interior characteristics, quality, floor, direction, or amenities of condo units.

In the absence of good unit-level characteristics, using the weighted building mean is a decent first-pass approach. It's not all that different from a spatial lag model/feature, where you're exploiting the spatial structure of the observations (and their outcome) to learn something about the target outcome. It also (roughly) mirrors people's expectations, i.e., that units in the same building should have very similar prices. This approach obviously has trouble with very heterogeneous buildings, but so will basically any approach given the data that we have.

That said, I'd definitely be open to some sort of multi-level model and would love to test one out early next year. Open to suggestions re: the structure and specification of such a model.
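
For example, one possible starting specification, sketched with statsmodels and the hypothetical column names from the earlier sketch, is a random intercept per building:

```python
import statsmodels.formula.api as smf

# Random intercept per building; fixed effects for the unit-level features.
md = smf.mixedlm(
    "sale_price ~ pct_ownership + unit_sf + beds + baths",
    data=window,
    groups=window["building_id"],
)
fit = md.fit()
print(fit.summary())

# A random slope on unit_sf would let price/sf vary by building, which gets
# at the luxury vs. mid-tier interaction concern:
# smf.mixedlm("sale_price ~ unit_sf", window, groups=window["building_id"],
#             re_formula="~unit_sf")
```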

> It's worth mentioning that including the target property in its own building mean is almost certainly data leakage.

Using the leave-one-out mean would avoid data leakage, but it would result in no mean at all for a large number of smaller buildings. I'm open to suggestions here as well.
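
For illustration, a vectorized leave-one-out mean (reusing the hypothetical `window` frame from the sketch above; recency weighting omitted for brevity) makes the small-building problem explicit:

```python
import numpy as np

# Leave-one-out building mean: each sale's own price is excluded.
g = window.groupby("building_id")["sale_price"]
n = g.transform("count")
total = g.transform("sum")

# (sum - own price) / (n - 1); undefined when a building has a single sale.
window["loo_mean"] = np.where(
    n > 1, (total - window["sale_price"]) / (n - 1), np.nan
)
```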
