Test residuals model based on building means #73
I worry about a naive (e.g. time decay), arbitrarily weighted rolling average. How do you plan on managing the weights? I would strongly encourage exponential smoothing or kernel-based weights, or something similar, so that we don't end up imputing more error through an arbitrary weighting function.

The larger issue is that this assumes building price (and the other params) captures the variation, and that the residual variation is independent. I highly doubt that holds, given unit heterogeneity, which we know to be true, even though these models attempt to impose homogeneity. Building price looks to be a confounding factor, i.e. overfitting on noise, and any building-level information is already captured in other features. Perhaps a better approach would be hierarchical modeling between unit- and building-level features.

I also doubt we can assume linearity: this building-price-compression approach does not consider interactions between unit- and building-level params. E.g., price/sf may vary significantly between luxury and mid-tier buildings (demand elasticities vary between and within defined "price bands").

It's worth mentioning that this is almost certainly data leakage. Missing data can be responsibly imputed within best practices.
We've used a Gaussian kernel in the past for time weighting and would likely do something similar here. If we want to parameterize it, we could use a first-pass CV loop that just determines some hyperparameter of the weighting function.
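For concreteness, here's a minimal sketch of the kind of Gaussian kernel time weighting described above. The function name, column names, and `bandwidth_days` are all hypothetical; the bandwidth would be the hyperparameter the first-pass CV loop tunes:

```python
import numpy as np
import pandas as pd

def gaussian_time_weights(
    sale_dates: pd.Series, ref_date: pd.Timestamp, bandwidth_days: float
) -> pd.Series:
    """Down-weight sales by their distance in time from a reference date.

    bandwidth_days acts as the kernel's standard deviation and is the
    natural hyperparameter for a first-pass CV loop to tune.
    """
    days_apart = (ref_date - sale_dates).dt.days.astype(float)
    return np.exp(-0.5 * (days_apart / bandwidth_days) ** 2)

# Hypothetical usage: weight each sale relative to the assessment date,
# then take the weighted building mean.
# w = gaussian_time_weights(df["sale_date"], pd.Timestamp("2023-01-01"), 365.0)
# building_mean = (df["sale_price"] * w).sum() / w.sum()
```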
Obviously we'd like a model that doesn't rely so much on building-level features/aggregation, but the truth is that the unit-level features are incomplete, new, noisy, and not very predictive. Up until this year our only complete unit-level feature was percentage of ownership, which is set via a declaration filed with the County and is often plainly wrong. We now have unit square footage and number of bedrooms, but that's it. We know nothing about the interior characteristics, quality, floor, direction, or amenities of condo units.

In the absence of good unit-level characteristics, using the weighted building mean is a decent first-pass approach. It's not all that different from a spatial lag model/feature, where you exploit the spatial structure of the observations (and their outcome) to determine something about the target outcome. It also (roughly) mirrors people's expectations, i.e. that units in the same building should have very similar prices. This approach obviously has trouble with very heterogeneous buildings, but so will basically any approach given the data that we have.

That said, I'd definitely be open to some sort of multi-level model and would love to test one out early next year. Open to suggestions re: the structure and specification of such a model.
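For what it's worth, here's a minimal sketch of one possible multi-level specification: a mixed-effects model with a building-level random intercept and a random slope on square footage. Everything here is a placeholder rather than a proposal for the actual spec; the column names, the toy data, and the choice of `statsmodels` are all assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy stand-in for the unit-level sales sample (all names hypothetical).
rng = np.random.default_rng(0)
n, n_bldg = 400, 25
df = pd.DataFrame({
    "building_id": rng.integers(0, n_bldg, n),
    "sqft_z": rng.normal(0, 1, n),        # standardized unit sqft
    "bedrooms": rng.integers(1, 4, n),
})
bldg_effect = rng.normal(0, 0.15, n_bldg)
df["log_sale_price"] = (
    12 + 0.20 * df["sqft_z"] + 0.05 * df["bedrooms"]
    + bldg_effect[df["building_id"]] + rng.normal(0, 0.10, n)
)

# The random intercept per building absorbs the shared building-level
# price signal (instead of a hand-built building-mean feature), and the
# random slope on sqft lets price/sqft vary by building, which speaks to
# the luxury vs. mid-tier interaction concern raised above.
fit = smf.mixedlm(
    "log_sale_price ~ sqft_z + bedrooms",
    data=df,
    groups=df["building_id"],
    re_formula="~sqft_z",
).fit()
print(fit.summary())
```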
Using the leave-one-out mean would avoid data leakage but would result in no mean at all for a large number of smaller buildings. But open to suggestions here as well.
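To make that coverage gap concrete, here's a sketch of the leave-one-out building mean (column names hypothetical). For a sale x in a building whose n sales sum to s, the LOO mean is (s - x) / (n - 1), so every single-sale building gets no mean at all:

```python
import pandas as pd

def loo_building_mean(df: pd.DataFrame) -> pd.Series:
    """Leave-one-out mean sale price within each building.

    Buildings with a single sale get NaN, which is the coverage gap
    described above.
    """
    g = df.groupby("building_id")["sale_price"]
    s = g.transform("sum")
    n = g.transform("count")
    return ((s - df["sale_price"]) / (n - 1)).where(n > 1)

# Hypothetical usage:
# df["loo_mean"] = loo_building_mean(df)
# df["loo_mean"].isna().mean()  # share of sales with no usable mean
```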
Per discussion with @Douglasmsw, we may be able to simplify the condo model a bit by:
This doesn't account for buildings without sales/means, but we'll leave that for a separate issue.