Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trip Segmentation Optimization #1014

Merged
merged 3 commits into from
Jan 31, 2025

Conversation

JGreenlee
Copy link
Contributor

No description provided.

@JGreenlee JGreenlee force-pushed the segmentation_optimization branch from 740c75d to 6243b40 Compare January 30, 2025 19:59
I noticed that in trip_segmentation, we query the user timeseries for background/filtered_location. Then we call segment_into_trips which performs the same query on the same timeseries with the same timequery.
Passing the loc_df as an argument eliminates the need for this query
restart_checking has 2 functions that are called repeatedly from both segmentation filter classes: `is_tracking_restarted_in_range` and `get_ongoing_motion_in_range`.
Both of these functions performed a DB query (for `statemachine/transition` and `background/motion_activity` entries, respectively)

In the case of the former, we already have all the `statemachine/transition` in the current processing time range kept in memory as a dataframe. This can be passed as an (optional) argument; we just have to then filter it down to the range of start_ts -> end_ts.

We do not have all the `background/motion_activity` kept as a dataframe but we can load them all at once and filter down later, just as we do with the`statemachine/transition`. This will use more memory but I think it is likely still more efficient than multiple queries because DB calls are a bottleneck on production.

I will investigate the effect of these changes further, but by my inital estimates, this drastically reduces the number of DB queries during trip segmentation (by a magnitude of ~100)
@JGreenlee JGreenlee force-pushed the segmentation_optimization branch from 6243b40 to 11fd09e Compare January 30, 2025 20:34
@JGreenlee JGreenlee marked this pull request as ready for review January 31, 2025 22:06
Copy link
Contributor

@shankari shankari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems fine; it is a pretty straightforward change in which we pre-load a bunch of values and then re-use them instead of loading them lazily.

I am fine with merging this so we can see the impact on the DB queries.

As part of cleanup, I would like to see:

  • what was justification for this change? In particular, what were the timing results from our sample programs (stage, ccebike, smart commute, stm-community) that led you to focus on these areas?
  • I see that we are querying for loc_df in trip_segmentation and passing it in the time/distance filters. And then we are querying for transition and motion activity inside segment_into_trips. Why? It seems like we can just read all input data from that time range and pass it into segment_into_trips at the same time, in three different dataframes. This will not have an impact on performance, but it is cleaner and easier to understand.

@shankari shankari merged commit c941e5d into e-mission:master Jan 31, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Tasks completed
Development

Successfully merging this pull request may close these issues.

2 participants