Trip Segmentation Optimization #1014

JGreenlee · 2025-01-30T19:26:36Z

No description provided.

I noticed that in trip_segmentation, we query the user timeseries for background/filtered_location. Then we call segment_into_trips which performs the same query on the same timeseries with the same timequery. Passing the loc_df as an argument eliminates the need for this query

restart_checking has 2 functions that are called repeatedly from both segmentation filter classes: `is_tracking_restarted_in_range` and `get_ongoing_motion_in_range`. Both of these functions performed a DB query (for `statemachine/transition` and `background/motion_activity` entries, respectively) In the case of the former, we already have all the `statemachine/transition` in the current processing time range kept in memory as a dataframe. This can be passed as an (optional) argument; we just have to then filter it down to the range of start_ts -> end_ts. We do not have all the `background/motion_activity` kept as a dataframe but we can load them all at once and filter down later, just as we do with the`statemachine/transition`. This will use more memory but I think it is likely still more efficient than multiple queries because DB calls are a bottleneck on production. I will investigate the effect of these changes further, but by my inital estimates, this drastically reduces the number of DB queries during trip segmentation (by a magnitude of ~100)

shankari

This change seems fine; it is a pretty straightforward change in which we pre-load a bunch of values and then re-use them instead of loading them lazily.

I am fine with merging this so we can see the impact on the DB queries.

As part of cleanup, I would like to see:

what was justification for this change? In particular, what were the timing results from our sample programs (stage, ccebike, smart commute, stm-community) that led you to focus on these areas?
I see that we are querying for loc_df in trip_segmentation and passing it in the time/distance filters. And then we are querying for transition and motion activity inside segment_into_trips. Why? It seems like we can just read all input data from that time range and pass it into segment_into_trips at the same time, in three different dataframes. This will not have an impact on performance, but it is cleaner and easier to understand.

JGreenlee force-pushed the segmentation_optimization branch from 740c75d to 6243b40 Compare January 30, 2025 19:59

JGreenlee added 3 commits January 30, 2025 15:33

update trip segmentation tests

11fd09e

JGreenlee force-pushed the segmentation_optimization branch from 6243b40 to 11fd09e Compare January 30, 2025 20:34

JGreenlee mentioned this pull request Jan 31, 2025

Pipeline Optimization Strategies e-mission/e-mission-docs#1105

Open

JGreenlee marked this pull request as ready for review January 31, 2025 22:06

shankari approved these changes Jan 31, 2025

View reviewed changes

shankari merged commit c941e5d into e-mission:master Jan 31, 2025
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trip Segmentation Optimization #1014

Trip Segmentation Optimization #1014

JGreenlee commented Jan 30, 2025

shankari left a comment

Trip Segmentation Optimization #1014

Trip Segmentation Optimization #1014

Conversation

JGreenlee commented Jan 30, 2025

shankari left a comment

Choose a reason for hiding this comment