Refactored retrieval of last API call timestamps to improve performance. #156

Open · wants to merge 2 commits into base: master
Conversation

TeachMeTW
Contributor

Description

This PR refactors the process for retrieving the last API call timestamps to significantly improve performance. The previous implementation relied on a complex aggregation pipeline; it has been replaced with a simpler, iterative approach that reads from the profile DB, following the same pattern as previous enhancements.

Benefits

  • Improved Performance: Reduces database load and speeds up the retrieval process.
  • Simplified Logic: Enhances maintainability and readability of the code.
  • Scalability: Handles larger datasets more efficiently.

Impact

The changes result in faster execution and lower computational costs, particularly for scenarios with large user datasets like openaccess.

Testing

  • Tested by loading a dataset and confirming that the Overview page's Active Users card populates
  • Verified behavior through logging checks

Points of concern

  • Do we have a way to manually mark a user as active? Based on the code (and git blame), the criterion is a last API call within a one-day delta. My naive approach would be to manually change that timestamp to validate (see the sketch below).
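
A minimal sketch of that manual change, assuming the profile document stores a `last_call_ts` field and using the `edb.get_profile_db()` accessor; the field name and update here are illustrative, not something the PR adds:

```python
import time
import uuid
import emission.core.get_database as edb

# Hypothetical: make a user look "active" by bumping the last API call
# timestamp in their profile document to the current time.
target_uuid = uuid.UUID("<uuid of the user to mark active>")
edb.get_profile_db().update_one(
    {"user_id": target_uuid},
    {"$set": {"last_call_ts": time.time()}})
```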

- Changed from using an aggregation pipeline on `get_timeseries_db` to a loop fetching data from `get_profile_db`.
- New approach iterates over `uuid_list`, fetching profile data and extracting `last_call_ts` for each user.
- Simplifies logic, avoids heavy aggregation, and reduces database load.

Results in a significant performance improvement.
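
For reference, a minimal sketch of what the iterative approach could look like, assuming the e-mission `edb.get_profile_db()` accessor and a `last_call_ts` field written to each profile during the pipeline run; the function name and exact fields are illustrative, not the PR diff itself:

```python
import emission.core.get_database as edb

def get_last_api_call_timestamps(uuid_list):
    # One lightweight profile lookup per user instead of a heavy
    # aggregation over the timeseries collection.
    last_calls = {}
    for user_uuid in uuid_list:
        profile = edb.get_profile_db().find_one({"user_id": user_uuid})
        last_calls[user_uuid] = profile.get("last_call_ts") if profile else None
    return last_calls
```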
@shankari
Contributor

shankari commented Jan 2, 2025

@TeachMeTW I would like some more technical detail in the benefits and impact. I am not a general audience 😄

Can you please clarify:

@TeachMeTW
Contributor Author

TeachMeTW commented Jan 2, 2025

@shankari Here's a breakdown of the clarification you requested:

  1. Which page loading time does this affect?
    This primarily affects the Overview page, specifically the Active Users card component.

  2. Is this related to the slow loading time on the Overview page (from Enhance _get_and_store_range Function to Include Trip Statistics and Last API Call Tracking e-mission-server#993 (comment))?
    Yes, it is. The update now leverages the last API call stored in the profile DB rather than running a MongoDB aggregation, which improves performance (since the processing was already done during the pipeline run). See the sketch after this list for the rough shape of the query that was replaced.

  3. What was the testing that you did with "Tested by loading a dataset and seeing the overview page's Active Users"?
    I tested this by:

    • Loading a dataset like openaccess.
    • Navigating to the Overview page.
    • Observing the loading time visually to estimate improvements.
      My next step is to load a specific UUID, update it, and verify that it's correctly classified as an active user. The current validation is pretty ad hoc.
  4. Was it just that the page loaded? Or was it that it was faster to load?
    The components load much faster now. Previously, the Active Users card either did not appear until fully loaded, or appeared grey with skeleton loading (if implemented as in the future branch). This fix ensures the component loads at about the same speed as the other components, improving the overall experience.
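
For contrast, the replaced implementation ran something roughly shaped like the aggregation below against the timeseries collection; the match key and field paths are assumptions for illustration, not the removed code:

```python
import emission.core.get_database as edb

# Scan the server API call stats for every user and keep the latest timestamp;
# this touches far more documents than a per-user profile lookup.
last_calls = list(edb.get_timeseries_db().aggregate([
    {"$match": {"metadata.key": "stats/server_api_time"}},
    {"$group": {"_id": "$user_id", "last_call_ts": {"$max": "$data.ts"}}},
]))
```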

@TeachMeTW
Contributor Author

TeachMeTW commented Jan 2, 2025

Old

Screen.Recording.2025-01-02.at.10.18.22.AM.mov

New

Screen.Recording.2025-01-02.at.10.19.27.AM.mov

Takeaways

As we can see, it is much faster. I still need to test my hypothesis above: that loading a specific UUID and updating it would classify it as an active user.

@shankari
Contributor

shankari commented Jan 2, 2025

@TeachMeTW did you look through the code to check if there are any other sites that are making direct DB calls, or calls to the timeseries for profile data? When we finish a particular performance improvement, I want to get it done fully and not over multiple weeks. Each improvement is already carefully scoped to be small and self-contained.

@TeachMeTW
Contributor Author

Hypothesis verified.

First, I ran `./e-mission-py.bash bin/debug/load_timeline_for_day_and_user.py emission/tests/data/real_examples/shankari_2016-07-25 dummy_test` to create a new user.

Next, I ran `./e-mission-py.bash bin/debug/intake_single_user.py -e dummy_test` to load it into the pipeline.

Lastly, I ran

import time
import uuid
import emission.tests.common as etc
# enac (the module providing store_server_api_time) is imported in the author's environment

test_call_ts = time.time()
testUUID = uuid.UUID("UUID of the dummy_test")

# simulate a server API call, then re-run the intake pipeline
enac.store_server_api_time(testUUID, "test_call_ts", test_call_ts, 69420)
etc.runIntakePipeline(testUUID)

to simulate an API call.

It does show as an active user:
[screenshot: test user shown as an active user]

@TeachMeTW
Contributor Author

TeachMeTW commented Jan 2, 2025

> @TeachMeTW did you look through the code to check if there are any other sites that are making direct DB calls, or calls to the timeseries for profile data? When we finish a particular performance improvement, I want to get it done fully and not over multiple weeks. Each improvement is already carefully scoped to be small and self-contained.

On the home page, I only found one Mongo query, for the active_users card, which this PR addresses. db_utils previously contained Mongo queries for the user stats on the data page, but those were replaced in an earlier PR. I also checked the other pages:

  • Data page: No Mongo queries found.
  • Map page: Looks good, no direct DB calls.
  • Push_notifs page: Also looks good, no issues.
  • Segment_trip_time page: No Mongo queries detected.

However, in db_utils, I noticed direct DB access in the query_segments_crossing_endpoints() function with the query:

not_excluded_uuid_query = {'user_id': {'$nin': [UUID(uuid) for uuid in excluded_uuids]}}

In db_utils there is also this chunk:

    # Vestigial code commented out and left below for future reference

    # logging.debug("Querying the UUID DB for %s -> %s" % (start_date,end_date))
    # query = {'update_ts': {'$exists': True}}
    # if start_date is not None:
    #     # have arrow create a datetime using start_date and time 00:00:00 in UTC
    #     start_time = arrow.get(start_date).datetime
    #     query['update_ts']['$gte'] = start_time
    # if end_date is not None:
    #     # have arrow create a datetime using end_date and time 23:59:59 in UTC
    #     end_time = arrow.get(end_date).replace(hour=23, minute=59, second=59).datetime
    #     query['update_ts']['$lt'] = end_time
    # projection = {
    #     '_id': 0,
    #     'user_id': '$uuid',
    #     'user_token': '$user_email',
    #     'update_ts': 1
    # }

    logging.debug("Querying the UUID DB for (no date range)")

    # This should actually use the profile DB instead of (or in addition to)
    # the UUID DB so that we can see the app version, os, manufacturer...
    # I will write a couple of functions to get all the users in a time range
    # (although we should define what that time range should be) and to merge
    # that with the profile data
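
The trailing comment describes helpers that do not exist yet; a rough sketch of what they might look like, with hypothetical names and assuming profiles carry `last_call_ts` and UUID DB entries carry `uuid`:

```python
import emission.core.get_database as edb

def get_users_in_range(start_ts, end_ts):
    # Hypothetical: users whose last API call falls within [start_ts, end_ts)
    return list(edb.get_profile_db().find(
        {"last_call_ts": {"$gte": start_ts, "$lt": end_ts}}))

def merge_with_uuid_info(profiles):
    # Hypothetical: attach UUID DB info (email, update_ts) to each profile
    uuid_entries = {e["uuid"]: e for e in edb.get_uuid_db().find()}
    return [{**uuid_entries.get(p["user_id"], {}), **p} for p in profiles]
```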
