Refactored retrieval of last API call timestamps to improve performance. #156

Open · wants to merge 2 commits into base: master
Conversation

TeachMeTW
Contributor

Description

This PR refactors the process for retrieving the last API call timestamps to significantly improve performance. The previous implementation relied on a complex aggregation pipeline; it has been replaced with a simpler, iterative approach that reads from the profile DB, following the same pattern as previous enhancements.

Benefits

  • Improved Performance: Reduces database load and speeds up the retrieval process.
  • Simplified Logic: Enhances maintainability and readability of the code.
  • Scalability: Handles larger datasets more efficiently.

Impact

The changes result in faster execution and lower computational costs, particularly for scenarios with large user datasets like openaccess.

Testing

  • Tested by loading a dataset and confirming that the Overview page's Active Users card populates
  • Verified behavior through logging checks

Points of concern

  • Do we have a way to manually mark a user as active? Based on the code (and git blame), the criterion is a last API call within a one-day delta. My naive approach would be to manually change that timestamp to validate (see the sketch below).
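
A minimal sketch of that manual change, assuming the profile document stores a `last_call_ts` field and using the `edb.get_profile_db()` accessor; the field name and update here are illustrative, not something the PR adds:

```python
import time
import uuid
import emission.core.get_database as edb

# Hypothetical: make a user look "active" by bumping the last API call
# timestamp in their profile document to the current time.
target_uuid = uuid.UUID("<uuid of the user to mark active>")
edb.get_profile_db().update_one(
    {"user_id": target_uuid},
    {"$set": {"last_call_ts": time.time()}})
```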

- Changed from using an aggregation pipeline on `get_timeseries_db` to a loop fetching data from `get_profile_db`.
- New approach iterates over `uuid_list`, fetching profile data and extracting `last_call_ts` for each user.
- Simplifies logic, avoids heavy aggregation, and reduces database load.

Results in a significant performance improvement.
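
For reference, a minimal sketch of what the iterative approach could look like, assuming the e-mission `edb.get_profile_db()` accessor and a `last_call_ts` field written to each profile during the pipeline run; the function name and exact fields are illustrative, not the PR diff itself:

```python
import emission.core.get_database as edb

def get_last_api_call_timestamps(uuid_list):
    # One lightweight profile lookup per user instead of a heavy
    # aggregation over the timeseries collection.
    last_calls = {}
    for user_uuid in uuid_list:
        profile = edb.get_profile_db().find_one({"user_id": user_uuid})
        last_calls[user_uuid] = profile.get("last_call_ts") if profile else None
    return last_calls
```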
@shankari
Contributor

shankari commented Jan 2, 2025

@TeachMeTW I would like some more technical detail in the benefits and impact. I am not a general audience 😄

Can you please clarify:

@TeachMeTW
Contributor Author

TeachMeTW commented Jan 2, 2025

@shankari Here's a breakdown of the clarification you requested:

  1. Which page loading time does this affect?
    This primarily affects the Overview page, specifically the Active Users card component.

  2. Is this related to the slow loading time on the Overview page (from Enhance _get_and_store_range Function to Include Trip Statistics and Last API Call Tracking e-mission-server#993 (comment))?
    Yes, it is. The update now leverages the last API call stored in the profile DB rather than running a MongoDB aggregation, which improves performance (since the processing was already done during the pipeline run). See the sketch after this list for the rough shape of the query that was replaced.

  3. What was the testing that you did with "Tested by loading a dataset and seeing the overview page's Active Users"?
    I tested this by:

    • Loading a dataset like openaccess.
    • Navigating to the Overview page.
    • Observing the loading time visually to estimate improvements.
      My next step is to load a specific UUID, update it, and verify that it's correctly classified as an active user. The current validation is pretty ad hoc.
  4. Was it just that the page loaded? Or was it that it was faster to load?
    The components load much faster now. Previously, the Active Users card either did not appear until fully loaded, or appeared grey with skeleton loading (if implemented as in the future branch). This fix ensures the component loads at about the same speed as the other components, improving the overall experience.
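
For contrast, the replaced implementation ran something roughly shaped like the aggregation below against the timeseries collection; the match key and field paths are assumptions for illustration, not the removed code:

```python
import emission.core.get_database as edb

# Scan the server API call stats for every user and keep the latest timestamp;
# this touches far more documents than a per-user profile lookup.
last_calls = list(edb.get_timeseries_db().aggregate([
    {"$match": {"metadata.key": "stats/server_api_time"}},
    {"$group": {"_id": "$user_id", "last_call_ts": {"$max": "$data.ts"}}},
]))
```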

@TeachMeTW
Contributor Author

TeachMeTW commented Jan 2, 2025

Old

Screen.Recording.2025-01-02.at.10.18.22.AM.mov

New

Screen.Recording.2025-01-02.at.10.19.27.AM.mov

Takeaways

As we can see, it is much faster. I still need to test my hypothesis above: that loading a specific UUID and updating it would classify it as an active user.

@shankari
Contributor

shankari commented Jan 2, 2025

@TeachMeTW did you look through the code to check if there are any other sites that are making direct DB calls, or calls to the timeseries for profile data? When we finish a particular performance improvement, I want to get it done fully and not over multiple weeks. Each improvement is already carefully scoped to be small and self-contained.

@TeachMeTW
Contributor Author

Hypothesis verified.

First, I ran `./e-mission-py.bash bin/debug/load_timeline_for_day_and_user.py emission/tests/data/real_examples/shankari_2016-07-25 dummy_test` to create a new user.

Next, I ran `./e-mission-py.bash bin/debug/intake_single_user.py -e dummy_test` to load it into the pipeline.

Lastly, I ran

import time
import uuid
import emission.tests.common as etc
# enac (the module providing store_server_api_time) is imported in the author's environment

test_call_ts = time.time()
testUUID = uuid.UUID("UUID of the dummy_test")

# simulate a server API call, then re-run the intake pipeline
enac.store_server_api_time(testUUID, "test_call_ts", test_call_ts, 69420)
etc.runIntakePipeline(testUUID)

to simulate an API call.

It does show as an active user:
[screenshot: test user shown as an active user]

@TeachMeTW
Contributor Author

TeachMeTW commented Jan 2, 2025

> @TeachMeTW did you look through the code to check if there are any other sites that are making direct DB calls, or calls to the timeseries for profile data? When we finish a particular performance improvement, I want to get it done fully and not over multiple weeks. Each improvement is already carefully scoped to be small and self-contained.

On the home page, I only found one Mongo query, for the active_users card, which this PR addresses. db_utils previously contained Mongo queries for the user stats on the data page, but those were replaced in an earlier PR. I also checked the other pages:

  • Data page: No Mongo queries found.
  • Map page: Looks good, no direct DB calls.
  • Push_notifs page: Also looks good, no issues.
  • Segment_trip_time page: No Mongo queries detected.

However, in db_utils, I noticed direct DB access in the query_segments_crossing_endpoints() function with the query:

not_excluded_uuid_query = {'user_id': {'$nin': [UUID(uuid) for uuid in excluded_uuids]}}

In db_utils there is also this chunk:

    # Vestigial code commented out and left below for future reference

    # logging.debug("Querying the UUID DB for %s -> %s" % (start_date,end_date))
    # query = {'update_ts': {'$exists': True}}
    # if start_date is not None:
    #     # have arrow create a datetime using start_date and time 00:00:00 in UTC
    #     start_time = arrow.get(start_date).datetime
    #     query['update_ts']['$gte'] = start_time
    # if end_date is not None:
    #     # have arrow create a datetime using end_date and time 23:59:59 in UTC
    #     end_time = arrow.get(end_date).replace(hour=23, minute=59, second=59).datetime
    #     query['update_ts']['$lt'] = end_time
    # projection = {
    #     '_id': 0,
    #     'user_id': '$uuid',
    #     'user_token': '$user_email',
    #     'update_ts': 1
    # }

    logging.debug("Querying the UUID DB for (no date range)")

    # This should actually use the profile DB instead of (or in addition to)
    # the UUID DB so that we can see the app version, os, manufacturer...
    # I will write a couple of functions to get all the users in a time range
    # (although we should define what that time range should be) and to merge
    # that with the profile data
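
The trailing comment describes helpers that do not exist yet; a rough sketch of what they might look like, with hypothetical names and assuming profiles carry `last_call_ts` and UUID DB entries carry `uuid`:

```python
import emission.core.get_database as edb

def get_users_in_range(start_ts, end_ts):
    # Hypothetical: users whose last API call falls within [start_ts, end_ts)
    return list(edb.get_profile_db().find(
        {"last_call_ts": {"$gte": start_ts, "$lt": end_ts}}))

def merge_with_uuid_info(profiles):
    # Hypothetical: attach UUID DB info (email, update_ts) to each profile
    uuid_entries = {e["uuid"]: e for e in edb.get_uuid_db().find()}
    return [{**uuid_entries.get(p["user_id"], {}), **p} for p in profiles]
```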
