Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: MD-collected data vs community-collected data #68

Open
YoniSchirris opened this issue Apr 6, 2020 · 22 comments
Open

Discussion: MD-collected data vs community-collected data #68

YoniSchirris opened this issue Apr 6, 2020 · 22 comments
Assignees

Comments

@YoniSchirris
Copy link

An issue that popped up since our repo is public, that anyone with the APK can upload data to the DB while putting in fake ground-truth values.

Currently, we have no way of controlling for this.

How can we have a sense of "authentication" without actual authentication?

Ideas:

  • Create a private APK for the version that we send to MDs
  • This private APK can write to a separate DB, or...
  • This private APK can add a flag in one of the JSON files that states that this is a "verified" measurement, meaning that when downloading / processing we can filter bij this flag
@MalcolmMielle
Copy link

MalcolmMielle commented Apr 6, 2020

I think it will be important later to be able to separate the data collected from the MD from other data if we end up releasing th ecollection app to more people.
I'm not sure which strategy would be best for this.

@haggy
Copy link
Collaborator

haggy commented Apr 6, 2020

I've been thinking about this for an hour or so and I have a couple of questions.

  1. How does the access pattern differ between MD and community sourced data? For example Are you going to want to do things like consume the two datasets completely independent of each other (to train independent models, for example)?
  2. What does the onboarding process for clinical professionals look like?

I think the harder question to answer is "How to we restrict app packages based on who's using them?". If we're on-boarding clinical professionals manually, we can just have an obfuscated link for them (a link with ID/Hash values that are hard or impossible to just guess) and a "regular" link for anyone else (testers, random users that stumble on it, etc). We would then only share the obfuscated link with the pros.

The actual layout of the data in S3 is dependent on the answer to question 1 but we have a lot of flexibility. Without knowing the answer to question 1 I see two possible ways to isolate the incoming data streams:

A multi-bucket strategy:

covital-videos-pro-staging/year=<year>/month=<month>/day=<day>/survey-id=<id>/...
covital-videos-patient-staging/year=<year>/month=<month>/day=<day>/survey-id=<id>/...

Or a single bucket strategy:

covital-videos-staging/pro/year=<year>/month=<month>/day=<day>/survey-id=<id>/...
covital-videos-staging/patient/year=<year>/month=<month>/day=<day>/survey-id=<id>/...

I personally think a single bucket strategy is going to reduce complexity and increase flexibility so again without knowing the answer to question 1 I'd suggest that approach.

@MalcolmMielle
Copy link

  1. We might want to train independent models. I think what we might end up doing, for example, is train on mixed dataset but then run the test only against the "pro" dataset to be sure we get our results on valid data.
  2. We will distribute the apk to the MD by making available through a download link (the apk will be on a google drive).

Can we not simply request a signedURL for PRO and a different signedURL for others? for example in the openapi, we could simply send a bool true/false depending on who is using it and then use the single bucket strategy?

@haggy
Copy link
Collaborator

haggy commented Apr 7, 2020

Can we not simply request a signedURL for PRO and a different signedURL for others

Yes at the API level that's fine but you still need to know at the app level what flag(s) to send on the API call to let the backend know what kind of user is inputting the data. I can easily add a role property to the API that defines something like clinical and patient but you're going to need an APK per "role" so that the app the MDs install know to pass a role of clinical instead of patient. Does that make sense?

@kaeawc
Copy link

kaeawc commented Apr 7, 2020

We can definitely create and distribute multiple APKs with different versions and signatures. That's straightforward.

Another option could be that MDs must authenticate and other users are not required to. From talking with CJ it sounded like we were going with Auth0 to handle authentication - at that point we can ask MDs to contact us for validation before they're approved.

One of the questions that was raised was about API secrets - we can hide credentials, secrets, etc in things like Vault while still being completely open source. That's just one option, and maybe should be a separate discussion from this thread.

@haggy
Copy link
Collaborator

haggy commented Apr 7, 2020

@kaeawc Authn is part of the "phase 2" (user-facing app) initiatives. There is additional work involved to get the authn flow right between all the apps and backend so it was decided that the data collection app would be "auth-less". Speed to launch (for the collection app) is critical because product is worried that doctors could soon be too overwhelmed to collect data using the app (which is understandable).

I think that having multiple APK's, one specifically for clinical pros and another for everyone else is good enough for the phase 1 collection app. If that sounds good then I can add a new property to the signed-upload API that allows the app to direct the signed uploads appropriately. Here is what Im thinking:

{
  "role": "clinical | patient"
}

Thoughts?

@kaeawc
Copy link

kaeawc commented Apr 7, 2020

Ahk, gotcha. In that case the variants idea seems like the best option.

@MalcolmMielle
Copy link

Yeah the auth version sounds great but the variant is much more straight forward so let's go with that.

If we see a real use in having built that dataset and that the MD are really keen to help maybe we can have a v2 where the MDs have to auth and verify with us so that they can participate in building the dataset. But that would be for later and if the project worked.

Question: is it possible to "kill" the signed url upload for the app later? Let's say we want to collect data only for a certain time using this app, can we make it so it won't upload anymore when we decide we have enough data?

@haggy
Copy link
Collaborator

haggy commented Apr 7, 2020

@MalcolmMielle we can disable the API(s) in various ways so that's not a problem.

OK so to summarize:

  1. We will build/package multiple APKs and distribute them based on who is using them (one for clinical pros, another for regular users/patients)
  2. The signed upload API will include a new field called role that the app will set depending on which build it is. Valid values are clinical and patient.

Do we all agree on that method?

@YoniSchirris Does this answer your question?

@MalcolmMielle
Copy link

Yes that sounds good! Although maye the role values should be "clinical" and "community" to disctinguish the two cases ?

@haggy
Copy link
Collaborator

haggy commented Apr 7, 2020

Sure works for me 😄

@YoniSchirris
Copy link
Author

Just to restate Malcolm's comment: Patients won't use this app. This is all for the data collection application. However, the data collected can be done by professional MDs, or by community members :)

The above sounds good. Thanks @haggy

@MalcolmMielle
Copy link

@haggy Is the role in the API already :)? I've pulled the latest version but I may be missing it from the doc.

@haggy
Copy link
Collaborator

haggy commented Apr 10, 2020

@MalcolmMielle Not yet. Also Im going to change the property name from role to source. It fits its purpose much better. role also could conflict with future authorization functionality.

Property name: source
Values: community | clinical

I'm going to add that in tonight and I'll let you know when it's done.

@haggy
Copy link
Collaborator

haggy commented Apr 10, 2020

image

New key schema ^^. Local tests are going well so Im going to get this pushed to the sandbox.

@haggy
Copy link
Collaborator

haggy commented Apr 11, 2020

FYI @MalcolmMielle Im having some weird issues with the build pipeline. Working through those now.

@MalcolmMielle
Copy link

Ok, should I hold off updating the app yet? I'll wait until you tell me it's ok to update the upload func :). Thanks for the great job

@haggy
Copy link
Collaborator

haggy commented Apr 11, 2020

Yea hold off for now. Waiting for GH to fix things on their side. Will post here when released.

I'm assuming the app is built against staging?

@MalcolmMielle
Copy link

I usually pull master, compile the dart library and push it to this repo to use in the app.

@haggy
Copy link
Collaborator

haggy commented Apr 11, 2020

@MalcolmMielle GH has fixed the issues and CI is back online. Im testing on Sandbox and will deploy to staging soon.

@haggy
Copy link
Collaborator

haggy commented Apr 11, 2020

@MalcolmMielle @YoniSchirris Released to staging!
https://guarded-crag-28391.herokuapp.com/api-docs/#/default/batch-signed-upload-req

Please test and let me know if you run into any issues :D

@MalcolmMielle
Copy link

Will do either tonight or tomorrow! :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants