[SYNPY-1322] Object Orientated Programming Interfaces #1013

Merged
BryanFauble merged 34 commits into develop from SYNPY-1322-OOP-POC on Jan 22, 2024

@BryanFauble BryanFauble commented Nov 16, 2023

Background:
Working with the Synapse python client is a mix of many competing technologies, frameworks, and documentation.

Frameworks used:

  1. dataclasses: https://docs.python.org/3.8/library/dataclasses.html
  2. asyncio: https://docs.python.org/3.8/library/asyncio-dev.html
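As a rough illustration of how the two frameworks listed above can fit together, here is a minimal sketch of a dataclass model with an async method. The `Project` class, its `store` method, and the placeholder ID are hypothetical stand-ins and are not taken from the PR's code.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Project:
    """Hypothetical model: plain data fields plus async behavior."""

    name: str
    id: Optional[str] = None

    async def store(self) -> "Project":
        # Stand-in for an awaitable REST call. A real client would send the
        # request here and copy the response back onto the instance.
        await asyncio.sleep(0)
        self.id = self.id or "syn123"  # placeholder ID, not a real entity
        return self


async def main() -> Project:
    return await Project(name="My project").store()


project = asyncio.run(main())
```

The dataclass supplies the constructor, `repr`, and equality, so the class body only needs the fields and the awaitable operations.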

Performance insights:
Some initial testing with the benchmark script here: https://github.com/Sage-Bionetworks/synapsePythonClient/blob/develop/docs/scripts/benchmark.py, using these new classes:

| Test | Synapseutils Sync | os.walk + syn.store | S3 Sync | ASYNC/OOP | Per file size |
| --- | --- | --- | --- | --- | --- |
| 25 Files, 1MB total size | 10.43s | 8.99s | 1.83s | 5.45s | 40KB |
| 775 Files, 10MB total size | 243.57s | 257.27s | 7.64s | 123.99s | 12.9KB |
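One plausible reason the ASYNC/OOP column beats the two sequential client approaches is that many small uploads can overlap while each waits on the network. The sketch below shows that pattern with `asyncio.gather` and a semaphore. The `upload` coroutine and the concurrency limit are assumptions for illustration, not the benchmark's actual code.

```python
import asyncio
from typing import List


async def upload(path: str, limit: asyncio.Semaphore) -> str:
    """Hypothetical upload; the sleep stands in for network latency."""
    async with limit:
        await asyncio.sleep(0.01)
        return path


async def upload_all(paths: List[str], max_concurrency: int = 10) -> List[str]:
    # The semaphore caps how many uploads are in flight at once.
    limit = asyncio.Semaphore(max_concurrency)
    # gather schedules every upload and returns results in input order.
    return await asyncio.gather(*(upload(p, limit) for p in paths))


paths = [f"file_{i}.txt" for i in range(25)]
uploaded = asyncio.run(upload_all(paths))
```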

Examples of how this OOP approach works:
See all of the scripts created in the test_scripts/ folder, for example: https://github.com/Sage-Bionetworks/synapsePythonClient/blob/SYNPY-1322-OOP-POC/docs/scripts/object_orientated_programming_poc/oop_poc_project.py

Things left to do/investigate (Will be completed in follow-up epics):

  1. Would swapping away from the requests library (to something like httpx: https://www.python-httpx.org/advanced/ ) lead to better performance and async programming? Answer: Yes. https://sagebionetworks.jira.com/browse/SYNPY-1411
  2. All of the todos: https://sagebionetworks.jira.com/browse/SYNPY-1343
  3. Any other domain models that we want to implement: https://sagebionetworks.jira.com/browse/SYNPY-1343

pep8speaks commented Nov 16, 2023

Hello @BryanFauble! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 26:89: E501 line too long (110 > 88 characters)
Line 27:89: E501 line too long (106 > 88 characters)
Line 28:89: E501 line too long (107 > 88 characters)
Line 29:89: E501 line too long (100 > 88 characters)

Line 346:89: E501 line too long (101 > 88 characters)
Line 490:89: E501 line too long (109 > 88 characters)
Line 491:89: E501 line too long (112 > 88 characters)
Line 1042:89: E501 line too long (111 > 88 characters)
Line 1043:89: E501 line too long (106 > 88 characters)
Line 1513:89: E501 line too long (102 > 88 characters)
Line 1911:89: E501 line too long (111 > 88 characters)
Line 1912:89: E501 line too long (106 > 88 characters)
Line 1935:89: E501 line too long (92 > 88 characters)
Line 2334:89: E501 line too long (111 > 88 characters)
Line 2335:89: E501 line too long (106 > 88 characters)
Line 4987:89: E501 line too long (111 > 88 characters)
Line 4988:89: E501 line too long (106 > 88 characters)
Line 5852:89: E501 line too long (111 > 88 characters)
Line 5853:89: E501 line too long (106 > 88 characters)

Line 18:89: E501 line too long (109 > 88 characters)
Line 64:89: E501 line too long (92 > 88 characters)

Line 26:89: E501 line too long (92 > 88 characters)
Line 29:89: E501 line too long (92 > 88 characters)
Line 34:89: E501 line too long (90 > 88 characters)
Line 35:89: E501 line too long (97 > 88 characters)
Line 41:89: E501 line too long (101 > 88 characters)
Line 42:89: E501 line too long (91 > 88 characters)
Line 47:89: E501 line too long (103 > 88 characters)
Line 49:89: E501 line too long (96 > 88 characters)
Line 51:89: E501 line too long (94 > 88 characters)
Line 68:89: E501 line too long (93 > 88 characters)
Line 77:89: E501 line too long (92 > 88 characters)
Line 172:89: E501 line too long (110 > 88 characters)
Line 180:89: E501 line too long (96 > 88 characters)
Line 218:89: E501 line too long (114 > 88 characters)
Line 230:89: E501 line too long (110 > 88 characters)
Line 255:89: E501 line too long (110 > 88 characters)

Line 25:89: E501 line too long (94 > 88 characters)
Line 26:89: E501 line too long (93 > 88 characters)
Line 30:89: E501 line too long (97 > 88 characters)
Line 31:89: E501 line too long (97 > 88 characters)
Line 39:89: E501 line too long (96 > 88 characters)
Line 41:89: E501 line too long (94 > 88 characters)
Line 109:89: E501 line too long (91 > 88 characters)
Line 110:89: E501 line too long (89 > 88 characters)
Line 111:89: E501 line too long (96 > 88 characters)
Line 130:89: E501 line too long (90 > 88 characters)
Line 142:89: E501 line too long (110 > 88 characters)
Line 148:89: E501 line too long (96 > 88 characters)
Line 221:89: E501 line too long (110 > 88 characters)
Line 278:89: E501 line too long (110 > 88 characters)

Line 25:89: E501 line too long (95 > 88 characters)
Line 26:89: E501 line too long (97 > 88 characters)
Line 30:89: E501 line too long (94 > 88 characters)
Line 31:89: E501 line too long (101 > 88 characters)
Line 39:89: E501 line too long (96 > 88 characters)
Line 40:89: E501 line too long (95 > 88 characters)
Line 41:89: E501 line too long (106 > 88 characters)
Line 95:89: E501 line too long (92 > 88 characters)
Line 164:89: E501 line too long (110 > 88 characters)
Line 238:89: E501 line too long (110 > 88 characters)
Line 295:89: E501 line too long (110 > 88 characters)

Line 56:89: E501 line too long (89 > 88 characters)
Line 62:89: E501 line too long (89 > 88 characters)
Line 66:89: E501 line too long (89 > 88 characters)
Line 176:89: E501 line too long (103 > 88 characters)
Line 307:89: E501 line too long (113 > 88 characters)
Line 311:89: E501 line too long (96 > 88 characters)
Line 341:89: E501 line too long (89 > 88 characters)
Line 347:89: E501 line too long (90 > 88 characters)
Line 348:89: E501 line too long (97 > 88 characters)
Line 351:89: E501 line too long (95 > 88 characters)
Line 357:89: E501 line too long (98 > 88 characters)
Line 360:89: E501 line too long (95 > 88 characters)
Line 362:89: E501 line too long (94 > 88 characters)
Line 388:89: E501 line too long (92 > 88 characters)
Line 476:89: E501 line too long (110 > 88 characters)
Line 502:89: E501 line too long (110 > 88 characters)
Line 527:89: E501 line too long (110 > 88 characters)
Line 535:89: E501 line too long (91 > 88 characters)
Line 536:89: E501 line too long (92 > 88 characters)
Line 538:89: E501 line too long (89 > 88 characters)
Line 594:89: E501 line too long (90 > 88 characters)
Line 609:89: E501 line too long (110 > 88 characters)
Line 614:89: E501 line too long (105 > 88 characters)
Line 627:89: E501 line too long (90 > 88 characters)
Line 633:89: E501 line too long (110 > 88 characters)
Line 660:89: E501 line too long (110 > 88 characters)
Line 669:89: E501 line too long (118 > 88 characters)

Line 505:89: E501 line too long (93 > 88 characters)

Comment last updated at 2024-01-22 20:36:28 UTC

"""
annotations_dict = asdict(annotations)

# TODO: Is there a more elegant way to handle this - This is essentially being used
Contributor

@BWMac BWMac Nov 21, 2023

There is another dataclass library that adds some more functionality. Its asdict function has an exclude argument that you can pass to leave out class attributes you don't want in your dictionary:

from dataclasses import dataclass
from dataclass_wizard.dumpers import asdict

@dataclass
class Foo:
    bar: str
    baz: int

foo = Foo('bar', 42)
print(asdict(foo, exclude=['baz']))

> {'bar': 'bar'}

You'd still be hard-coding the excluded values though. Not sure of any other ways aside from implementing an asdict method specific to the class that excludes attributes not used in the API.

Edit: if you used a base class you could potentially implement an asdict function that could be reused across all extending classes
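The base-class idea in the edit above could look roughly like this, using only the standard library. `ApiDictMixin`, `to_api_dict`, and `_exclude_from_api` are hypothetical names chosen for this sketch, and the fields on `Annotations` are simplified.

```python
from dataclasses import asdict, dataclass
from typing import Any, ClassVar, Dict, FrozenSet


class ApiDictMixin:
    """Hypothetical mixin: subclasses list the fields the API should not see."""

    _exclude_from_api: ClassVar[FrozenSet[str]] = frozenset()

    def to_api_dict(self) -> Dict[str, Any]:
        # asdict works because every concrete subclass is a dataclass.
        return {
            key: value
            for key, value in asdict(self).items()
            if key not in self._exclude_from_api
        }


@dataclass
class Annotations(ApiDictMixin):
    id: str
    etag: str
    is_loaded: bool = False

    # ClassVar annotations are ignored by @dataclass, so this is not a field.
    _exclude_from_api: ClassVar[FrozenSet[str]] = frozenset({"is_loaded"})


payload = Annotations(id="syn123", etag="abc").to_api_dict()
```

The excluded names are still hard-coded, but only once per class rather than at every call site.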

Contributor Author

Good idea, but I'm starting to think in the other direction now: specify only the things I want to include, and have a dataclass -> JSON/dict mapping process that formats the input the way the REST API expects.
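A minimal sketch of that include-only direction, assuming a simplified request shape: the `to_request_body` method and the keys it emits are illustrative and do not reflect the real Synapse annotations payload.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Annotations:
    id: str
    etag: str
    annotations: Dict[str, List[Any]] = field(default_factory=dict)
    is_loaded: bool = False  # client-side state, never sent

    def to_request_body(self) -> Dict[str, Any]:
        # Only keys listed here reach the API, so a newly added field
        # stays out of the request until someone opts it in.
        return {
            "id": self.id,
            "etag": self.etag,
            "annotations": self.annotations,
        }


body = Annotations(
    id="syn123", etag="abc", annotations={"species": ["human"]}
).to_request_body()
```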

filtered_dict = {k: v for k, v in annotations_dict.items() if k != "is_loaded"}

# TODO: This `restPUT` returns back a dict (or string) - Could we use:
# TODO: https://github.com/konradhalas/dacite to convert the dict to an object?
Contributor

@BWMac BWMac Nov 21, 2023

I do like the look of dacite, but is that more efficient than just wrapping the response in the class object like we have done elsewhere?

Contributor Author

I'm not sure yet - I'll have to give it some more thought.

One of the things we will need to do is at least have a thin translation layer between the REST API and the dataclasses, because the names we are giving them in the Python client are snake_case, whereas the REST API is all camelCase.
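Such a thin translation layer might be sketched as a pair of explicit mapping methods on the dataclass. The `File` fields and the `from_api`/`to_api` names are assumptions for illustration. Only the camelCase keys `parentId` and `dataFileHandleId` are meant to mirror the REST naming style.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class File:
    id: Optional[str] = None
    parent_id: Optional[str] = None
    data_file_handle_id: Optional[str] = None

    @classmethod
    def from_api(cls, response: Dict[str, Any]) -> "File":
        # Explicit camelCase -> snake_case mapping. Missing keys become None
        # and unknown keys in the response are ignored.
        return cls(
            id=response.get("id"),
            parent_id=response.get("parentId"),
            data_file_handle_id=response.get("dataFileHandleId"),
        )

    def to_api(self) -> Dict[str, Any]:
        # The reverse mapping used when building a request body.
        return {
            "id": self.id,
            "parentId": self.parent_id,
            "dataFileHandleId": self.data_file_handle_id,
        }


response = {"id": "syn1", "parentId": "syn2", "dataFileHandleId": "99", "etag": "x"}
file = File.from_api(response)
```

Writing the mapping by hand is more verbose than a generic converter, but it keeps the response wrapping and the naming translation in one visible place.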

Contributor

BWMac commented Nov 21, 2023

@BryanFauble Have you thought about having some sort of BaseClass that these new classes could extend? It seems like many have a lot of attributes in common and some methods (at least the names) too.

@BryanFauble
Contributor Author

> @BryanFauble Have you thought about having some sort of BaseClass that these new classes could extend? It seems like many have a lot of attributes in common and some methods (at least the names) too.

@BWMac Yes, but inheritance is an easy way to end up with large refactors down the line for simple changes, especially when a base class is shared by many child classes. IMO the easiest way I've seen to deal with it is to replicate things where they are needed. I understand this adds some duplicated code, but the benefits of duplication far outweigh the negatives.

@thomasyu888
Member

@BryanFauble , @BWMac

I definitely agree that we should use inheritance carefully. In the case of Entities, technically FileEntities do derive from Entities, so it could make sense to use inheritance there to avoid duplicating code across 7 entities: https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/Entity.html and having to update the code 7 separate times if something changes. Each implementation comes with its pros/cons.

It definitely makes sense to use MixIns/composition for different functionality for entities like

  • Versionable
  • set ACL
  • etc
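The mixin approach for the functionality listed above could be sketched as follows. `VersionableMixin`, `AccessControlMixin`, and their methods are hypothetical names, and the ACL helper only builds a dictionary instead of calling the API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


class VersionableMixin:
    """Hypothetical behavior for entities that carry a version number."""

    version_number: Optional[int]  # annotation only; the dataclass owns the field

    def next_version_number(self) -> int:
        return (self.version_number or 0) + 1


class AccessControlMixin:
    """Hypothetical ACL helper; a real one would call the REST API."""

    id: Optional[str]

    def describe_acl_request(
        self, principal_id: int, access_types: List[str]
    ) -> Dict[str, Any]:
        return {"id": self.id, "principalId": principal_id, "accessType": access_types}


@dataclass
class File(VersionableMixin, AccessControlMixin):
    id: Optional[str] = None
    version_number: Optional[int] = None


@dataclass
class Project(AccessControlMixin):
    # Projects opt in to ACL handling only; they are not versionable here.
    id: Optional[str] = None


next_version = File(id="syn1", version_number=2).next_version_number()
acl = Project(id="syn9").describe_acl_request(1, ["READ"])
```

Each model picks only the capabilities that apply to it, which sidesteps a single base class that has to fit every entity.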

@BryanFauble
Contributor Author

@thomasyu888 @BWMac

> In the case of Entities, technically FileEntities do derive from Entities, so it could make sense for that to use inheritance to avoid the duplication of code across 7 entities: https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/Entity.html and having to update the code 7 separate times if something changes.

True - But also a recent example I came across is that a Project inherits an Entity, but parentId doesn't mean anything for a Project since it's treated as the highest level container of stuff. These are the kind of exceptions to the rule that make inheritance hard.

I do like composition though to contain common functionality.

Excellent discussion, let's continue this further when I get more models in place and we can combine as we see fit when we find overlapping functionality.

Contributor

@BWMac BWMac left a comment

Overall I found the example scripts to do a good job of demonstrating the power of the OOP changes. As a user I would much prefer this way when using the client within other python packages that I work on.

test_scripts/oop_poc_file.py (outdated, resolved)
test_scripts/oop_poc_file.py (outdated, resolved)
test_scripts/oop_poc_folder.py (outdated, resolved)
test_scripts/oop_poc_project.py (outdated, resolved)
# Querying for data from a table =====================================================
destination_csv_location = os.path.expanduser("~/temp/my_query_results")

await Table.query(
Contributor

Something about using a method from the Table class outside the context of an instance of that class, but then feeding it an attribute of the instance copy_of_table, rubs me the wrong way.

Not sure how else it should be done though since you have to pass the query string as an argument to the query function somehow.

Contributor Author

Yeah, this is one of the things that I was struggling to figure out the best practice for. One thing to clarify is that we are not passing an instance of copy_of_table; in this case I am just grabbing the ID of the table. This is equivalent to these lines:

    table_id_to_query = copy_of_table.id
    await Table.query(
        query=f"SELECT * FROM {table_id_to_query}",
        result_format=CsvResultFormat(download_location=destination_csv_location),
    )

Contributor

@BWMac BWMac Dec 1, 2023

Right, not an instance itself, but an attribute of that instance.

It would feel better if you could do something like

copy_of_table.query(query_args)

So that it reads like you are querying the table you have already. But formatting the query string could be weird and it's not like we don't want to let people do Table.query to query a table already in Synapse that they don't care to read into a Python object.
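One way to offer both call styles is to keep the class-level `query` and add an instance helper that fills in the table's own ID. In this sketch `query_self` and its parameters are hypothetical, and `query` is synchronous and simply echoes the SQL for brevity, whereas the PR's `Table.query` is awaited.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Table:
    id: Optional[str] = None

    @staticmethod
    def query(query: str) -> str:
        # Stand-in for the real call: the query string is echoed back so the
        # example can run without a Synapse connection.
        return query

    def query_self(self, columns: str = "*", where: Optional[str] = None) -> str:
        """Hypothetical instance helper that plugs in this table's ID."""
        if self.id is None:
            raise ValueError("Table has no ID yet; store or fetch it first")
        sql = f"SELECT {columns} FROM {self.id}"
        if where:
            sql += f" WHERE {where}"
        return Table.query(sql)


result = Table(id="syn123").query_self(where="age > 5")
```

People who only know a table's ID can still call `Table.query` directly, while code that already holds an instance gets the shorter form.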

Contributor

This is very much a future work/off topic thing, but what if we didn't make people pass the whole query string themselves?

We could use something like jinjasql to allow people to pass parameters that the Table class could format into a query. Then someone could either pass the id of a table they want in an uninstantiated case or when there is an instance of the Table class we are already using, we just grab the id on the back end and plug it into the query.

Unfortunately, it looks like jinjasql has not been maintained for a while, but maybe there is something else out there that could provide similar functionality.

Contributor Author

@BryanFauble BryanFauble Dec 1, 2023

Interesting library. My concerns when we get to solutioning on this are around continuing to provide the same if not more functionality for folks querying for data.

I had considered something like a "reference to my class instance" string that you could add into a query like:
select * from :my-table-instance-id, but that relies on those querying for the data to add in that string constant.

By not tying the query to a specific instance of a Table it makes things like joins easier to work with. SQL can be very complex and it'd be hard to implement everything we want into a Python class or a semi-structured way of doing it.
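The placeholder idea mentioned above might be sketched like this. The `bind_table_id` helper and the Synapse ID pattern check are assumptions made for illustration. Nothing like this exists in the client as described in this thread.

```python
import re

# Assumed ID shape for this sketch: "syn" followed by digits.
SYNAPSE_ID = re.compile(r"syn\d+")
PLACEHOLDER = ":my-table-instance-id"  # marker taken from the comment above


def bind_table_id(query: str, table_id: str) -> str:
    """Hypothetical helper: swap the placeholder for a validated table ID."""
    if not SYNAPSE_ID.fullmatch(table_id):
        raise ValueError(f"Not a Synapse ID: {table_id!r}")
    if PLACEHOLDER not in query:
        raise ValueError(f"Query does not contain {PLACEHOLDER}")
    return query.replace(PLACEHOLDER, table_id)


bound = bind_table_id("SELECT * FROM :my-table-instance-id WHERE age > 5", "syn123")
```

Because the rest of the SQL is left untouched, a query can still reference other tables by literal ID, so joins remain possible.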

Member

@thomasyu888 thomasyu888 left a comment

🔥 I'm going to pre-approve here. Overall, I think we can refactor the syncToSynapse function which will help the upload speeds.

This will also make it as seamless of a transition as possible.

synapseclient/models/annotations.py (resolved)

is_loaded: bool = False

def convert_from_api_parameters(
Member

Is there a general camel case to snake case converter?
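A generic converter can be written with the standard library alone. The sketch below is one possible approach. The function names are hypothetical, and acronym-heavy keys do not round-trip exactly.

```python
import re


def camel_to_snake(name: str) -> str:
    # Insert "_" before an uppercase letter that follows a lowercase letter
    # or digit, then lowercase the whole string.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def snake_to_camel(name: str) -> str:
    # Keep the first word as-is and capitalize each following word.
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


snake = camel_to_snake("dataFileHandleId")
camel = snake_to_camel("parent_id")
```

A key such as `contentMD5` becomes `content_md5` and converts back to `contentMd5`, so keys with acronyms would still need an explicit override table.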

test_scripts/oop_poc_file.py (outdated, resolved)

Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

54 New issues
3 Security Hotspots
No data about Coverage
7.8% Duplication on New Code

See analysis details on SonarCloud

@BryanFauble BryanFauble changed the title DRAFT: [SYNPY-1322] Object Orientated POC [SYNPY-1322] Object Orientated Programming Interfaces Jan 22, 2024
@BryanFauble BryanFauble marked this pull request as ready for review January 22, 2024 18:08
@BryanFauble BryanFauble requested a review from a team as a code owner January 22, 2024 18:08
Contributor

@BWMac BWMac left a comment

Just a couple minor comments but LGTM!

synapseclient/models/annotations.py (resolved)
synapseclient/models/file.py (outdated, resolved)
synapseclient/models/folder.py (outdated, resolved)
@BryanFauble BryanFauble merged commit 5f2e2b7 into develop Jan 22, 2024
29 checks passed
@BryanFauble BryanFauble deleted the SYNPY-1322-OOP-POC branch January 22, 2024 20:41
Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

103 New issues
3 Security Hotspots
2.4% Coverage on New Code
8.5% Duplication on New Code

See analysis details on SonarCloud
