Real Storage asv perf POC #2165
base: master
Conversation
python/arcticdb/util/utils.py
        self.__types[name] = dtype
        return self

    def add_float_col_ex(self, name: str, min: float, max: float, round_at: int, dtype: ArcticFloatType = np.float64
There is a lot of duplicated code here; why do we need both the _col and _col_ex versions of these functions?
Also, we already have similar functionality here; can we reuse/extend that?
You cannot have two functions with different parameter sets in Python, hence there are two main usage scenarios for the float and integer functions: one just gets random values (implicitly the whole range of values from -min to +max), and the other is when I need values only within a certain interval, plus, for floats, the additional option that you do not care about more than X digits, which can be important.
For each scenario there is a separate function that limits the options to only those that matter. In C# and similar languages, the functions that take the most arguments usually end in "_ex"; here I was not exactly sure what suffix to add. But from a supportability perspective, the standard Python pattern of handling all cases inside one function is perhaps not the best.
Also we already have similar functionality [here]

What I could, I reused; what I should not touch, I tried not to touch. Instead, it is a cleaner approach to start from a clean slate, especially if you look at the floats.
Overall, what is there is used here and there and should be changed with quite a lot of care.
On the other hand, I think we need a new, more systematic approach that is aligned with IntelliSense and groups related operations in utility classes. That would be more evident and readily available and would prevent mistakes.
Therefore, going forward, in more complex scenarios this approach will show its benefits through faster writing of tests with more steps.
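As a minimal sketch of the two-variant pattern being defended here (only add_float_col_ex's signature comes from the diff; the builder name, add_float_col, and all bodies are hypothetical):

import numpy as np

class DFGenerator:
    # hypothetical sketch of the column builder discussed above

    def add_float_col(self, name: str, dtype=np.float64):
        # simple scenario: random values over the full range of the dtype
        return self._add_col(name, dtype, min=None, max=None, round_at=None)

    def add_float_col_ex(self, name: str, min: float, max: float,
                         round_at: int, dtype=np.float64):
        # extended ("_ex") scenario: constrain values to [min, max] and
        # round them to a fixed number of digits
        return self._add_col(name, dtype, min=min, max=max, round_at=round_at)

    def _add_col(self, name, dtype, min, max, round_at):
        ...  # shared implementation goes here
        return self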
        for lib in self.__libraries:
            ac.delete_library(lib)

    def get_arctic(self):
I'd call this get_arctic_client
## Amazon s3 storage URL
AWS_URL_TEST = 's3s://s3.eu-west-1.amazonaws.com:arcticdb-asv-real-storage?aws_auth=true'
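For context, a URI like this is passed straight to the Arctic constructor; a minimal usage sketch, assuming the arcticdb package is installed and valid AWS credentials are available:

from arcticdb import Arctic

AWS_URL_TEST = 's3s://s3.eu-west-1.amazonaws.com:arcticdb-asv-real-storage?aws_auth=true'

# aws_auth=true defers credential lookup to the standard AWS chain
# (environment variables, config files, instance profile, ...)
ac = Arctic(AWS_URL_TEST)
print(ac.list_libraries())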
I'd prefer if you use the s3 fixture that we have https://github.com/man-group/ArcticDB/blob/master/build_tooling/transform_asv_results.py#L94
The reason for using a separate fixture is that there will be cases when we delete libraries; by accident we might delete the library holding the asv results. Thus keeping this set separate from other data is best. There could also be times when we wipe all data; if we combine the two, we would have to keep track of what is used and for what reason, which complicates things.
        lib_name = self.get_library_names(num_symbols)[1]
        self.__libraries.add(lib_name)
        if lib_opts is None:
            return ac.get_library(lib_name, create_if_missing=True)
There is no need for the if here; it can just be:

    return ac.get_library(lib_name, create_if_missing=True, library_options=lib_opts)
True - old Java habits.
""" | ||
return None | ||
|
||
def get_library_names(self, num_symbols: int = 1) -> List[str]: |
shouldn't this be an abstractmethod?
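For reference, a minimal sketch of what marking it abstract could look like (the class and method names are from the diff; the rest is assumed):

from abc import ABC, abstractmethod
from typing import List

class LibrariesBase(ABC):

    @abstractmethod
    def get_library_names(self, num_symbols: int = 1) -> List[str]:
        # each concrete setup class must supply its own naming scheme
        ...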
        ## Remove only LMDB libs as they need to
        if self.type == Storage.LMDB:
            ac = self.get_arctic()
            for lib in self.__libraries:
I am not sure what is the point of __libraries?
Wouldn't it be easier to just call ac.list_libraries()?
It keeps track of which libraries are created, so that they can eventually be deleted at the end of the asv tests, once I find the right place to do that, so as not to clutter the shared storages (or those on our WSLs).
Currently libs are not deleted when LMDB is used, which is OK for the transient GH runner but not yet optimal for us.
For shared storages we do have to take care of libs we no longer need, and the first moment for that is the end of the asv tests. That, however, might be tricky to do:
Option 1: The class wipes the data after the tests are done (this example). Yet, because asv opens many processes, there is no "at end of all tests" hook known to me. That work is still in progress; if there is any way to do it, I will.
Option 2: If no solution is found, we might need a new job, invoked after this one, that wipes out all "MOD_" and "TEMP_" libraries, or keeps only the "PERM_" libs; the last option is best IMO. (It might not be a job but a Python script in the current job, activated when real storage is used; see the sketch after this list.)
Option 2a: For personal WSLs, wiping out /tmp on a regular basis should be practiced, but I will leave that to the engineers for now.
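A minimal sketch of what such an Option 2 cleanup script could look like, assuming only the public arcticdb API; the function name and dry_run flag are illustrative:

from arcticdb import Arctic

def wipe_modifiable_libraries(uri: str, dry_run: bool = True) -> None:
    """Delete every 'MOD_'/'TEMP_' library, keeping the 'PERM_' ones."""
    ac = Arctic(uri)
    doomed = [lib for lib in ac.list_libraries()
              if lib.startswith(("MOD_", "TEMP_"))]
    for lib in doomed:
        print(f"deleting library: {lib}")  # destructive actions are always logged
        if not dry_run:
            ac.delete_library(lib)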
""" | ||
assert confirm, "deletion not confirmed!" | ||
ac = self.get_arctic() | ||
list = ac.list_libraries() |
nit: library_list or lib_list would be a better name.
Also, this can be written as lib_list = [lib for lib in ac.list_libraries() if lib.startswith(nameStarts)]
Then you won't need the other check.
Renamed the var, but I prefer to keep the current implementation. Reason:
- delete is a destructive operation and must always print what is deleted. That is why I added the 'assert confirm' first line, and I will also add a check that the required prefix is longer than 4 characters, so that we do not delete anything we did not intend to by accident (a sketch is below).
This function is not used yet, but it will be at the proper time, if I find a solution to the problem above.
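Taken together, the guarded delete described here could look roughly like this (the assert and length check are as described above; the method name and the rest of the body are assumed):

def delete_libraries(self, nameStarts: str, confirm: bool = False) -> None:
    """Destructive: deletes every library whose name starts with nameStarts."""
    assert confirm, "deletion not confirmed!"
    # a too-short prefix could match libraries we never meant to touch
    assert len(nameStarts) > 4, "prefix too short - refusing to delete"
    ac = self.get_arctic()
    lib_list = [lib for lib in ac.list_libraries() if lib.startswith(nameStarts)]
    for lib in lib_list:
        print(f"deleting: {lib}")  # always print what is deleted
        ac.delete_library(lib)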
fixed
python/benchmarks/real_read_write.py
SETUP_CLASS: ReadBenchmarkLibraries = ReadBenchmarkLibraries(Storage.LMDB)

params = SETUP_CLASS.get_parameter_list()
param_names = ["num_rows"]
If you are going to have get_parameter_list(), it might be more consistent to also have get_parameter_names_list().
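For illustration, the symmetric pair could look like this (the row counts and class body are hypothetical; only the naming convention is the point):

class ReadBenchmarkLibraries:
    # sketch only; the real class derives from LibrariesBase

    def get_parameter_list(self):
        return [1_000_000, 10_000_000]  # hypothetical row counts

    def get_parameter_names_list(self):
        return ["num_rows"]

SETUP_CLASS = ReadBenchmarkLibraries()

# asv picks these up as class attributes of the benchmark class:
params = SETUP_CLASS.get_parameter_list()
param_names = SETUP_CLASS.get_parameter_names_list()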
#region Setup classes

class SymbolLibraries(LibrariesBase):
I am confused as to why we need these 2 classes, SymbolLibraries and VersionLibraries.
They seem to do the same thing, but the versions one just writes more versions and has a different set of parameters.
I think this can be refactored to be more concise.
Because the versions currently depend on the number of symbols.
While the number of symbols grows linearly, the number of versions grows as sum(1..n), where n is the number of symbols. With just 50 symbols you get more than 300 versions; with 100, more than 5000 (see the worked numbers below).
For a test of listing symbols you usually need more symbols, typically thousands.
Creating a version library for 1000 symbols is very time consuming and expensive. We could do that once, and perhaps we will, as we have permanent storage, but the initial generation will be huge.
Why have I done the versions in this model?
- it is close to real life: each symbol has a different number of versions
- you get a symbol with the maximum number of versions, which is another test; a test for n/2 versions per symbol can easily be added too
Still, we have not decided yet whether to use the persistence feature or to regenerate the test data each time we need it (I think we should use the persistence feature).
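To make that growth concrete, a quick check of the triangular-number formula assumed above (symbol i gets i versions):

def total_versions(num_symbols: int) -> int:
    # if symbol i receives i versions, the total is 1 + 2 + ... + n = n(n+1)/2
    return num_symbols * (num_symbols + 1) // 2

print(total_versions(50))    # 1275   -> already well over 300 versions
print(total_versions(100))   # 5050   -> over 5000
print(total_versions(1000))  # 500500 -> why one-off generation is so expensive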
Reference Issues/PRs
The main idea of this set is to provide a repeatable set of performance measurements against persistent storage, initially AWS S3 and later any other storage that we support.
Different types of storage have different performance characteristics, so having one and the same asv benchmark test cover both is far from optimal.
Persistent storages provide a way to set up the data needed for the tests once (especially for read tests, i.e. those that do not modify anything).
At the same time they should isolate that data from the other tests (those that do writes) by providing temporary storage for their use.
The approach to building this kind of test setup is first to separate the logic for setting up the libraries and symbol data from the actual asv test. In other words, each test should use another class which is specialized in setting up data, making some checks, and wiping the data out if necessary.
Having a setup class that is not part of the asv test provides a way to set up the environment on demand in a controlled way, wipe it out, and so on. This class can easily facilitate that logic for all types of storage.
In this model the asv test becomes simple to write.
Most importantly, the tests are now ASV independent; they can be quickly moved to another tool if needed.
It also gives the opportunity to use parts of the pre-setup logic for tests outside of asv itself. In other words, the persistent part can be shared across different types of tests.
Adding a new storage type later (Azure etc.) is very easy, provided that it is a shared type of storage similar to AWS.
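As a rough illustration of the shape described above (all names here are illustrative, not the PR's actual classes):

from enum import Enum

class Storage(Enum):
    LMDB = 1
    AMAZON = 2

class ReadSetup:
    # owns library/symbol creation, checks, and cleanup for one storage type

    def __init__(self, storage: Storage):
        self.storage = storage

    def setup_environment(self):
        ...  # create libraries and pre-populate symbols, once per storage

    def teardown_environment(self):
        ...  # wipe temporary libraries on shared storages

class ReadBenchmark:
    # the asv test itself stays thin and tool-agnostic

    SETUP = ReadSetup(Storage.AMAZON)

    def setup(self):
        self.lib = ...  # fetch a pre-created library from SETUP

    def time_read(self):
        ...  # only the measured operation lives here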