-
Notifications
You must be signed in to change notification settings - Fork 932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Implement experimental DataCollector API 2 #2024
Conversation
Performance benchmarks:
|
Thank you very much for doing this, the data collector API looks quite clean to me. It's good to have this here on GitHub and not only on matrix. However I am not sure opening up more and more PRs is a good way to bring the discussion forward. The implementation is missing a lot of parts, so that it's not even working on its own. I think this is intentional, but this way I don't see the added value of a PR. The API could have been presented in #1944, now the discussion is just spread further and further apart. Hope this doesn't sound too negative, but please just post the API in #1944, so we can reach consensus before running into implementations. I think we are on the right track |
I don't see the problem here. The link to this PR could be quickly added in #1944 or #1997, if you are concerned about keeping track of implementations/API variations. This PR is a summary of the discussion on Matrix, and would be too noisy to add the entire code to #1944, which is already too lengthy for anyone to read. The point I am bringing forward in this PR is the way the implementation sketch, not just the API, i.e. |
Added 2nd commit where the data collection is grouped by the groups. |
Rather than open another PR, I just put it here for discussion. In my datacollection branch, I started implementing my view of data collection as outlined in #1944. Note that this is still far from complete, but it works. For example, I haven't included the ideas of Group and Measures in this yet. Nor is the The short version of the API is given below. See 2 examples for details of how it currently works. model = EpsteinCivilViolence()
citizens = model.get_agents_of_type(Citizen)
cops = model.get_agents_of_type(Cop)
dc = DataCollector(model, [
collect_from("n_quiescent", citizens, "condition",
lambda d: len([e for e in d if e["condition"] == "Quiescent"])),
collect_from("n_active", cops, "condition",
lambda d: len([e for e in d if e["condition"] == "Active"])),
collect_from("jailed", citizens, "jail_sentence",
lambda d: len([e for e in d if e["jail_sentence"] > 0])),
collect_from("data", citizens, ["jail_sentence", "arrest_probability"])
])
dc.collect_all()
for _ in range(10):
model.step()
dc.collect_all()
print(dc.n_quiescent.to_dataframe().head())
print(dc.data.to_dataframe().head()) Note that the branch also contains some pub/sub stuff. This you can completely ignore. |
Needs to be a group, which is already agreed upon, because of agent creation/death/group identity change.
I disagree, as stated in #1944 (comment). Summary: there should be only one obvious way to select agents, i.e. via AgentSet. And the selections should have been defined in Group. On the Boltzmann wealth example: https://github.com/quaquel/mesa/blob/9256e6c1521857cef3d0bc20a10b44e2b13dbb5f/mesa/experimental/datacollection/examples/boltzmann.py#L96-L99, as a user, I find having to have a distinct Meta:
I think another PR would be better, because it looks like you want to repurpose ideas from #1933 and #1947 into Measure. A PR allows extensive inline comments. |
@rht Please read my original message carefully. Your response annoys me.
|
I saw the posted working branch with the expectation that it is going to be reviewed. Regarding with pub/stuff, the distinction is not clear. I'm not sure if |
Ok. I clearly indicated that it was WIP and not yet complete. Given that you repeatedly said that you want to push forward with this, rather than wait until I have time to complete this branch, I shared it as a source of inspiration. You make 3 points, two of which (Group/Measure, and pub/sub) were covered explicitly in my message. That leaves one relevant point. On the AgentSetCollector, I disagree. There is a clear distinction in behavior between collecting data from an AgentSet and collecting data from another object. So, it makes sense to have separate classes for this. How to expose this to the user is, of course, a relevant concern, which is why I added the It might be that once I add Measure and Group, the AgentSetObserver becomes completely redundant, but we'll see. |
Let's return to this PR and the broader discussion on data collection. I am trying to understand the difference between Group and Measure. Conceptually, I understand a Measure as some observable model state at a particular time instant that is a function of internal model objects (e.g., agents, space, etc.). It also seems that a Measure is not otherwise used by the model itself internally because, in that case, it could have been implemented as a simple attribute. So, Gini in the Boltzman Wealth Model would be a Measure because it is not used internally. Conceptually, I understand a Group as a collection of agents that meet some particular selection criterion at a particular time instant. Key to the notion of a Group seems to be that membership is dynamic over time. So, in the Epstein model, citizens and cops are not groups because their membership does not change over time. In Wolf/Sheep, however, wolves and sheep might be groups because their membership changes over time (although you can always fall back on Is this in line with everyone else's understanding? Then, looking at the current Group and Measure code provided in this PR, I failed to square it with my conceptual understanding. The current code for both classes does basically the same thing: apply a callable to an object. Am I missing something, or is the current implementation just a very early draft? |
I admit I was being too impatient with the data collection progress. Regardless, thank you for reviewing the draft (I accidentally opened it as a non-draft). I agree regarding with the distinction between a measure and an attribute. A measure is not necessary to advance the system state.
I consider the citizens and cops in the Epstein model to be a group. My definition of a group is somewhat different: an AgentSet that is a result of a Regarding with getting the measure as a callable, there is a performance problem: |
That's a helpful clarification. I have to think a bit more about how to implement this. But in essence, a group is an AgentSet, so I am not sure we need a separate class for the group. We do however need some way of capturing the combination of the name and the optional callable that returns the group. This might be a a class or possibly just a simple function.
I was playing with Measure this morning, and I was thinking exactly the same thing. That is, I was thinking of adding an |
How I think about it is that a static group is just an AgentSet, while a dynamic group is something that returns an AgentSet. So we should be able to handle both equally with respect to data collection, but I think its a nice little distinction. Btw, is it possible to have something like static_group = model.agents.select(...)
dynamic_group = Group(...)
static_group => AgentSet
dynamic_group => AgentSet What I mean is calling both groups by their name should return an AgentSet. Is that somehow possible? Maybe we can subclass AgentSet in Group, but keep the dynamic nature? Just tossing in ideas here. I think that way we could get away with data collector only handling two types of value: Model attributes/measures and AgentSets/Groups or more abstract objects and sequences of objects
I would be cautious with using steps as a cache key. I would expect measure to always return the current value of the measure. It depends on how much statistics we want to provide for measure, but while the idea is to use it primarily for data collection, I wouldn't exclude them from being used inside the model logic. And some measure might change within-step. I know it also depends on how you define step, but so far this is completely inside the hands of the users. For example in some of my models I used step 0 as a calibration step where the model was doing some things until an equilibrium was reached and only than the "real" steps started. Using measures to, well, measure the equilibrium would have been nice, but wouldn't be possible if they are cached to the model step. |
I see the problem. There is no way to prevent the user from using the measures as if they were internal attributes that are changed several times within a step. One way to do it is to not cache by default, only to cache if the user wants to optimize for performance, and the user may specify their own cache key, just like a Solara's component's |
Here is an example of the dependencies for Solara components: mesa/mesa/experimental/jupyter_viz.py Lines 28 to 30 in 9d4f6e1
components_matplotlib.SpaceMatplotlib is updated only when the dependencies change.
|
We are thinking along similar lines. The issue with attribute-based access while still applying some logic within the attribute access is solvable through properties. And, as we discussed in the pub/sub context, this can be done dynamically. So yes, I believe it is possible to make this work.
I was also concerned about this issue. One solution would be to have some The other option is to use pub/sub. This resolves the entire problem because Measure would always reflect the current underlying information on which it operates. To be clear: I favor first developing this new data collection mechanism without pub/sub, while designing it such that we shift to it later if we decide to add it. In fact, it might be an easy way for me to show its value as reqeusted/suggested by @ewout in #1947. |
Closing in favor of #2199. This implementation is a no-go because |
This is a sketch of another data collector API as discussed on Matrix.org. The implementation is very short, so as to give an idea of how it works, and to experiment with it.
Highlights:
model
, and so its implementation is very simpleDownside: to group-by the data collection by group is not automatic, since the organization is completely flat, with no information about the group. OTOH, doing group-by in #2013 is very trivial.no longer the case. Already implemented.