Resume algorithm execution #28

Open
bdelespierre opened this issue Sep 3, 2021 · 6 comments

@bdelespierre
Owner

I believe it would be nice to be able to resume the algorithm's execution after it has completed. This would be useful when new points are added, so that previous iterations don't need to be re-run.

Example: I have clustered my 100 000 users into 5 clusters. Since the last clustering, 100 new users have been added. Most of them are probably already very close to the existing clusters' centroids. Hence, I should be able to resume clustering the same dataset PLUS the new users to save time.
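
To make the idea concrete, here is a minimal sketch of a warm start, assuming points and centroids are plain arrays of floats. `kmeansFrom` is a hypothetical helper, not part of the library's API: it runs ordinary Lloyd iterations but starts from centroids supplied by the caller, so a run over the old points plus the new ones can begin where the previous run ended.

```php
<?php
// Hypothetical helper (not the library's API): Lloyd iterations that start
// from caller-supplied centroids instead of a fresh initialization.

/**
 * @param list<list<float>> $points
 * @param list<list<float>> $centroids starting centroids, e.g. from a previous run
 * @return array{centroids: list<list<float>>, iterations: int}
 */
function kmeansFrom(array $points, array $centroids, int $maxIterations = 100): array
{
    $dimensions = count($centroids[0]);

    for ($iteration = 1; $iteration <= $maxIterations; $iteration++) {
        $sums = array_fill(0, count($centroids), array_fill(0, $dimensions, 0.0));
        $counts = array_fill(0, count($centroids), 0);

        // Assignment step: attach each point to its nearest centroid.
        foreach ($points as $point) {
            $nearest = 0;
            $best = INF;
            foreach ($centroids as $index => $centroid) {
                $distance = 0.0;
                foreach ($point as $d => $value) {
                    $distance += ($value - $centroid[$d]) ** 2;
                }
                if ($distance < $best) {
                    $best = $distance;
                    $nearest = $index;
                }
            }
            foreach ($point as $d => $value) {
                $sums[$nearest][$d] += $value;
            }
            $counts[$nearest]++;
        }

        // Update step: move each non-empty cluster's centroid to its mean.
        $updated = $centroids;
        foreach ($counts as $index => $count) {
            if ($count > 0) {
                foreach ($sums[$index] as $d => $sum) {
                    $updated[$index][$d] = $sum / $count;
                }
            }
        }

        // Converged when no centroid moved during this iteration.
        if ($updated === $centroids) {
            return ['centroids' => $centroids, 'iterations' => $iteration];
        }
        $centroids = $updated;
    }

    return ['centroids' => $centroids, 'iterations' => $maxIterations];
}

// First run: deliberately poor starting centroids.
$points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]];
$cold = kmeansFrom($points, [[0.0, 0.0], [1.0, 0.0]]);

// Later: two new points arrive; restart from the previous centroids.
$newPoints = [[0.5, 0.5], [10.5, 10.5]];
$warm = kmeansFrom(array_merge($points, $newPoints), $cold['centroids']);
```

On this toy data the warm run needs fewer iterations than the cold one, because the new points barely move the centroids. Whether that holds on real data depends on how far the new points sit from the existing clusters.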

@battlecook
Contributor

It would be good to provide this as an option, because the added points will also affect how the clusters are formed.
Clustering 100000 users and clustering 100100 users may give different results.
So it would be nice to have two options when using the library:

  1. After 100000 users are clustered, 100 additional users are clustered
  2. Re-clustering 100100 users
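
As a rough illustration of option 1, the sketch below only labels the additional points against centroids that stay frozen. Option 2 would instead be a full run over all 100100 points. `assignToNearest` is a hypothetical helper using plain arrays, not something the library provides.

```php
<?php
// Hypothetical helper: label new points with the index of their nearest
// existing centroid, without moving any centroid.

/**
 * @param list<list<float>> $newPoints
 * @param list<list<float>> $centroids centroids from the earlier clustering
 * @return array<int, int> centroid index for each new point, keyed like $newPoints
 */
function assignToNearest(array $newPoints, array $centroids): array
{
    $labels = [];
    foreach ($newPoints as $key => $point) {
        $best = INF;
        foreach ($centroids as $index => $centroid) {
            // Squared Euclidean distance is enough for comparing candidates.
            $distance = 0.0;
            foreach ($point as $d => $value) {
                $distance += ($value - $centroid[$d]) ** 2;
            }
            if ($distance < $best) {
                $best = $distance;
                $labels[$key] = $index;
            }
        }
    }

    return $labels;
}

$labels = assignToNearest([[0.2, 0.1], [9.0, 9.5]], [[0.0, 0.0], [10.0, 10.0]]);
```

This is cheap, but the existing clusters never change, which is exactly why the result can differ from re-clustering everything.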

@bdelespierre
Owner Author

bdelespierre commented Sep 6, 2021

Yes. I would propose something like:

$algo = new Kmeans\Algorithm(new Kmeans\RandomInitialization());

$result = $algo->clusterize($points, $nbClusters);

$serialized = serialize($result);

// later...

$previousRun = unserialize($serialized);

$result = $previousRun->resume($newPoints);

@battlecook
Contributor

looks good 👍

@bdelespierre
Owner Author

I've been thinking about a result object for Algorithm::clusterize. What do you think of this API?

<?php

namespace Bdelespierre\Kmeans\Interfaces;

interface ClusterizationResultInterface extends \Serializable
{
    public function hasReachedConvergence(): bool;

    /**
     * @return int<0, max>
     */
    public function iterationsCount(): int;

    public function getClusters(): ClusterCollectionInterface;

    public function resume(PointCollectionInterface $newPoints): self;
}
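
For discussion, here is one way the state behind such a result could be held. Everything in it is an assumption: `ResumableResult` is a simplified, hypothetical stand-in that uses plain arrays instead of the collection interfaces, needs PHP 8.0+, and takes the iteration logic as a callable (which the proposed `resume()` does not) so that the sketch stays self-contained.

```php
<?php
// Hypothetical, simplified stand-in for the proposed result object.
// Plain arrays replace ClusterCollectionInterface / PointCollectionInterface.
final class ResumableResult
{
    /**
     * @param list<list<float>> $points    every point clustered so far
     * @param list<list<float>> $centroids centroids at the end of the run
     */
    public function __construct(
        private array $points,
        private array $centroids,
        private int $iterations,
        private bool $converged
    ) {
    }

    public function hasReachedConvergence(): bool
    {
        return $this->converged;
    }

    public function iterationsCount(): int
    {
        return $this->iterations;
    }

    public function getCentroids(): array
    {
        return $this->centroids;
    }

    /**
     * @param callable $iterate receives (points, starting centroids) and
     *                          returns [centroids, iterations, converged]
     */
    public function resume(array $newPoints, callable $iterate): self
    {
        $points = array_merge($this->points, $newPoints);
        [$centroids, $iterations, $converged] = $iterate($points, $this->centroids);

        // Iteration counts accumulate across runs.
        return new self($points, $centroids, $this->iterations + $iterations, $converged);
    }

    // The state a later resume needs: points, centroids and run statistics.
    public function __serialize(): array
    {
        return [
            'points' => $this->points,
            'centroids' => $this->centroids,
            'iterations' => $this->iterations,
            'converged' => $this->converged,
        ];
    }

    public function __unserialize(array $data): void
    {
        $this->points = $data['points'];
        $this->centroids = $data['centroids'];
        $this->iterations = $data['iterations'];
        $this->converged = $data['converged'];
    }
}

$result = new ResumableResult([[0.0], [1.0]], [[0.5]], 3, true);
$restored = unserialize(serialize($result));
```

The sketch stores the original points alongside the centroids so that a resumed run can cover the whole dataset. Storing only the centroids would make the serialized payload much smaller, at the cost of requiring the caller to supply the old points again.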

@battlecook
Contributor

Sorry for the late reply. (I saw that it was committed to the PR.)

I think it's fine. But I think we'll have to do some more work to be more confident about the interface design.

@bdelespierre
Owner Author

It's not implemented in #27. I plan to implement it later.

@battlecook battlecook mentioned this issue Sep 22, 2021
Draft
@bdelespierre bdelespierre added this to the v3.0 milestone Mar 26, 2022