C2D refactor #818

Open: wants to merge 10 commits into base branch feature/c2d_docker

alexcos20 (Member) commented Jan 23, 2025

BREAKING CHANGES

Node config

We will use DOCKER_COMPUTE_ENVIRONMENTS to define Docker engines and their compute environments.
The full definition is in the C2DDockerConfig interface.

The free key is optional and defines which resources are available for free within a compute environment.

There are no longer separate free and paid compute environments, just one per Docker engine.

When startFreeCompute is called, the node applies the resource restrictions from the free config if the free key is defined; otherwise, startFreeCompute throws an error.
When startCompute is called, the node applies the environment's resource restrictions and ignores the free config.
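A minimal sketch of how the node might choose which limits apply. The types (ResourceLimit, FreeConfig, ComputeEnv, EffectiveLimits) and the selectLimits function are hypothetical and simplified from the fields visible in the example responses; they are not the actual C2DDockerConfig code.

```typescript
// Hypothetical, simplified shapes modeled on the example responses;
// the real definitions live in C2DDockerConfig and related interfaces.
interface ResourceLimit {
  id: string
  max: number
}

interface FreeConfig {
  maxJobDuration: number
  maxJobs: number
  resources: ResourceLimit[]
}

interface ComputeEnv {
  id: string
  maxJobDuration: number
  resources: ResourceLimit[]
  free?: FreeConfig // only present when the node owner enables free compute
}

interface EffectiveLimits {
  maxJobDuration: number
  resources: ResourceLimit[]
}

// startFreeCompute path (isFree = true): use the free limits, or fail if none exist.
// startCompute path (isFree = false): use the environment limits and ignore the free config.
function selectLimits(env: ComputeEnv, isFree: boolean): EffectiveLimits {
  if (isFree) {
    if (!env.free) {
      throw new Error(`Compute environment ${env.id} does not offer free compute`)
    }
    return { maxJobDuration: env.free.maxJobDuration, resources: env.free.resources }
  }
  return { maxJobDuration: env.maxJobDuration, resources: env.resources }
}
```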

This gives us several advantages:

  • easier configuration
  • resource monitoring
  • at job startup, we can count all resources in use and decide whether we actually have enough to start the job
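The last point amounts to an admission check at job startup. A minimal sketch, assuming hypothetical EnvResource and ResourceRequest shapes (id, total, inUse, amount) rather than the node's actual types: a job may start only if, for every requested resource, the amount already in use plus the requested amount stays within the total.

```typescript
// Hypothetical, simplified resource shape based on the resource definition in this PR.
interface EnvResource {
  id: string
  total: number
  inUse?: number
}

// Hypothetical request shape: which resource, and how much of it.
interface ResourceRequest {
  id: string
  amount: number
}

// Returns true only if every requested resource exists in the environment
// and its current usage plus the requested amount fits within the total.
function canStartJob(envResources: EnvResource[], request: ResourceRequest[]): boolean {
  for (const req of request) {
    const res = envResources.find((r) => r.id === req.id)
    if (!res) return false // unknown resource: reject the job
    const used = res.inUse ?? 0
    if (used + req.amount > res.total) return false
  }
  return true
}
```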

Resources

Every compute environment has resources. A resource is defined as:

export interface ComputeResource {
  id: ComputeResourceType
  type?: string
  kind?: string
  total: number // total amount of this resource
  min: number // minimum amount needed for a job
  max: number // maximum amount a job can use
  inUse?: number // for display purposes
}

There are three hardcoded resources, cpu, ram and disk, and the node owner can add more.
The disk resource must be specified explicitly, because there is currently no safe method of detecting the available disk space.
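One way to picture this: cpu and ram totals come from the Docker host, while disk (and any custom resource) must come from the node owner's config. The sketch below assumes a hypothetical mergeResources helper and a hypothetical DetectedHost object; it is not the node's actual detection code, and letting config entries override detected values is an assumption.

```typescript
interface Resource {
  id: string
  total: number
  min?: number
  max?: number
}

// Hypothetical values the node could read from the Docker host.
interface DetectedHost {
  cpus: number
  ramBytes: number
}

function mergeResources(host: DetectedHost, configured: Resource[]): Resource[] {
  // Start from the auto-detected resources.
  const byId = new Map<string, Resource>([
    ["cpu", { id: "cpu", total: host.cpus }],
    ["ram", { id: "ram", total: host.ramBytes }],
  ])
  // Assumption: config entries add new resources or override detected fields.
  for (const res of configured) {
    const existing = byId.get(res.id)
    byId.set(res.id, existing ? { ...existing, ...res } : res)
  }
  // Disk cannot be detected safely, so it must be configured explicitly.
  if (!byId.has("disk")) {
    throw new Error("The disk resource must be defined in DOCKER_COMPUTE_ENVIRONMENTS")
  }
  return Array.from(byId.values())
}
```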

When starting a job, the user can specify the resources needed; otherwise, for resources that have a min defined, the minimums are attached automatically.
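A minimal sketch of that defaulting rule, assuming a hypothetical resolveRequest helper and simplified types: explicitly requested amounts are validated against each resource's min and max, and any resource the user did not mention is attached at its min. Skipping zero minimums (such as disk in the examples) is an assumption made here to avoid empty entries.

```typescript
interface EnvResource {
  id: string
  min: number
  max: number
}

interface ResourceRequest {
  id: string
  amount: number
}

function resolveRequest(env: EnvResource[], requested: ResourceRequest[] = []): ResourceRequest[] {
  // Reject requests for resources the environment does not have.
  for (const req of requested) {
    if (!env.some((r) => r.id === req.id)) throw new Error(`Unknown resource: ${req.id}`)
  }
  const result: ResourceRequest[] = []
  for (const res of env) {
    const req = requested.find((r) => r.id === res.id)
    if (req) {
      // Explicit request: must fall within the environment's [min, max] range.
      if (req.amount < res.min || req.amount > res.max) {
        throw new Error(`${res.id}: requested ${req.amount}, allowed ${res.min}..${res.max}`)
      }
      result.push(req)
    } else if (res.min > 0) {
      // Not requested: attach the minimum automatically (zero minimums are skipped).
      result.push({ id: res.id, amount: res.min })
    }
  }
  return result
}
```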

Free resources can be defined for a compute environment; this is what anyone can use at no cost via startFreeCompute. Free resources are part of the compute environment's resources and are counted against them.

Docker with no free compute

export DOCKER_COMPUTE_ENVIRONMENTS="[{\"socketPath\":\"/var/run/docker.sock\",\"storageExpiry\":604800,\"resources\":[{\"id\":\"disk\",\"total\":1000000000}],\"maxJobDuration\":3600,\"fees\":{\"1\":[{\"feeToken\":\"0x123\",\"prices\":[{\"id\":\"cpu\",\"price\":1}]}]}}]"
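Escaping this JSON by hand is easy to get wrong. As a sketch (not a tool shipped with the node), the same value can be produced with JSON.stringify and wrapped in single quotes for a POSIX shell; the object mirrors the example above field for field, and only those fields are included.

```typescript
// Mirrors the "Docker with no free compute" example; field names are taken from it.
const environments = [
  {
    socketPath: "/var/run/docker.sock",
    storageExpiry: 604800,
    resources: [{ id: "disk", total: 1000000000 }],
    maxJobDuration: 3600,
    fees: {
      "1": [{ feeToken: "0x123", prices: [{ id: "cpu", price: 1 }] }],
    },
  },
]

// The JSON contains no single quotes, so wrapping it in single quotes
// lets a POSIX shell take it literally without backslash escaping.
const json = JSON.stringify(environments)
const exportLine = `export DOCKER_COMPUTE_ENVIRONMENTS='${json}'`
console.log(exportLine)
```

The JSON inside the printed line is the same value as the escaped string above, just without the backslashes.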

All other settings, such as noOfCpus and RAM, are detected automatically:

2025-01-23T11:18:01.757Z debug: CORE:   CORE:   ComputeGetEnvironmentsCommand Response: 
[
    {
        "id": "0x6075f99f7962fbd0b122dd62405ee1d09ee6855eaf1095b60c9dbbdc5c76eb05-0xcd0ca6d958fc9f3dbb9a64da37ac638f0d8141319d8679eaf59d53f55a0e487f",
        "runningJobs": 0,
        "consumerAddress": "0xf9C5B7eE7708efAc6dC6Bc7d4b0455eBbf22b519",
        "platform": {
            "architecture": "x86_64",
            "os": "Ubuntu 22.04.3 LTS"
        },
        "fees": {
            "1": [
                [
                    {
                        "feeToken": "0x123",
                        "prices": [
                            {
                                "id": "cpu",
                                "price": 1
                            }
                        ]
                    }
                ]
            ]
        },
        "storageExpiry": 604800,
        "maxJobDuration": 3600,
        "resources": [
            {
                "id": "cpu",
                "total": 16,
                "max": 16,
                "min": 1,
                "inUse": 0
            },
            {
                "id": "ram",
                "total": 33617678336,
                "max": 33617678336,
                "min": 1000000000,
                "inUse": 0
            },
            {
                "id": "disk",
                "total": 1000000000,
                "max": 1000000000,
                "min": 0,
                "inUse": 0
            }
        ],
        "runningfreeJobs": 0
    }
]

Docker with free compute

export DOCKER_COMPUTE_ENVIRONMENTS="[{\"socketPath\":\"/var/run/docker.sock\",\"resources\":[{\"id\":\"disk\",\"total\":1000000000}],\"storageExpiry\":604800,\"maxJobDuration\":3600,\"fees\":{\"1\":[{\"feeToken\":\"0x123\",\"prices\":[{\"id\":\"cpu\",\"price\":1}]}]},\"free\":{\"maxJobDuration\":60,\"maxJobs\":3,\"resources\":[{\"id\":\"cpu\",\"max\":1},{\"id\":\"ram\",\"max\":1000000000},{\"id\":\"disk\",\"max\":1000000000}]}}]"
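Because free resources are carved out of the environment's resources, a sanity check along these lines could catch misconfigurations at startup. This is a hypothetical validateFreeConfig helper with simplified types; the PR does not state that the node performs this check.

```typescript
interface EnvResource {
  id: string
  total: number
}

interface FreeResource {
  id: string
  max: number
}

// Returns a list of problems; an empty list means the free tier fits inside the environment.
function validateFreeConfig(env: EnvResource[], free: FreeResource[]): string[] {
  const problems: string[] = []
  for (const f of free) {
    const res = env.find((r) => r.id === f.id)
    if (!res) {
      problems.push(`free resource "${f.id}" is not defined in the environment`)
    } else if (f.max > res.total) {
      problems.push(`free ${f.id} max (${f.max}) exceeds environment total (${res.total})`)
    }
  }
  return problems
}
```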

All other settings, such as noOfCpus and RAM, are detected automatically:

2025-01-23T11:19:33.193Z debug: CORE:   CORE:   ComputeGetEnvironmentsCommand Response: 
[
    {
        "id": "0x7d187e4c751367be694497ead35e2937ece3c7f3b325dcb4f7571e5972d092bd-0x3a5647ecc1b6a3b4eaaeaf818037ca5a7aea23b2302f9c09c2ad716d3b0f94c7",
        "runningJobs": 0,
        "consumerAddress": "0xf9C5B7eE7708efAc6dC6Bc7d4b0455eBbf22b519",
        "platform": {
            "architecture": "x86_64",
            "os": "Ubuntu 22.04.3 LTS"
        },
        "fees": {
            "1": [
                [
                    {
                        "feeToken": "0x123",
                        "prices": [
                            {
                                "id": "cpu",
                                "price": 1
                            }
                        ]
                    }
                ]
            ]
        },
        "storageExpiry": 604800,
        "maxJobDuration": 3600,
        "resources": [
            {
                "id": "cpu",
                "total": 16,
                "max": 16,
                "min": 1,
                "inUse": 0
            },
            {
                "id": "ram",
                "total": 33617678336,
                "max": 33617678336,
                "min": 1000000000,
                "inUse": 0
            },
            {
                "id": "disk",
                "total": 1000000000,
                "max": 1000000000,
                "min": 0,
                "inUse": 0
            }
        ],
        "free": {
            "maxJobDuration": 60,
            "maxJobs": 3,
            "resources": [
                {
                    "id": "cpu",
                    "max": 1,
                    "inUse": 0
                },
                {
                    "id": "ram",
                    "max": 1000000000,
                    "inUse": 0
                },
                {
                    "id": "disk",
                    "max": 1000000000,
                    "inUse": 0
                }
            ]
        },
        "runningfreeJobs": 0
    }
]

ComputeEnv interface & getComputeEnvironments command

See the new structures in the example responses above.

Commands

  • startFreeCompute now requires environmentId.
  • startFreeCompute and startCompute accept a new optional resources parameter, which lets the user request specific resources instead of all of them. Consider the following scenario:
    - the host exposes 16 CPUs and 4 GPUs in a compute env
    - in the current setup, running a job means paying for all of them: (16 x price_per_cpu_per_min + 4 x price_per_gpu_per_min) x minutes
    - by passing something like resources: [{ "type": "cpu", "amount": 4 }, { "type": "gpu", "amount": 1 }], you only pay for 4 CPUs and 1 GPU, because that is what you get
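The scenario's arithmetic can be sketched as a per-minute sum over the requested resources. The prices (1 per CPU per minute, 10 per GPU per minute) and the 60-minute duration are made up for illustration. The sketch keys resources by id to match the resource definition; the exact field names of ComputeResourceRequest are not shown here, and treating a resource with no listed price as free is an assumption.

```typescript
// Hypothetical shapes: price per unit per minute, and a requested amount.
interface Price {
  id: string
  price: number
}

interface ResourceRequest {
  id: string
  amount: number
}

// cost = sum(amount x price per minute) x minutes
function jobCost(request: ResourceRequest[], prices: Price[], minutes: number): number {
  let perMinute = 0
  for (const r of request) {
    // Assumption: resources without a listed price cost nothing.
    const price = prices.find((p) => p.id === r.id)?.price ?? 0
    perMinute += r.amount * price
  }
  return perMinute * minutes
}

// Hypothetical prices: 1 per CPU per minute, 10 per GPU per minute.
const prices: Price[] = [
  { id: "cpu", price: 1 },
  { id: "gpu", price: 10 },
]
console.log(jobCost([{ id: "cpu", amount: 16 }, { id: "gpu", amount: 4 }], prices, 60)) // 3360
console.log(jobCost([{ id: "cpu", amount: 4 }, { id: "gpu", amount: 1 }], prices, 60)) // 840
```

With these numbers, requesting only 4 CPUs and 1 GPU cuts the cost of a one-hour job from 3360 to 840.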

alexcos20 changed the title from "start refactor" to "C2D refactor" on Jan 27, 2025.
alexcos20 marked this pull request as ready for review on January 28, 2025, 09:43.
jamiehewitt15 (Member) commented Jan 29, 2025

I ran this branch with:

export DOCKER_COMPUTE_ENVIRONMENTS="[{\"socketPath\":\"/var/run/docker.sock\",\"resources\":[{\"id\":\"disk\",\"total\":1000000000}],\"storageExpiry\":604800,\"maxJobDuration\":3600,\"fees\":{\"1\":[{\"feeToken\":\"0x123\",\"prices\":[{\"id\":\"cpu\",\"price\":1}]}]},\"free\":{\"maxJobDuration\":60,\"maxJobs\":3,\"resources\":[{\"id\":\"cpu\",\"max\":1},{\"id\":\"ram\",\"max\":1000000000},{\"id\":\"disk\",\"max\":1000000000}]}}]"

I got the error below when calling GET /api/services/computeEnvironments. Is this a bug, or is something else needed in the setup? If something is missing from the setup, we should return a more helpful response; currently it's just a 500 Internal Server Error.

2025-01-29T10:21:15.395Z debug: HTTP:	HTTP:	GET computeEnvironments request received with query: {}
2025-01-29T10:21:15.396Z info: CORE:	Checking received command data for Command "getComputeEnvironments": {
    "command": "getComputeEnvironments",
    "chainId": null,
    "node": null
}
2025-01-29T10:21:15.467Z debug: CORE:	CORE:	ComputeGetEnvironmentsCommand Response: [
  {
    "id": "0x7d187e4c751367be694497ead35e2937ece3c7f3b325dcb4f7571e5972d092bd-0x071ead74e903edeb2ad40d196f03db09f70811ede01f3e111fd5106f52b388ee",
    "runningJobs": 0,
    "consumerAddress": "0xe2DD09d719Da89e5a3D0F2549c7E24566e947260",
    "platform": {
      "architecture": "x86_64",
      "os": "Ubuntu 22.04.4 LTS"
    },
    "fees": null,
    "storageExpiry": 604800,
    "maxJobDuration": 3600,
    "resources": [
      {
        "id": "cpu",
        "total": 2,
        "max": 2,
        "min": 1,
        "inUse": 0
      },
      {
        "id": "ram",
        "total": 4103393280,
        "max": 4103393280,
        "min": 1000000000,
        "inUse": 0
      },
      {
        "id": "disk",
        "total": 1000000000,
        "max": 1000000000,
        "min": 0,
        "inUse": 0
      }
    ],
    "free": {
      "maxJobDuration": 60,
      "maxJobs": 3,
      "resources": [
        {
          "id": "cpu",
          "max": 1,
          "inUse": 0
        },
        {
          "id": "ram",
          "max": 1000000000,
          "inUse": 0
        },
        {
          "id": "disk",
          "max": 1000000000,
          "inUse": 0
        }
      ]
    },
    "runningfreeJobs": 0
  }
]
2025-01-29T10:21:15.469Z error: Error: TypeError: Cannot read properties of undefined (reading 'length')

Review comments on the diff:

export interface FreeComputeStartCommand extends Command {
  consumerAddress: string
  signature: string
  nonce: string
  environment: string
Member (review comment):

Could we make this optional? If there is only one free compute environment, we don't need to send the environment ID; we just use the only one available. I'm not sure why someone would have multiple free environments; it seems more likely that they would have multiple paid environments and a single free one.

Contributor (reply):

I agree. Now it's mandatory to know the ID in advance, while before it wasn't, and usually there is only one free environment.
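The reviewers' suggestion could look roughly like this. resolveFreeEnvironment and the simplified ComputeEnv type are hypothetical and not part of the PR: if an environment ID is supplied it must exist and offer free compute; if it is omitted, the single free environment is used, and the call fails when that choice would be ambiguous.

```typescript
interface ComputeEnv {
  id: string
  free?: { maxJobDuration: number; maxJobs: number }
}

// If environmentId is given, use it; otherwise fall back to the only
// environment that offers free compute, and fail if that is ambiguous.
function resolveFreeEnvironment(envs: ComputeEnv[], environmentId?: string): ComputeEnv {
  if (environmentId !== undefined) {
    const env = envs.find((e) => e.id === environmentId)
    if (!env) throw new Error(`Unknown compute environment: ${environmentId}`)
    if (!env.free) throw new Error(`Environment ${environmentId} does not offer free compute`)
    return env
  }
  const freeEnvs = envs.filter((e) => e.free !== undefined)
  if (freeEnvs.length === 1) return freeEnvs[0]
  throw new Error(
    freeEnvs.length === 0
      ? "No environment offers free compute"
      : "Multiple free environments available; environmentId is required"
  )
}
```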

Further down in FreeComputeStartCommand:

  algorithm: ComputeAlgorithm
  datasets?: ComputeAsset[]
  output?: ComputeOutput
  resources?: ComputeResourceRequest[]
Member (review comment):

Why would anyone use this on free compute? It's free, so everyone will always want the maximum possible resources.

jamiehewitt15 (Member)

Despite the error I get when calling GET /api/services/computeEnvironments, starting a compute job works fine.

paulo-ocean (Contributor) commented Jan 30, 2025

I have the same issue:

2025-01-29T10:21:15.469Z error: Error: TypeError: Cannot read properties of undefined (reading 'length')

The problem seems to be in the areEmpty function, on this line:

if (computeEnvs[supportedNetwork].length === 0) {

However, I don't see any job starting; are you sure it works?
Do we need changes in the CLI and SDK as well? With the CLI we don't get past the last error, and if I fix that I get others, so it's not working for me at all.
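The error means computeEnvs[supportedNetwork] is undefined for at least one network. A hedged sketch of the usual guard: the ComputeEnvsByChain type and the hasNoEnvironments helper are hypothetical, since the real areEmpty signature isn't shown in this thread. It treats a missing entry the same as an empty list instead of reading .length on undefined.

```typescript
// Hypothetical shape: environments grouped by chain ID; some chains may have no entry.
type ComputeEnvsByChain = Record<string, unknown[] | undefined>

// Safe replacement for `computeEnvs[supportedNetwork].length === 0`.
function hasNoEnvironments(
  computeEnvs: ComputeEnvsByChain | null | undefined,
  supportedNetwork: string
): boolean {
  const envs = computeEnvs?.[supportedNetwork]
  return !envs || envs.length === 0
}
```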

jamiehewitt15 (Member) commented Jan 30, 2025

When I increase the free compute maxJobDuration to 600, starting a compute job via a direct command fails with: 500 Invalid C2D Environment. It works when maxJobDuration is set to 60.

paulo-ocean (Contributor)

OK, so for now it only works with a direct command, but we need to know the environment ID in advance.


Review comment on the diff:

export interface ComputeResource {
  id: ComputeResourceType
  type?: string
Contributor (review comment):

What are type and kind? They could be confused with ComputeResourceType, which is now the type of id.

jamiehewitt15 (Member)

Quoting: "but we need to know the env id in advance"

Yes, it shows up in the node logs, so it can be copied from there. That's how I got it to work.


3 participants