feat: cache npm metadata #5491

Merged
merged 34 commits into from
Jun 24, 2023
Changes from 25 commits
Commits
d09c0d2
feat: cache npm metadata
paul-soporan Sep 23, 2022
9cd8119
Merge branch 'master' into paul/feat/cache-npm-metadata
paul-soporan Dec 2, 2022
f0aaee3
Merge branch 'master' into paul/feat/cache-npm-metadata
paul-soporan Jun 7, 2023
c577e8b
refactor: tweaks
paul-soporan Jun 7, 2023
175f1a1
refactor: more tweaks
paul-soporan Jun 8, 2023
dba9140
refactor: even more tweaks
paul-soporan Jun 8, 2023
7ec0583
chore: versions
paul-soporan Jun 8, 2023
2deff2a
chore: changelog
paul-soporan Jun 8, 2023
caa8e9c
test: update tests
paul-soporan Jun 8, 2023
a5ff389
chore: update benchmarks
paul-soporan Jun 8, 2023
a1bf470
chore: add todos
paul-soporan Jun 8, 2023
5df66c1
perf: trim stored metadata
paul-soporan Jun 8, 2023
f3fc013
refactor: don't export the cache version
paul-soporan Jun 8, 2023
9e42027
refactor: drop unnecessary `as Filename`s
paul-soporan Jun 10, 2023
f6fbaa5
fix: add name to stored metadata
paul-soporan Jun 10, 2023
5a0e7c5
feat: automatically derive cache key
paul-soporan Jun 10, 2023
7f7bde8
refactor: slice the cache key
paul-soporan Jun 10, 2023
65b4138
test: update tests
paul-soporan Jun 10, 2023
30a97dd
chore: fix benchmark script
paul-soporan Jun 10, 2023
7cc1f1b
Merge branch 'master' into paul/feat/cache-npm-metadata
paul-soporan Jun 16, 2023
a82c891
refactor: remove redundant check
paul-soporan Jun 16, 2023
47ab924
feat: disable metadata cache during hardened mode
paul-soporan Jun 16, 2023
3efc7a9
fix: operator precedence
paul-soporan Jun 17, 2023
ede7a2a
refactor: final tweaks
paul-soporan Jun 17, 2023
5268b7a
refactor: use return
paul-soporan Jun 17, 2023
c630d25
fix: ensure metadata cache is atomic
paul-soporan Jun 18, 2023
d838e3b
refactor: rename for consistency
paul-soporan Jun 18, 2023
dd07bf4
Apply suggestions from code review
paul-soporan Jun 21, 2023
e662c82
Revert "Apply suggestions from code review"
paul-soporan Jun 21, 2023
6778080
Merge branch 'master' into paul/feat/cache-npm-metadata
paul-soporan Jun 23, 2023
2b1e1ec
refactor: remove `deprecated` field
paul-soporan Jun 23, 2023
07b4ceb
refactor: move `npm-metadata` folder to `metadata/npm`
paul-soporan Jun 23, 2023
7e09021
refactor: use single argument
paul-soporan Jun 23, 2023
600c358
Merge branch 'master' into paul/feat/cache-npm-metadata
arcanis Jun 24, 2023
44 changes: 44 additions & 0 deletions .pnp.cjs

Large diffs are not rendered by default.

39 changes: 39 additions & 0 deletions .yarn/versions/07d34d58.yml
@@ -0,0 +1,39 @@
+releases:
+  "@yarnpkg/cli": minor
+  "@yarnpkg/core": minor
+  "@yarnpkg/fslib": minor
+  "@yarnpkg/plugin-npm": minor
+
+declined:
+  - "@yarnpkg/plugin-compat"
+  - "@yarnpkg/plugin-constraints"
+  - "@yarnpkg/plugin-dlx"
+  - "@yarnpkg/plugin-essentials"
+  - "@yarnpkg/plugin-exec"
+  - "@yarnpkg/plugin-file"
+  - "@yarnpkg/plugin-git"
+  - "@yarnpkg/plugin-github"
+  - "@yarnpkg/plugin-http"
+  - "@yarnpkg/plugin-init"
+  - "@yarnpkg/plugin-interactive-tools"
+  - "@yarnpkg/plugin-link"
+  - "@yarnpkg/plugin-nm"
+  - "@yarnpkg/plugin-npm-cli"
+  - "@yarnpkg/plugin-pack"
+  - "@yarnpkg/plugin-patch"
+  - "@yarnpkg/plugin-pnp"
+  - "@yarnpkg/plugin-pnpm"
+  - "@yarnpkg/plugin-stage"
+  - "@yarnpkg/plugin-typescript"
+  - "@yarnpkg/plugin-version"
+  - "@yarnpkg/plugin-workspace-tools"
+  - vscode-zipfs
+  - "@yarnpkg/builder"
+  - "@yarnpkg/doctor"
+  - "@yarnpkg/extensions"
+  - "@yarnpkg/libzip"
+  - "@yarnpkg/nm"
+  - "@yarnpkg/pnp"
+  - "@yarnpkg/pnpify"
+  - "@yarnpkg/sdks"
+  - "@yarnpkg/shell"
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -68,6 +68,7 @@ The following changes only affect people writing Yarn plugins:

 ### Installs

+- Yarn now caches npm version metadata, leading to faster resolution steps and decreased network data usage.
 - The `pnpm` linker avoids creating symlinks that lead to loops on the file system, by moving them higher up in the directory structure.
 - The `pnpm` linker no longer reports duplicate "incompatible virtual" warnings.

@@ -48,6 +48,7 @@ describe(`Commands`, () => {
         await expect(run(`stage`, `-n`, {cwd: path})).resolves.toMatchObject({
           stdout: [
             `${npath.fromPortablePath(`${path}/.pnp.cjs`)}\n`,
+            `${npath.fromPortablePath(`${path}/.yarn/global/npm-metadata/a24526/localhost/no-deps.json`)}\n`,
             `${npath.fromPortablePath(`${path}/.yarn/global/cache/no-deps-npm-1.0.0-cf533b267a-0.zip`)}\n`,
             `${npath.fromPortablePath(`${path}/.yarn/cache/.gitignore`)}\n`,
             `${npath.fromPortablePath(`${path}/.yarn/cache/no-deps-npm-1.0.0-cf533b267a-8bd88a447c.zip`)}\n`,
2 changes: 2 additions & 0 deletions packages/plugin-npm/package.json
@@ -11,6 +11,7 @@
   "dependencies": {
     "@yarnpkg/fslib": "workspace:^",
     "enquirer": "^2.3.6",
+    "lodash": "^4.17.15",
     "semver": "^7.1.2",
     "ssri": "^6.0.1",
     "tslib": "^2.4.0"
@@ -20,6 +21,7 @@
     "@yarnpkg/plugin-pack": "workspace:^"
   },
   "devDependencies": {
+    "@types/lodash": "^4.14.136",
     "@types/semver": "^7.1.0",
     "@types/ssri": "^6.0.1",
     "@yarnpkg/core": "workspace:^",
16 changes: 6 additions & 10 deletions packages/plugin-npm/sources/NpmSemverResolver.ts
@@ -47,11 +47,9 @@ export class NpmSemverResolver implements Resolver {
     if (range === null)
       throw new Error(`Expected a valid range, got ${descriptor.range.slice(PROTOCOL.length)}`);

-    const registryData = await npmHttpUtils.get(npmHttpUtils.getIdentUrl(descriptor), {
-      customErrorMessage: npmHttpUtils.customPackageError,
-      configuration: opts.project.configuration,
-      ident: descriptor,
-      jsonResponse: true,
+    const registryData = await npmHttpUtils.getPackageMetadata(descriptor, {
+      project: opts.project,
+      version: semver.valid(range.raw) ? range.raw : undefined,
     });

     const candidates = miscUtils.mapAndFilter(Object.keys(registryData.versions), version => {
@@ -127,11 +125,9 @@ export class NpmSemverResolver implements Resolver {
     if (version === null)
       throw new ReportError(MessageName.RESOLVER_NOT_FOUND, `The npm semver resolver got selected, but the version isn't semver`);

-    const registryData = await npmHttpUtils.get(npmHttpUtils.getIdentUrl(locator), {
-      customErrorMessage: npmHttpUtils.customPackageError,
-      configuration: opts.project.configuration,
-      ident: locator,
-      jsonResponse: true,
+    const registryData = await npmHttpUtils.getPackageMetadata(locator, {
+      project: opts.project,
+      version,
     });

     if (!Object.prototype.hasOwnProperty.call(registryData, `versions`))
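The `version` option that the semver resolver now passes is what allows an exact-version request to be answered from the on-disk cache without contacting the registry. The decision can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `exactVersionOf` is a simplified, hypothetical stand-in for `semver.valid` (it only accepts plain `x.y.z` strings), and `canUseCachedCopy` is a hypothetical helper name.

```typescript
// Only the fields needed for this sketch; the real metadata carries more.
type PackageMetadata = {
  'dist-tags': Record<string, string>;
  versions: Record<string, unknown>;
};

// Hypothetical, simplified stand-in for `semver.valid`: plain `x.y.z` only.
function exactVersionOf(range: string): string | undefined {
  return /^\d+\.\d+\.\d+$/.test(range) ? range : undefined;
}

// True when the cached copy already contains the exact version requested,
// so the caller could return it without revalidating against the registry.
function canUseCachedCopy(cachedEntry: PackageMetadata | null, version: string | undefined): boolean {
  if (cachedEntry === null || typeof version === `undefined`)
    return false;

  return Object.prototype.hasOwnProperty.call(cachedEntry.versions, version);
}

const cachedEntry: PackageMetadata = {
  'dist-tags': {latest: `1.0.0`},
  versions: {'1.0.0': {}},
};

console.log(canUseCachedCopy(cachedEntry, exactVersionOf(`1.0.0`)));  // true: exact version is cached
console.log(canUseCachedCopy(cachedEntry, exactVersionOf(`^1.0.0`))); // false: a range needs fresh data
console.log(canUseCachedCopy(cachedEntry, exactVersionOf(`2.0.0`)));  // false: version not in the cache
```

Ranges and tags still go to the registry because a newer matching version may have been published since the cached copy was written.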
6 changes: 2 additions & 4 deletions packages/plugin-npm/sources/NpmTagResolver.ts
@@ -39,10 +39,8 @@ export class NpmTagResolver implements Resolver {
   async getCandidates(descriptor: Descriptor, dependencies: unknown, opts: ResolveOptions) {
     const tag = descriptor.range.slice(PROTOCOL.length);

-    const registryData = await npmHttpUtils.get(npmHttpUtils.getIdentUrl(descriptor), {
-      configuration: opts.project.configuration,
-      ident: descriptor,
-      jsonResponse: true,
+    const registryData = await npmHttpUtils.getPackageMetadata(descriptor, {
+      project: opts.project,
     });

     if (!Object.prototype.hasOwnProperty.call(registryData, `dist-tags`))
209 changes: 180 additions & 29 deletions packages/plugin-npm/sources/npmHttpUtils.ts
@@ -1,11 +1,13 @@
-import {Configuration, Ident, formatUtils, httpUtils, nodeUtils, StreamReport} from '@yarnpkg/core';
-import {MessageName, ReportError}                                              from '@yarnpkg/core';
-import {prompt}                                                                from 'enquirer';
-import {URL}                                                                   from 'url';
+import {Configuration, Ident, formatUtils, httpUtils, nodeUtils, StreamReport, structUtils, IdentHash, hashUtils, Project} from '@yarnpkg/core';
+import {MessageName, ReportError}                                                                                          from '@yarnpkg/core';
+import {Filename, ppath, toFilename, xfs}                                                                                  from '@yarnpkg/fslib';
+import {prompt}                                                                                                            from 'enquirer';
+import pick                                                                                                                from 'lodash/pick';
+import {URL}                                                                                                               from 'url';

-import {Hooks}                                                                 from './index';
-import * as npmConfigUtils                                                     from './npmConfigUtils';
-import {MapLike}                                                               from './npmConfigUtils';
+import {Hooks}                                                                                                             from './index';
+import * as npmConfigUtils                                                                                                 from './npmConfigUtils';
+import {MapLike}                                                                                                           from './npmConfigUtils';

 export enum AuthType {
   NO_AUTH,
@@ -33,7 +35,7 @@ export type Options = httpUtils.Options & RegistryOptions & {
  * It doesn't handle 403 Forbidden, as the npm registry uses it when the user attempts
  * a prohibited action, such as publishing a package with a similar name to an existing package.
  */
-export async function handleInvalidAuthenticationError(error: any, {attemptedAs, registry, headers, configuration}: {attemptedAs?: string, registry: string, headers: {[key: string]: string} | undefined, configuration: Configuration}) {
+export async function handleInvalidAuthenticationError(error: any, {attemptedAs, registry, headers, configuration}: {attemptedAs?: string, registry: string, headers: {[key: string]: string | undefined} | undefined, configuration: Configuration}) {
   if (isOtpError(error))
     throw new ReportError(MessageName.AUTHENTICATION_INVALID, `Invalid OTP token`);

@@ -64,15 +66,166 @@ export function getIdentUrl(ident: Ident) {
   }
 }

+export type GetPackageMetadataOptions = Omit<Options, 'ident' | 'configuration'> & {
+  project: Project;
+
+  /**
+   * Warning: This option will return all cached metadata if the version is found, but the rest of the metadata can be stale.
+   */
+  version?: string;
+};
+
+// We use 2 different caches:
+// - an in-memory cache, to avoid hitting the disk and the network more than once per process for each package
+// - an on-disk cache, for exact version matches and to avoid refetching the metadata if the resource hasn't changed on the server
+
+const PACKAGE_METADATA_CACHE = new Map<IdentHash, PackageMetadata>();
+
+/**
+ * Caches and returns the package metadata for the given ident.
+ *
+ * Note: This function only caches and returns specific fields from the metadata.
+ * If you need other fields, use the uncached {@link get} or consider whether it would make more sense to extract
+ * the fields from the on-disk packages using the linkers or from the fetch results using the fetchers.
+ */
+export async function getPackageMetadata(ident: Ident, {project, registry, headers, version, ...rest}: GetPackageMetadataOptions): Promise<PackageMetadata> {
+  const {configuration} = project;
+
+  const cachedInMemory = PACKAGE_METADATA_CACHE.get(ident.identHash);
+  if (cachedInMemory)
+    return cachedInMemory;
+
+  registry = normalizeRegistry(configuration, {ident, registry});
+
+  const registryFolder = getRegistryFolder(configuration, registry);
+  const identPath = ppath.join(registryFolder, `${structUtils.slugifyIdent(ident)}.json`);
+
+  let cachedOnDisk: CachedMetadata | null = null;
+
+  // We bypass the on-disk cache for security reasons if the lockfile needs to be refreshed,
+  // since most likely the user is trying to validate the metadata using hardened mode.
+  if (!project.lockfileNeedsRefresh) {
+    try {
+      cachedOnDisk = await xfs.readJsonPromise(identPath) as CachedMetadata;
+
+      if (typeof version !== `undefined` && typeof cachedOnDisk.metadata.versions[version] !== `undefined`) {
+        return cachedOnDisk.metadata;
+      }
+    } catch {}
+  }
+
+  return await get(getIdentUrl(ident), {
+    ...rest,
+    customErrorMessage: customPackageError,
+    configuration,
+    registry,
+    ident,
+    headers: {
+      ...headers,
+      // We set both headers in case a registry doesn't support ETags
+      [`If-None-Match`]: cachedOnDisk?.etag,
+      [`If-Modified-Since`]: cachedOnDisk?.lastModified,
+    },
+    wrapNetworkRequest: async executor => async () => {
+      const response = await executor();
+
+      if (response.statusCode === 304) {
+        if (cachedOnDisk === null)
+          throw new Error(`Assertion failed: cachedMetadata should not be null`);
+
+        return {
+          ...response,
+          body: cachedOnDisk.metadata,
+        };
+      }
+
+      const packageMetadata = pickPackageMetadata(JSON.parse(response.body.toString()));
+
+      PACKAGE_METADATA_CACHE.set(ident.identHash, packageMetadata);
+
+      const metadata: CachedMetadata = {
+        metadata: packageMetadata,
+        etag: response.headers.etag,
+        lastModified: response.headers[`last-modified`],
+      };
+
+      await xfs.mkdirPromise(registryFolder, {recursive: true});
+      await xfs.writeJsonPromise(identPath, metadata, {compact: true});
+
+      return {
+        ...response,
+        body: packageMetadata,
+      };
+    },
+  });
+}
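
The revalidation flow in `getPackageMetadata` (send the stored validators, reuse the cached entry on `304`, otherwise store the new body with its validators) can be sketched in isolation. The block below is a minimal, synchronous illustration and not the PR's implementation: `revalidate`, `Fetcher`, `FakeResponse`, and `fakeRegistry` are hypothetical names, and the fake registry stands in for a real HTTP round trip.

```typescript
type PackageMetadata = {versions: Record<string, unknown>};

type CachedEntry = {
  metadata: PackageMetadata;
  etag?: string;
  lastModified?: string;
};

// Hypothetical minimal response shape; this is not httpUtils.Response.
type FakeResponse = {
  statusCode: number;
  headers: Record<string, string | undefined>;
  body: string;
};

type Fetcher = (headers: Record<string, string | undefined>) => FakeResponse;

function revalidate(cached: CachedEntry | null, fetcher: Fetcher): CachedEntry {
  // Send both validators; they are undefined on a cold cache.
  const response = fetcher({
    'If-None-Match': cached?.etag,
    'If-Modified-Since': cached?.lastModified,
  });

  if (response.statusCode === 304) {
    if (cached === null)
      throw new Error(`Assertion failed: a 304 response requires a cached entry`);

    // Nothing changed on the server: reuse the stored entry as-is.
    return cached;
  }

  // Fresh body: keep it together with the validators for the next request.
  return {
    metadata: JSON.parse(response.body) as PackageMetadata,
    etag: response.headers.etag,
    lastModified: response.headers[`last-modified`],
  };
}

// A fake registry that serves a single document tagged "v1".
const fakeRegistry: Fetcher = headers => {
  if (headers[`If-None-Match`] === `"v1"`)
    return {statusCode: 304, headers: {}, body: ``};

  return {
    statusCode: 200,
    headers: {etag: `"v1"`},
    body: JSON.stringify({versions: {'1.0.0': {}}}),
  };
};

const first = revalidate(null, fakeRegistry);   // cold cache: full download
const second = revalidate(first, fakeRegistry); // warm cache: 304, entry reused
console.log(first.etag, second === first);
```

Returning the stored entry on `304` is what saves bandwidth: the registry only sends headers, and the body comes from disk.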

+type CachedMetadata = {
+  metadata: PackageMetadata;
+  etag?: string;
+  lastModified?: string;
+};
+
+export type PackageMetadata = {
+  'dist-tags': Record<string, string>;
+  versions: Record<string, any>;
+};
+
+const CACHED_FIELDS = [
+  `name`,
+
+  `deprecated`,
+  `dist.tarball`,
+
+  `bin`,
+  `scripts`,
+
+  `os`,
+  `cpu`,
+  `libc`,
+
+  `dependencies`,
+  `dependenciesMeta`,
+  `optionalDependencies`,
+
+  `peerDependencies`,
+  `peerDependenciesMeta`,
+];
+
+function pickPackageMetadata(metadata: PackageMetadata): PackageMetadata {
+  return {
+    'dist-tags': metadata[`dist-tags`],
+    versions: Object.fromEntries(Object.entries(metadata.versions).map(([key, value]) => [
+      key,
+      pick(value, CACHED_FIELDS),
+    ])),
+  };
+}
+
+/**
+ * Used to invalidate the on-disk cache when the format changes.
+ */
+const CACHE_KEY = hashUtils.makeHash(...CACHED_FIELDS).slice(0, 6);
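
The two ideas above, trimming each version to a fixed field list and deriving the cache key from that same list, can be sketched without any dependencies. This is an illustration only: `pickPaths` is a hypothetical, reduced stand-in for lodash's `pick` with dotted paths, and `makeCacheKey` uses a small djb2-style string hash as an assumption in place of `hashUtils.makeHash`, so the keys it produces will not match Yarn's.

```typescript
// Shortened field list, for illustration only.
const FIELDS: readonly string[] = [`name`, `dist.tarball`, `dependencies`];

// Copies only the listed (possibly dotted) paths from `source`.
function pickPaths(source: Record<string, unknown>, paths: readonly string[]): Record<string, unknown> {
  const result: Record<string, unknown> = {};

  for (const path of paths) {
    const segments = path.split(`.`);

    // Walk the source; give up on this path if any segment is missing.
    let value: unknown = source;
    for (const segment of segments) {
      if (typeof value !== `object` || value === null || !(segment in value)) {
        value = undefined;
        break;
      }
      value = (value as Record<string, unknown>)[segment];
    }
    if (typeof value === `undefined`)
      continue;

    // Rebuild the nested shape in the result.
    let target = result;
    for (const segment of segments.slice(0, -1)) {
      const existing = target[segment];
      const next: Record<string, unknown> = typeof existing === `object` && existing !== null
        ? existing as Record<string, unknown>
        : {};
      target[segment] = next;
      target = next;
    }
    target[segments[segments.length - 1] as string] = value;
  }

  return result;
}

// Hashes the field list so that changing the list changes the cache folder name.
function makeCacheKey(fields: readonly string[]): string {
  const input = fields.join(`\0`);
  let hash = 5381;
  for (let i = 0; i < input.length; i++)
    hash = ((hash * 33) ^ input.charCodeAt(i)) >>> 0;

  return (`0000000` + hash.toString(16)).slice(-8).slice(0, 6);
}

const registryVersion = {
  name: `no-deps`,
  description: `dropped from the cache`,
  dist: {tarball: `https://example.invalid/no-deps-1.0.0.tgz`, shasum: `dropped too`},
};

const trimmed = pickPaths(registryVersion, FIELDS);
const cacheKey = makeCacheKey(FIELDS);
console.log(trimmed, cacheKey);
```

Because the key is a function of the field list, adding or removing a cached field moves new entries to a different folder, and stale entries written in the old format are simply never read again.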

+function getRegistryFolder(configuration: Configuration, registry: string) {
+  const metadataFolder = getMetadataFolder(configuration);
+
+  const parsed = new URL(registry);
+  const registryFilename = toFilename(parsed.hostname);
+
+  return ppath.join(metadataFolder, CACHE_KEY as Filename, registryFilename);
+}
+
+function getMetadataFolder(configuration: Configuration) {
+  return ppath.join(configuration.get(`globalFolder`), `npm-metadata`);
Member

I'm thinking perhaps `<global>/cache/metadata` might be a better location, what do you think? Actions that already cache the `<global>/cache` folder would benefit from that out of the box, rather than having to add a new path (or cache the whole `<global>`, which may not be as common).

I'd also suggest moving the `npm-` prefix from the folder to each individual file name, by consistency with the cache itself (which contains files from the npm, git, http fetchers, etc).

Member

Ping on this comment?

Member Author (paul-soporan, Jun 23, 2023)

(Sorry for the delay.)

> I'm thinking perhaps `<global>/cache/metadata` might be a better location, what do you think?

I don't really like the idea: the mirror and the npm metadata cache are two separate things, and merging them into a single folder would lead to confusion. We've also got `index`, which would have to be moved to `cache` too with this line of thinking.

This would just complicate things like `yarn cache clean`, where `--mirror` would have to stop removing the entire folder and just remove the archives.

I'd prefer all of them to be in separate folders. One thing we could do would be to move the mirror to `<global>/cache/mirror` in addition to what you proposed; then I'd be fine with it, and it would still benefit from existing caching. (It would look a bit weird to have the global cache in `<global>/cache/mirror`, but I guess that's just the consequence of the mirror and the global cache being the same thing.)

> I'd also suggest moving the `npm-` prefix from the folder to each individual file name, by consistency with the cache itself (which contains files from the npm, git, http fetchers, etc).

I'd rather not. It would just give the illusion of consistency.

The cache can do it because it's the sole source that controls that folder.

The npm metadata cache is supposed to be specific to the `plugin-npm` resolvers: they are the only ones that populate and control it.

Moving the prefix to the filenames and having the folder called just `metadata` gives the illusion that the folder is a generalized metadata cache. It could indeed become one if other resolvers ever need to cache metadata in it, but the entries would have different shapes (and possibly different filename formats) depending on the resolver that caches the data, and that might lead to confusion.

Edit: The files also wouldn't be in a single folder like the cache, since e.g. for npm we have `npm-metadata/<hash>/<registry>/lodash.json`.

It would also have to be controlled by the core, which would be tasked with automatically generating the paths to ensure that no two resolvers accidentally use the same path (and also to make it possible for us to change the metadata cache path in the future).

In addition, we're not even certain that the current kind of cache is better than a monolithic one; that's something I'm open to experimenting with in the future.

That's why I think that opening the folder to anything but the npm resolvers is not something I want to do yet, if ever.

Member Author

🤔 Thought more about it and I'd be open to making it `<global>/metadata/npm/<hash>/<registry>/lodash.json`.

This way, we still have a common metadata folder but we make it clear that each resolver has to manage it manually.

What do you think?

Member

That seems reasonable. I like the idea of moving the cache into `cache/mirror` (that said, I think we could also just rename `--mirror` into `-g` for the same effect; I think I'd like this even better 🤔).

Member Author

🤔 I still think I'd prefer to keep the mirror, metadata, and index separate.

> I think we could also just rename `--mirror` into `-g` for the same effect

We could have a `-g` too, but I prefer to retain the granularity. `yarn cache clean` will become interactive in a future PR anyway (just like we discussed some time ago), and it will clean everything by default.

+}
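
The layout proposed in the thread above, `<global>/metadata/npm/<hash>/<registry>/lodash.json` (which matches the later commit that moves the `npm-metadata` folder to `metadata/npm`), can be sketched as a single path-building function. This is an assumption-laden illustration: `getMetadataPath` is a hypothetical name, the slug rule is a simplified stand-in for `structUtils.slugifyIdent`, and segments are joined with `/` instead of `ppath.join`.

```typescript
type IdentLike = {scope: string | null, name: string};

// Builds <global>/metadata/npm/<cacheKey>/<registry hostname>/<slug>.json
function getMetadataPath(globalFolder: string, cacheKey: string, registry: string, ident: IdentLike): string {
  // Only the hostname is used, so the protocol, port, and path do not affect the folder.
  const hostname = new URL(registry).hostname;

  // Scoped packages are flattened so that each ident maps to one file name.
  const slug = ident.scope !== null ? `@${ident.scope}-${ident.name}` : ident.name;

  return [globalFolder, `metadata`, `npm`, cacheKey, hostname, `${slug}.json`].join(`/`);
}

console.log(getMetadataPath(`/home/user/.yarn/berry`, `a24526`, `https://registry.yarnpkg.com`, {scope: null, name: `lodash`}));
// /home/user/.yarn/berry/metadata/npm/a24526/registry.yarnpkg.com/lodash.json
```

Keeping the resolver name (`npm`) as a folder under a shared `metadata` root is what makes it clear that each resolver manages its own subtree.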

 export async function get(path: string, {configuration, headers, ident, authType, registry, ...rest}: Options) {
-  if (ident && typeof registry === `undefined`)
-    registry = npmConfigUtils.getScopeRegistry(ident.scope, {configuration});
+  registry = normalizeRegistry(configuration, {ident, registry});
+
   if (ident && ident.scope && typeof authType === `undefined`)
     authType = AuthType.BEST_EFFORT;

-  if (typeof registry !== `string`)
-    throw new Error(`Assertion failed: The registry should be a string`);
-
   const auth = await getAuthenticationHeader(registry, {authType, configuration, ident});
   if (auth)
     headers = {...headers, authorization: auth};
@@ -87,11 +240,7 @@ export async function get(path: string, {configuration, headers, ident, authType
 }

 export async function post(path: string, body: httpUtils.Body, {attemptedAs, configuration, headers, ident, authType = AuthType.ALWAYS_AUTH, registry, otp, ...rest}: Options & {attemptedAs?: string}) {
-  if (ident && typeof registry === `undefined`)
-    registry = npmConfigUtils.getScopeRegistry(ident.scope, {configuration});
-
-  if (typeof registry !== `string`)
-    throw new Error(`Assertion failed: The registry should be a string`);
+  registry = normalizeRegistry(configuration, {ident, registry});

   const auth = await getAuthenticationHeader(registry, {authType, configuration, ident});
   if (auth)
@@ -123,11 +272,7 @@ export async function post(path: string, body: httpUtils.Body, {attemptedAs, con
 }

 export async function put(path: string, body: httpUtils.Body, {attemptedAs, configuration, headers, ident, authType = AuthType.ALWAYS_AUTH, registry, otp, ...rest}: Options & {attemptedAs?: string}) {
-  if (ident && typeof registry === `undefined`)
-    registry = npmConfigUtils.getScopeRegistry(ident.scope, {configuration});
-
-  if (typeof registry !== `string`)
-    throw new Error(`Assertion failed: The registry should be a string`);
+  registry = normalizeRegistry(configuration, {ident, registry});

   const auth = await getAuthenticationHeader(registry, {authType, configuration, ident});
   if (auth)
@@ -159,11 +304,7 @@ export async function put(path: string, body: httpUtils.Body, {attemptedAs, conf
 }

 export async function del(path: string, {attemptedAs, configuration, headers, ident, authType = AuthType.ALWAYS_AUTH, registry, otp, ...rest}: Options & {attemptedAs?: string}) {
-  if (ident && typeof registry === `undefined`)
-    registry = npmConfigUtils.getScopeRegistry(ident.scope, {configuration});
-
-  if (typeof registry !== `string`)
-    throw new Error(`Assertion failed: The registry should be a string`);
+  registry = normalizeRegistry(configuration, {ident, registry});

   const auth = await getAuthenticationHeader(registry, {authType, configuration, ident});
   if (auth)
@@ -194,6 +335,16 @@ export async function del(path: string, {attemptedAs, configuration, headers, id
   }
 }

+function normalizeRegistry(configuration: Configuration, {ident, registry}: Partial<RegistryOptions>): string {
+  if (typeof registry === `undefined` && ident)
+    return npmConfigUtils.getScopeRegistry(ident.scope, {configuration});
+
+  if (typeof registry !== `string`)
+    throw new Error(`Assertion failed: The registry should be a string`);
+
+  return registry;
+}
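
The helper above replaces the same two checks that were previously repeated in `get`, `post`, `put`, and `del`. Its three outcomes (explicit registry wins, otherwise derive one from the ident's scope, otherwise fail) can be illustrated with a self-contained sketch. Everything configuration-related here is hypothetical: `SCOPE_REGISTRIES`, `DEFAULT_REGISTRY`, and this `getScopeRegistry` stand in for `npmConfigUtils.getScopeRegistry` and the real configuration lookup.

```typescript
// Hypothetical scope-to-registry table standing in for the Yarn configuration.
const SCOPE_REGISTRIES: Record<string, string> = {
  example: `https://npm.example.invalid`,
};
const DEFAULT_REGISTRY = `https://registry.yarnpkg.com`;

type IdentLike = {scope: string | null};

// Hypothetical stand-in for npmConfigUtils.getScopeRegistry.
function getScopeRegistry(scope: string | null): string {
  if (scope !== null && Object.prototype.hasOwnProperty.call(SCOPE_REGISTRIES, scope))
    return SCOPE_REGISTRIES[scope] as string;

  return DEFAULT_REGISTRY;
}

function normalizeRegistry({ident, registry}: {ident?: IdentLike, registry?: string}): string {
  // No explicit registry: fall back to the one configured for the ident's scope.
  if (typeof registry === `undefined` && ident)
    return getScopeRegistry(ident.scope);

  // Neither an explicit registry nor an ident: the caller made a mistake.
  if (typeof registry !== `string`)
    throw new Error(`Assertion failed: The registry should be a string`);

  return registry;
}

console.log(normalizeRegistry({registry: `https://explicit.example.invalid`})); // explicit registry wins
console.log(normalizeRegistry({ident: {scope: `example`}}));                    // derived from the scope
console.log(normalizeRegistry({ident: {scope: null}}));                         // default registry
```

Returning a `string` (rather than mutating a possibly undefined variable) also lets callers drop their own `typeof registry !== 'string'` assertion.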

 async function getAuthenticationHeader(registry: string, {authType = AuthType.CONFIGURATION, configuration, ident}: {authType?: AuthType, configuration: Configuration, ident: RegistryOptions['ident']}) {
   const effectiveConfiguration = npmConfigUtils.getAuthConfiguration(registry, {configuration, ident});
   const mustAuthenticate = shouldAuthenticate(effectiveConfiguration, authType);
@@ -242,7 +393,7 @@ function shouldAuthenticate(authConfiguration: MapLike, authType: AuthType) {
   }
 }

-async function whoami(registry: string, headers: {[key: string]: string} | undefined, {configuration}: {configuration: Configuration}) {
+async function whoami(registry: string, headers: {[key: string]: string | undefined} | undefined, {configuration}: {configuration: Configuration}) {
   if (typeof headers === `undefined` || typeof headers.authorization === `undefined`)
     return `an anonymous user`;

4 changes: 2 additions & 2 deletions packages/yarnpkg-core/sources/Plugin.ts
@@ -91,9 +91,9 @@ export interface Hooks {
    * add some logging.
    */
   wrapNetworkRequest?: (
-    executor: () => Promise<any>,
+    executor: () => Promise<httpUtils.Response>,
     extra: WrapNetworkRequestInfo
-  ) => Promise<() => Promise<any>>;
+  ) => Promise<() => Promise<httpUtils.Response>>;

   /**
    * Called before the build, to compute a global hash key that we will use