Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong replacements in targets file #4

Open
Mathebibel opened this issue Jun 12, 2021 · 3 comments
Open

Wrong replacements in targets file #4

Mathebibel opened this issue Jun 12, 2021 · 3 comments
Labels
bug Something isn't working

Comments

@Mathebibel
Copy link

Steps to reproduce the bug

Download one page of my website and all its assets with the command

wget -E -k -K -p https://www.mathebibel.de

Open the folder ẁww.mathebibel.de and run the command

cachekill -l 16 -r -p '{name}.{hash}{ext}' -s '**/*.{svg,jpg,jpeg,png,ico,ttf,woff,woff2}' -t '**/*.{html,css,js,webmanifest}'

Open the file index.html and search for da5a9c97d2054b2a.

You will get the following 6 hits:

  1. bilder/logo.da5a9c97d2054b2a.png
  2. awards/gpi-logo.da5a9c97d2054b2a.png
  3. awards/stern-logo.da5a9c97d2054b2a.png
  4. awards/ntv-logo.da5a9c97d2054b2a.png
  5. bilder/logo.da5a9c97d2054b2a.png
  6. bilder/easy-tutor-logo.da5a9c97d2054b2a.png

Number 1 and 5 are correct replacements. Number 2, 3, 4 and 6 are wrong replacements.

This was just an example. You can find even more wrong replacements in the index.html.

@Mathebibel
Copy link
Author

Mathebibel commented Jun 14, 2021

After some testing I managed to identify the bug.

In the following minimal working examples (MWE) I will use the command cachekill -s 'my-project/**/!(*.html)' -t 'my-project/**/*.html'

MWE 1

.
└── my-project/
    ├── a.jpg
    ├── b-a.jpg
    └── index.html

with index.html is

<img src="a.jpg">
<img src="b-a.jpg">

This works.

MWE 2

.
└── my-project/
    ├── assets/
    │   └── a.jpg
    ├── a.jpg
    └── index.html

with index.html is

<img src="a.jpg">
<img src="assets/a.jpg">

This doesn't work. Both relative paths get the same hash.

Solution for MWE 2

In line 126-133

async function replaceReferences(sourceBases: SourceBases[],
                                         targets: string[]): Promise<void> {
          await replace.replaceInFile({
            from: sourceBases.map(obj => new RegExp(obj.base, 'g')),
            to: sourceBases.map(obj => obj.newBase),
            files: targets
          });
}

we can see that you search for obj.base and replace it with obj.newBase.

  • You should search for obj.path and replace it with obj.newPath.
  • You should start your search with the lowest element in the file hierarchy.

To illustrate the last point:

.
└── my-project/
    ├── assets/
    │   ├── assets/
    │   │   └── a.jpg
    │   └── a.jpg
    ├── a.jpg
    └── index.html

The search/replace order should be:

  1. my-project/assets/assets/a.jpg (This is the lowest element in the file hierarchy!)
  2. my-project/assets/a.jpg
  3. my-project/a.jpg

Additional information

Of course the .jpg files in the examples above have only the same name but they look different - otherwise they would generate the same hash.

MWE 3

.
└── my-project/
    ├── a-b.jpg
    ├── b.jpg
    └── index.html

with index.html is

<img src="b.jpg">
<img src="a-b.jpg">

This doesn't work. Both relative paths get the same hash.

Solution for MWE 3

With line 73

sources.sort().reverse();

you solve the issue

  // Make sure that source files are sorted in a reverse alphabetical order,
  // so files with common filename endings don't get wrongly replaced. E.g.:
  // ['file.js', 'other-file.js'] --> ['other-file.js', 'file.js'].

but it doesn't solve the issue

// ['file-other.js', 'other.js'] --> ['other.js', 'file-other.js'].

Idea 1

I think you can delete this line and solve the issue if you add to your regex in line 129

from: sourcePaths.map(obj => new RegExp(obj.path, 'g')),

the delimiters of the path.

For example instead of searching for other.js we search for "other.js" (and 'other.js' and (other.js) and [other.js]).

I am not very happy with my solution because there are more delimiters and sometimes there are no delimiters (if the file name is mentioned in plain text for example).

Idea 2

Let's look at the following example:

.
└── my-project/
    ├── a.jpg
    ├── a-b.jpg
    ├── b.jpg
    ├── b-a.jpg
    └── index.html

with

sources.sort().reverse();

we get this order

1. b-a.jpg
2. b.jpg
3. a-b.jpg
4. a.jpg

The problem is if we search for b.jpg then a-b.jpg would be a match.

  • Group the file names by their character number
  • Start search/replace with the group with the most characters
  1. Group: a-b.jpg and b-a.jpg
  2. Group: a.jpg and b.jpg
  • Delete .sort().reverse() (Not necessary anymore as soon as we process the group with the most characters first)

Summary

Fix for MWE 2

  • You should search for obj.path and replace it with obj.newPath.
  • You should start your search with the lowest element in the file hierarchy.

On each hierarchy level we have to....

Fix for MWE 3

  • Group the file names by their character number
  • Start search/replace with the group with the most characters
  • Delete .sort().reverse() (Not necessary anymore as soon as we process the group with the most characters first)

Test case

my-project/
  assets/
    assets/
      a.jpg
      a-b.jpg
      b.jpg
      b-a.jpg
    a.jpg
    a-b.jpg
    b.jpg
    b-a.jpg
  a.jpg
  a-b.jpg
  b.jpg
  b-a.jpg
  index.html

with index.html is

<img src="a.jpg">
<img src="a-b.jpg">
<img src="b.jpg">
<img src="b-a.jpg">
<img src="assets/a.jpg">
<img src="assets/a-b.jpg">
<img src="assets/b.jpg">
<img src="assets/b-a.jpg">
<img src="assets/assets/a.jpg">
<img src="assets/assets/a-b.jpg">
<img src="assets/assets/b.jpg">
<img src="assets/assets/b-a.jpg">

(Of course the .jpg files in the example above have only the same name but they look different - otherwise they would generate the same hash.)

@Mathebibel
Copy link
Author

In my test case above I forgot that there are absolute and relative path on all hierarchy levels.

Test case (Updated)

my-project/
  assets/
    assets/
      a.jpg
      a-b.jpg
      b.jpg
      b-a.jpg
      index.html
    a.jpg
    a-b.jpg
    b.jpg
    b-a.jpg
    index.html
  a.jpg
  a-b.jpg
  b.jpg
  b-a.jpg
  index.html

with index.html is

<!-- Relative Path -->
<img src="a.jpg">
<img src="a-b.jpg">
<img src="b.jpg">
<img src="b-a.jpg">
<img src="assets/a.jpg">
<img src="assets/a-b.jpg">
<img src="assets/b.jpg">
<img src="assets/b-a.jpg">
<img src="assets/assets/a.jpg">
<img src="assets/assets/a-b.jpg">
<img src="assets/assets/b.jpg">
<img src="assets/assets/b-a.jpg">
<!-- Absolute Path -->
<img src="/a.jpg">
<img src="/a-b.jpg">
<img src="/b.jpg">
<img src="/b-a.jpg">
<img src="/assets/a.jpg">
<img src="/assets/a-b.jpg">
<img src="/assets/b.jpg">
<img src="/assets/b-a.jpg">
<img src="/assets/assets/a.jpg">
<img src="/assets/assets/a-b.jpg">
<img src="/assets/assets/b.jpg">
<img src="/assets/assets/b-a.jpg">

with assets/index.html is

<!-- Relative Path -->
<img src="../a.jpg">
<img src="../a-b.jpg">
<img src="../b.jpg">
<img src="../b-a.jpg">
<img src="a.jpg">
<img src="a-b.jpg">
<img src="b.jpg">
<img src="b-a.jpg">
<img src="assets/a.jpg">
<img src="assets/a-b.jpg">
<img src="assets/b.jpg">
<img src="assets/b-a.jpg">
<!-- Absolute Path -->
<img src="/a.jpg">
<img src="/a-b.jpg">
<img src="/b.jpg">
<img src="/b-a.jpg">
<img src="/assets/a.jpg">
<img src="/assets/a-b.jpg">
<img src="/assets/b.jpg">
<img src="/assets/b-a.jpg">
<img src="/assets/assets/a.jpg">
<img src="/assets/assets/a-b.jpg">
<img src="/assets/assets/b.jpg">
<img src="/assets/assets/b-a.jpg">

with assets/assets/index.html is

<!-- Relative Path -->
<img src="../../a.jpg">
<img src="../../a-b.jpg">
<img src="../../b.jpg">
<img src="../../b-a.jpg">
<img src="../a.jpg">
<img src="../a-b.jpg">
<img src="../b.jpg">
<img src="../b-a.jpg">
<img src="a.jpg">
<img src="a-b.jpg">
<img src="b.jpg">
<img src="b-a.jpg">
<!-- Absolute Path -->
<img src="/a.jpg">
<img src="/a-b.jpg">
<img src="/b.jpg">
<img src="/b-a.jpg">
<img src="/assets/a.jpg">
<img src="/assets/a-b.jpg">
<img src="/assets/b.jpg">
<img src="/assets/b-a.jpg">
<img src="/assets/assets/a.jpg">
<img src="/assets/assets/a-b.jpg">
<img src="/assets/assets/b.jpg">
<img src="/assets/assets/b-a.jpg">

@eneko89 eneko89 added the bug Something isn't working label Aug 5, 2021
@eneko89
Copy link
Owner

eneko89 commented Aug 5, 2021

Thanks a lot for the great explanation of the problem and solution proposals.

This will be prioritized in the following version I release. As you mention, sorting the array was not an integral solution to the problem. I already thought on the delimiters idea, but abandoned it for the same reason, it was more a dirty workaround than a solution.

Haven't had time to analyze your findings in depth, but grouping by character count and replacing starting with the longest to the shortest group looks like a good idea. Thanks again!

Of course, PRs are always welcome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants