diff --git a/packages/metascraper-readability/README.md b/packages/metascraper-readability/README.md
index f9f6af968..c2d4ccfc8 100644
--- a/packages/metascraper-readability/README.md
+++ b/packages/metascraper-readability/README.md
@@ -14,6 +14,19 @@
 $ npm install metascraper-readability --save
 ```
 
+## API
+
+### metascraper-readability([options])
+
+#### options
+
+##### getDocument
+
+Type: `function`
+Default: [source code](https://github.com/microlinkhq/metascraper/blob/master/packages/metascraper-readability/src/index.js#L14-L20)
+
+The function used to parse the HTML markup into a DOM document.
+
 ## License
 
 **metascraper-readability** © [Microlink](https://microlink.io), released under the [MIT](https://github.com/microlinkhq/metascraper/blob/master/LICENSE.md) License.
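For reference, a minimal usage sketch of the new `getDocument` option, e.g. to keep parsing on jsdom instead of the happy-dom default introduced by this change. The option signature (`({ url, html }) => document`) follows `defaultGetDocument` in `src/index.js`; the example URL and HTML are illustrative, and jsdom is assumed to be installed separately:

```js
// Hypothetical usage: swap the DOM backend used by metascraper-readability.
const metascraper = require('metascraper')([
  require('metascraper-readability')({
    // Same shape as the default implementation, but backed by jsdom.
    getDocument: ({ url, html }) => {
      const { JSDOM, VirtualConsole } = require('jsdom')
      const dom = new JSDOM(html, { url, virtualConsole: new VirtualConsole() })
      return dom.window.document
    }
  })
])

const url = 'https://example.com/article' // illustrative
const html = '<html><head><title>Example</title></head><body><article><p>Hello world</p></article></body></html>'

metascraper({ url, html }).then(metadata => console.log(metadata))
```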
diff --git a/packages/metascraper-readability/benchmark/fixture.html b/packages/metascraper-readability/benchmark/fixture.html
new file mode 100644
index 000000000..abf16abe3
--- /dev/null
+++ b/packages/metascraper-readability/benchmark/fixture.html
@@ -0,0 +1,1345 @@
+
+2412.06592
+
+
+

PrEditor3D: Fast and Precise 3D Shape Editing

+

Ziya Erkoç1

+

Can Gümeli1

+

Chaoyang Wang2

+

Matthias Nießner1

+

Angela Dai1

+

Peter Wonka2,3

+

Hsin-Ying Lee2

+

Peiye Zhuang2

+

1Technical University of Munich

+

2Snap Inc

+

3King Abdullah University of Science and Technology

+

https://ziyaerkoc.com/preditor3d

+ +

[Figure 1 graphic: a Time (1 min, 4 min, 1 hr) vs. CLIPdir-cos (1, 3, 5, 7) plot comparing Vox-E, MV-Edit, Tailor3D, and Ours, alongside editing examples such as “chicken in a racing car”, “viking with a mustache, helmet” → “… helmet party hat … red pepper mustache …”, “… racing car jeep … chicken cat …”, “cannon in a wooden cart” → “… cannon flamethrower …” / “… cannon trebuchet …”, “truck carrying clay” → “… clay pizza …”, “chicken” → “cat with a tail …” / “chicken dog”, “android” → “android chicken wearing a tie”, and “house in a forest” → “… house treasure cave …”.]

+

Figure 1. PrEditor3D is a (top) fast and high-quality editing method that can perform precise and consistent editing only in the intended

+

regions, keeping the rest identical. (mid) It can handle diverse editing prompts with any given 3D object. (bottom) Furthermore, it can

+

support iterative editing, facilitating artistic workflow, and can also support editing multiple regions in a single run.

+

Abstract

+

We propose a training-free approach to 3D editing that

+

enables the editing of a single shape within a few minutes.

+

The edited 3D mesh aligns well with the prompts, and re-

+

mains identical for regions that are not intended to be al-

+

tered. To this end, we first project the 3D object onto 4-view

+

images and perform synchronized multi-view image edit-

+

ing along with user-guided text prompts and user-provided

+

rough masks. However, the targeted regions to be edited

+

are ambiguous due to projection from 3D to 2D. To en-

+

sure precise editing only in intended regions, we develop

+

a 3D segmentation pipeline that detects edited areas in 3D

+

space, followed by a merging algorithm to seamlessly in-

+

tegrate edited 3D regions with the original input. Exten-

+

sive experiments demonstrate the superiority of our method

+

over previous approaches, enabling fast, high-quality edit-

+

ing while preserving unintended regions.

+


+
+
+

1. Introduction

+

Recent 3D diffusion models can generate high-quality as-

+

sets that closely align with the text prompts in the form of

+

neural fields [22, 37], meshes [2, 38], or Gaussian point

+

clouds [47]. Although these methods generate impressive

+

results, they lack the essential capability for precise and

+

controllable editing of the generated outputs, a critical re-

+

quirement for iterative artistic workflows. Effective 3D edit-

+

ing demands: (1) it should be fast enough to provide quick

+

feedback, ideally comparable to fast 3D generation algo-

+

rithms, and (2) it must allow for precise local control, en-

+

abling users to keep specific parts of the model unchanged.

+

Enabling precise and controllable editing is still an open

+

challenge. Several initial approaches have been proposed to

+

tackle the challenge of 3D editing [8, 9, 13, 16, 31, 36], pro-

+

viding promising results but suffering from slow runtime,

+

lack of precise control, and/or lack of 3D consistency and

+

quality. Optimization-based techniques like SDS [9, 13, 36]

+

or multiview training dataset updates [16] are computation-

+

ally expensive, making interactive editing out of reach.

+

Additionally, they offer limited control over specific

+

parts of the shape, as text prompts alone cannot precisely

+

localize regions to be edited [8, 9, 16, 31, 36]. While Vox-

+

E [36] and Shap-Editor [9] propose a mechanism to prevent

+

original parts of the shape from being altered during editing,

+

they do not enable precise editing due to having only text as

+

input. Finally, one can observe various visual quality prob-

+

lems, such as the Janus problem, blurring, over-saturation,

+

and overemphasizing texture changes while leaving the ge-

+

ometry intact or degrading.

+

To address these challenges, we propose a novel editing

+

pipeline for 3D assets that is faster, more precise, and de-

+

livers high-quality results (See Fig. 1). As our primary goal

+

is faster editing, we propose an editing framework lever-

+

aging a pipeline that consists of two components: a multi-

+

view diffusion algorithm and a feed-forward mesh recon-

+

struction. Multi-view diffusion models can leverage supe-

+

rior 2D editing techniques, and the feed-forward mesh re-

+

construction bridges the gap between 2D and 3D. For bet-

+

ter controllability, we extend multi-view image generation

+

to multi-view image editing using 2D masks to constrain

+

the edits to user-specified regions. The 2D masks can take

+

various forms, including manually selected regions, hand-

+

brushed areas, or automatically generated segmentations.

+

We adopt DDPM inversion [20] to extract initial noise vec-

+

tors from input multi-view images and execute Prompt-to-

+

Prompt [17] on a multi-view diffusion model [37]. We use

+

2D user-provided masks to blend edited and original views

+

during the denoising. However, due to the inherent ambigu-

+

ity caused by projection from 3D to 2D, we cannot obtain

+

ideal intended regions in 2D regardless of the granularity of

+

the masks, as shown in Fig. 2. However, the masks are often

+

too rough to precisely capture the intended semantic editing

+ + + + + + +

[Figure 2 diagram: user-provided mask granularity ranging from coarse to fine under the 3D → 2D projection; coarse masks alter unintended regions, fine masks limit the editing regions.]

+

Figure 2. Ambiguous intended regions. The intended region to

+

be edited is clear in 3D (e.g. the cat tail). However, after projecting

+

to 2D, regardless of the granularity of the user-provided masks, the

+

editing will either alter some unintended regions (e.g. the robot

+

cat) or be too limited for reasonable editing.

+

regions. The masks are either too coarse so the unintended

+

regions will be changed, or too fine-grained to allow rea-

+

sonable editing. Without additional spatial information in

+

3D, multi-view editing approaches cannot fully address this

+

challenge. Therefore, simply adopting a feed-forward re-

+

construction method [50] to convert edited multi-views into

+

a 3D mesh often leads to undesirable results.

+

To tackle this issue, we propose using the original 3D in-

+

put and 3D segmentation. We first detect the intended edit-

+

ing region using Grounding DINO [23] and SAM 2 [34]

+

with the user mask and prompts. This gives an initial 2D

+

segmentation that we subsequently lift to 3D. For this pur-

+

pose, we use color coding in a multi-view to 3D reconstruc-

+

tion pipeline, named GTR [50], to end up with a 3D seg-

+

mentation that we can use during merging. Specifically, we

+

paint 2D segmentations in a specific color (e.g., green), and

+

after reconstruction, we can detect which regions are edited

+

by querying the color in the 3D field.

+

Then, we perform merging to maintain the original parts

+

of the shape. We use 3D masks to detect edited/replaced

+

parts in GTR’s [50] voxel-feature space. We extract the

+

edited part from the new shape and replace it with the old

+

part from the original shape. That way, we guarantee that

+

the remaining shape will remain identical. We apply a final

+

average blending operation so that new parts and the origi-

+

nal shape blend smoothly.

+

In summary, we make the following contributions:

+

• We propose a novel method for diffusion-based 3D ob-

+

ject editing that is faster than previous work and enables

+

precise, interactive editing.

+

• The proposed method consists of user-guided and multi-

+

view-synchronized editing and a feed-forward 3D recon-

+

struction, enabling fast editing in a feed-forward manner.

+

• To enable precise editing only to intended regions, we

+

propose a voxel-based 3D segmentation method that uti-

+

lizes multi-view segmentation information and propa-

+

gates it to 3D, followed by an average blending operation

+

to merge the edited and original objects.

+

• Our editing method has superior quality to previous work.

+

We show significant improvements in GPTEval3D [44],

+

directional CLIP metrics, and extensive user studies.

+
+
+ + + + + + + + + + + + + + +

[Figure 3 diagram: Input 3D → Rendered Multi-View + User-Provided Mask → Multi-View Editing (“chicken in a racing car” → “chicken happy ginger cat …”) → Grounding + SAM 2 → 3D Reconstruction (Original Voxel Feature / Edited Voxel Feature, Edited Multi-View) → Merging & Rendering → Edited 3D.]

+

Figure 3. Overview of PrEditor3D. Given an input 3D object, we first render its multi-view images from 4 orthogonal views. We then

+

obtain editing input from the user, describing in text as well as rough 2D masks the desired edits. We perform synchronized multi-view

+

editing based on the text prompts as well as the user-provided masks (Sec. 3.1). Due to the rough masks and the unclear intended regions
caused by the ambiguous 3D-to-2D projection, we detect the intended regions with Grounding DINO and SAM 2 (Sec. 3.2), where the segmentation
results are lifted to 3D for the final merging operation (Sec. 3.3).

+

2. Related Works

+

2D Editing applies global or local modifications to an im-

+

age based on user instructions.

+

It has gained significant

+

attention for enabling a more interactive user experience

+

in content creation. To achieve this, existing methods ei-

+

ther fine-tune text-to-image models with specialized in-

+

structional editing datasets [45, 49], or use training-free ap-

+

proaches with inpainting [3, 29] or cross-attention mech-

+

anisms [6, 7, 17]. To edit user-provided images, various

+

diffusion inversion techniques have been developed. For

+

example, SDEdit [25] introduces noise to images to capture

+

intermediate steps in the diffusion process, null-text inver-

+

sion [27] inverts the deterministic DDIM inversion [39], and

+

recent works [4, 20] use DDPM inversion [18] to enhance

+

editing capabilities.

+

In this work, one step to achieve 3D editing is to per-

+

form 2D multi-view image editing. We apply the DDPM

+

inversion [18] and Prompt2Prompt [17] to operate within a

+

diffusion-based sparse multi-view generation model.

+

3D Reconstruction from Sparse Multi-Views refers to the

+

task of reconstructing a 3D instance from a limited number

+

of multi-view images. To achieve this, Score Distillation

+

Sampling (SDS) [30] and its variants [10, 22, 32, 41, 42]

+

optimize a 3D scene representation by reconstructing the

+

given sparse views and generating novel views through gra-

+

dients from large-scale pre-trained text-to-image diffusion

+

models [35]. However, these approaches are often com-

+

putationally demanding. Beyond per-scene optimization,

+

PixelNeRF [48] generates a NeRF representation [26] of

+

a scene from sparse-view images by training the model

+

across multiple scenes. With recent advancements in large-

+

scale 3D datasets such as Objaverse [12], follow-up works

+

[19, 21, 24, 28, 40, 43, 46, 50] have improved this feed-

+

forward 3D reconstruction approach for object-centric 3D

+

assets. In this work, we utilize one recent method, GTR

+

[50], to quickly generate 3D meshes from 4 views.

+

3D Editing presents additional challenges compared to 2D

+

editing due to the need to maintain spatial consistency in

+

3D. To address this, some methods train 3D editing models

+

using paired 3D datasets [1, 45], however, these approaches

+

are limited by a lack of diverse and complex datasets. Other

+

methods adapt 2D editing models, such as InstructPix2Pix

+

[5], for the 3D domain [16, 31], by iteratively updating

+

multi-view images of a scene. However, without synchro-

+

nized multi-view updates, this approach often results in

+

flickering or inconsistent views. Alternatively, some meth-

+

ods propose a generation-reconstruction loop to modify 3D

+

representations using intermediate denoised images [8] or

+

SDS gradients [9, 11, 13, 36] in the diffusion process. While

+

these methods can achieve 3D consistency, they often strug-

+

gle with quality or suffer from high computational costs. In

+

our work, we perform multi-view editing and reconstruct

+

the edited object in 3D. Beyond that, to preserve unchanged

+

regions, we carefully design an approach by detecting the

+

edited 3D regions and integrating the intended 3D edited

+

regions into the original shapes.

+

3. Method

+

We aim to achieve fast 3D asset editing in a training-free

+

manner, allowing for precise and user-guided edits. Our

+

approach achieves this through multi-view image editing

+

in 2D, followed by lifting the 2D edits into 3D. This pro-

+

cess can be summarized into 3 main steps: (1) synchronized

+

sparse multi-view editing in 2D (Sec.3.1), (2) detecting in-

+

tended editing regions across 2D views through the Ground-

+
+
+

ing Dino [23] and SAM 2 [34] approach (Sec.3.2), and (3)

+

lifting the intended editing regions to 3D and merging the

+

edited shape into the original (Sec.3.3). Our approach is

+

illustrated in Fig. 3.

+

3.1. Synchronized Sparse Multi-View Editing

+

To edit a given 3D object O, we leverage the power of

+

2D editing through multi-view diffusion models. We first

+

perform synchronized sparse multi-view edits using a pre-

+

trained multi-view diffusion model. In practice, we use MV-

+

Dream [37], which generates 4 orthogonal views based on a

+

text prompt. Note that our approach remains agnostic to the

+

specific diffusion-based multi-view generation model used.

+

We first render multi-view images from O, then apply

+

the DDPM diffusion inversion mechanism [20] to revert the

+

multi-view images to their initial noise vectors, denoted as

+

xT , where T is the number of diffusion timesteps during

+

the denoising process. We will use these vectors xT as the

+

initial latent vectors for the editing process in diffusion. We

+

denote the text prompt for the input shape and the edited

+

shape as yi and ye, respectively. Also, to enable better edit-

+

ing control and interaction, we further take user-provided

+

masks in 4 views as input. These masks indicate target regions for the edit, denoted as $M_{\text{user}} \in \mathbb{R}^{4 \times H \times W}$, where $H$ and $W$ are the height and width of the images. Note that our method does not require precise and accurate masks.

+

To edit multi-view images, we apply the Prompt-to-

+

Prompt [17] approach on the multi-view diffusion model.

+

For simplicity, we present the basic operation at each diffu-

+

sion timestep and within each attention block. To be spe-

+

cific, Prompt-to-Prompt [17] generates an edited latent vector $x'_e$ by replacing the self- and cross-attention weights of the original latent vector, $x_i$, with the edited prompt $y_e$. To confine modifications to the desired regions, we blend the latent vectors $x'_e$ and $x_i$ using the user-provided masks $M_{\text{user}}$. We denote the final edited latent vector at each step as $x_e$:

$$x_e = M_{\text{user}} \cdot x'_e + (1 - M_{\text{user}}) \cdot x_i. \tag{1}$$

+

In practice, we downsample the user-provided masks Muser

+

to match the feature resolution at each model layer, and the

+

edited latent vector, xe, serves as the input to the next model

+

layer.
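To make the masked blending of Eq. (1) concrete, here is a minimal JavaScript sketch (JavaScript matching the surrounding repository); the flat per-view latent layout and the already-downsampled mask are illustrative assumptions, not MVDream's actual implementation:

```js
// Hypothetical sketch of Eq. (1): blend edited and original latents with a user mask.
// edited, original: Float32Array latents of length H*W*C for one view
// mask: Float32Array of length H*W with values in [0, 1], already downsampled to this layer
const blendLatents = (edited, original, mask, { H, W, C }) => {
  const out = new Float32Array(edited.length)
  for (let p = 0; p < H * W; p++) {
    const m = mask[p]
    for (let c = 0; c < C; c++) {
      const j = p * C + c
      // x_e = M_user * x'_e + (1 - M_user) * x_i
      out[j] = m * edited[j] + (1 - m) * original[j]
    }
  }
  return out
}
```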

+

Using the inverted noise vector xT ensures that the

+

edited results align with the original texture style. How-

+

ever, the masks Muser are often imprecise or misaligned

+

with the target regions, which can still affect regions that are

+

not intended to be altered. Furthermore, occlusions along

+

the depth dimension introduce additional challenges in ac-

+

curately and semantically localizing the intended editing re-

+

gions in 3D. We illustrate the issue in Fig. 2. To address this,

+

we apply an automated grounding approach that detects the

+

intended editing region in both 2D and 3D.

+

Algorithm 1: Merging Voxel Features
Input: $V_i, V_e \in \mathbb{R}^{A \times A \times A \times F}$; $M_i, M_e \in \mathbb{R}^{A \times A \times A}$; $d \in \mathbb{N}$; $\theta \in [0, 1]$
Output: $V_{\text{blend}} \in \mathbb{R}^{A \times A \times A \times F}$
1: $V_i[M_i] \leftarrow \emptyset$
2: $V_i[M_e] \leftarrow V_e[M_e]$
3: $N \leftarrow \text{Dilation}(M_e, d)$
4: $K \leftarrow M_e \oplus N$   // $\oplus$ means XOR
5: $V_{\text{blend}} \leftarrow V_i$
6: $V_{\text{blend}}[K] \leftarrow \theta \, V_i[K] + (1 - \theta) \, V_e[K]$
7: return $V_{\text{blend}}$
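As an illustration of Algorithm 1, the following JavaScript sketch merges two flattened voxel-feature grids; the typed-array layout, the naive box dilation, the zero-filled stand-in for ∅, and the default d and θ are assumptions for illustration, not the paper's code:

```js
// Hypothetical sketch of Algorithm 1 on flat typed arrays (not the paper's code).
// vi, ve: Float32Array of length A*A*A*F (original / edited voxel features)
// mi, me: Uint8Array of length A*A*A (binary 3D masks for original / edited regions)
const mergeVoxelFeatures = (vi, ve, mi, me, { A, F, d = 2, theta = 0.5 }) => {
  const idx = (x, y, z) => (x * A + y) * A + z

  // Naive box dilation over the A×A×A grid (assumption: a cubic neighborhood of radius d).
  const dilate = (mask, radius) => {
    const out = new Uint8Array(mask.length)
    for (let x = 0; x < A; x++)
      for (let y = 0; y < A; y++)
        for (let z = 0; z < A; z++) {
          let hit = 0
          for (let dx = -radius; dx <= radius && !hit; dx++)
            for (let dy = -radius; dy <= radius && !hit; dy++)
              for (let dz = -radius; dz <= radius && !hit; dz++) {
                const nx = x + dx
                const ny = y + dy
                const nz = z + dz
                if (nx >= 0 && ny >= 0 && nz >= 0 && nx < A && ny < A && nz < A) {
                  hit = mask[idx(nx, ny, nz)]
                }
              }
          out[idx(x, y, z)] = hit
        }
    return out
  }

  const blended = Float32Array.from(vi) // V_blend starts from V_i
  const dilated = dilate(me, d) // N = Dilation(M_e, d)
  for (let v = 0; v < A * A * A; v++) {
    const onBoundary = dilated[v] !== me[v] // K = M_e XOR N
    for (let f = 0; f < F; f++) {
      const j = v * F + f
      if (mi[v] && !me[v]) blended[j] = 0 // V_i[M_i] <- "empty" (zeros as a stand-in)
      if (me[v]) blended[j] = ve[j] // V_i[M_e] <- V_e[M_e]
      if (onBoundary) blended[j] = theta * blended[j] + (1 - theta) * ve[j] // smooth blend on K
    }
  }
  return blended
}
```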

+

3.2. Detection of Intended Editing Regions in 2D

+

The intended editing regions refer to the specific semantic

+

areas that correspond to the editing prompt. For example, in

+

Fig. 1, changing “chicken” to “cat” implies that the desired

+

editing region pertains only to the areas representing the

+

“chicken” and “cat” concepts. Regions (both 2D and 3D)

+

that are not semantically related to these concepts should

+

remain unchanged.

+

To ensure that only intended editing regions are edited

+

while allowing rough user-provided masks, we propose to

+

detect the intended editing regions by applying Grounding

+

Dino [23] and SAM 2 [34] to both the original and the

+

edited multi-view images. To begin with, we identify the

+

changing concept by comparing the original prompt yi and

+

the editing prompt ye. We then localize the changing con-

+

cept within the user-provided mask regions Muser and ob-

+

tain corresponding bounding boxes in multiple views. For-

+

mally, we write the procedure as

$$\text{bbox}_x = \text{Grounding}(x, M_{\text{user}}, y_i, y_e), \tag{2}$$

where $\text{bbox}_x$ are the bounding boxes for the changing concept in the multi-view images $x \in \{x_i, x_e\}$.¹

+

Afterward, we segment and track the changing ele-

+

ments across views using the grounding bounding boxes

+

bboxx.

+

Formally, this procedure can be described as

+

SAM(x, bboxx). This process yields the intended editing

+

regions in segmentation format for both the original and

+

edited multi-views.

+

3.3. Lifting and Merging Edits in 3D

+

Finally, we lift the 2D edits to 3D and merge the 3D-edited

+

regions into the original shape.

+

3D Segmentation by Lifting 2D Segmentations. We mark

+

the intended editing regions on multi-view images using

+

a green color.

+

The color-coded multi-view images are

+

then reconstructed in 3D using an offline 3D reconstruction

+

¹We use x to represent both latent vectors and images, without distin-

+

guishing between the two.

+
+
+

[Figure 4 grid: rows Input, Tailor3D [31], MVEdit [8], Vox-E [36], Ours; edit prompts shown: “chicken dog”, “clay pizza”, “Oreo pizza”, “castle skyscraper … mustache red pepper”.]

+

Figure 4. Qualitative comparison. Our method can perform diverse editing samples and only edit the intended regions.

+

model [50]. This model takes multiple views as input and

+

represents the shape as a triplane feature. Two separate de-

+

coders—one for Signed Distance Function (SDF) and one

+

for color— generate a geometry field and a color field, re-

+

spectively. Through this process, we create a 3D segmen-

+

tation field from the color-coded multi-view images. For

+

each 3D position within the 3D space represented by the

+

triplane, we determine whether the position lies within the

+

3D intended editing regions based on its color value. That

+

is, we apply a decision threshold for the distance between

+

each color value and the preset green color to identify tar-

+

geted regions. This produces two 3D masks, indicating the

+

intended editing regions in 3D for both the original shape

+

and the edited shape, denoted as Mi and Me, respectively.
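A rough JavaScript sketch of this color-keyed lifting step; the RGB value range, the preset green, and the threshold are illustrative assumptions rather than the paper's settings:

```js
// Hypothetical sketch: build a boolean 3D mask by thresholding the distance of each
// voxel's decoded color to the preset segmentation green.
// colors: Float32Array of length A*A*A*3 with RGB values in [0, 1] (assumed layout)
const liftColorMask = (colors, { A, green = [0, 1, 0], threshold = 0.3 }) => {
  const mask = new Uint8Array(A * A * A)
  for (let v = 0; v < A * A * A; v++) {
    const dr = colors[v * 3] - green[0]
    const dg = colors[v * 3 + 1] - green[1]
    const db = colors[v * 3 + 2] - green[2]
    // Mark the voxel as part of the intended editing region when its color is close to green.
    mask[v] = Math.sqrt(dr * dr + dg * dg + db * db) < threshold ? 1 : 0
  }
  return mask
}
```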

+

Merging Edits in 3D. As aforementioned, coarse user-

+

provided masks and occlusion issues in 2D projections of-

+

ten result in unwanted alterations, compromising the preser-

+

vation of unaffected 3D regions. Therefore, the directly re-

+

constructed shapes from the edited multi-view images us-

+

ing the reconstruction model [50] cannot be the final editing

+

output, as illustrated in Fig. 7.

+

Using the 3D reconstruction model [50], we extract

+

voxel features for both original and edited shapes by inter-

+

polating their triplane features, denoted as Vi and Ve for

+

the original and edited shapes, respectively. Both $V_i, V_e \in \mathbb{R}^{A \times A \times A \times F}$, where $A$ is the voxel resolution and $F$ is the feature dimension ($A = 256$ and $F = 40$ in practice).

+

To merge 3D features, we first nullify the original specific regions $M_i$ from the original voxel feature $V_i$, and then replace the target edited regions $M_e$ with the edited feature $V_e[M_e]$. We write the above operations as follows:

$$V_i[M_i] \leftarrow \emptyset, \quad \text{and} \quad V_i[M_e] \leftarrow V_e[M_e]. \tag{3}$$

+

We refer to this approach as a naive copy-paste method.

+

While theoretically plausible, we observe that this straight-

+

forward approach typically introduces discontinuities at the

+
+
+ + +

[Figure 5 examples: “skull holding a sword” → “… sword viking axe”; “tomato with fork behind its head” → “… fork worm …”; “cowboy holding a pistol” → “.. with a robot arm ...”; “… robot dog tail” → “robot cat with robot tail”; “hogwarts castle with main tower in the middle” → “main batman tower …”; “chicken in a racing car” → “chicken cat with a tail in a racing car”; “sofa with no pillow” → “no red pillow”; “pink circular monster” → “… with sunglasses”; “sculpture holding a stone basket” → “… stone fruit”.]

+

Figure 5. More editing results from PrEditor3D. Our method can perform a wide range of editing on various shapes.

+

| Method        | Prompt Algn. | 3D Plausibility | Texture | Overall |
|---------------|--------------|-----------------|---------|---------|
| Tailor3D [31] | 98%          | 99%             | 99%     | 99%     |
| MVEdit [8]    | 57%          | 55%             | 55%     | 57%     |
| Vox-E [36]    | 53%          | 68%             | 50%     | 55%     |

Table 1. Comparison using GPTEval3D [44]. Scores indicate the percentage of our method being selected over baselines.

+

| Method        | Prompt Algn. | Visual Quality | Preserving Shape |
|---------------|--------------|----------------|------------------|
| Tailor3D [31] | 96%          | 97%            | 99%              |
| MVEdit [8]    | 68%          | 68%            | 97%              |
| Vox-E [36]    | 78%          | 88%            | 89%              |

Table 2. User study results comparing our method against baselines. The percentage shows the preference for our method.

+

3D editing boundaries, as shown in Fig. 7. To address this,

+

we propose an averaged merging approach that provides a

+

more robust blend of 3D features. In the improved method,

+

we dilate the 3D mask Me by a dilation d and then use an

+

exclusive or (a.k.a. XOR) operation to select the boundary

+

mask regions for smooth blending. Next, we linearly inter-

+

polate the two voxel features Vi and Ve within the bound-

+

ary regions using a coefficient θ, in practice θ = 0.5. We

+

illustrate the merging process in Alg. 1. After merging, we

+

generate a textured mesh from the blended voxel feature,

+

using the decoders in the 3D reconstruction model [50].

+

4. Experiments

+

4.1. Evaluation

+

Our evaluation dataset contains 18 unique shapes and 40

+

editing prompts. We use shapes from GSO [14] and Obja-

+

verse [12]. We evaluate our method based on the quality of

+

the editing and consistency with the input shapes.

+

We use the GPTEval3D [44] metric to evaluate the

+

quality of edited shapes and their alignment with the text

+

prompts. GPT-4V is provided with multi-view renderings of

+

two methods at a time, and instructed to pick one based on

+

text-prompt alignment, 3D plausibility, and texture details.

+

There were 120 total questions, each answered 3 times.

+

Since the GPTEval3D metric does not consider the in-

+

put shape, it cannot measure whether the shape remained

+

intact, and whether the edited shape is consistent with the

+

high-level style of the input shape. Therefore, we adopt

+

the directional CLIP [33] score metric, CLIPdir from pre-

+

vious works [15, 36]. CLIPdir evaluates the average dif-

+

ference between text feature change direction and image

+

feature change direction, where images are multi-view ren-

+

derings of input and edited shapes. To ensure our evalua-

+

tion is not affected by a particular implementation of this

+

metric, we introduce three variants. CLIPdir-cos replaces the

+

text-image direction vector difference with cosine distance,

+

while CLIPdir-avg and CLIPdir-avg-cos compute the same met-

+

rics by averaging image vectors first rather than scores.

+

We finally introduce two additional directional metrics.

+
+
+

| Method            | CLIPdir | CLIPdir-cos | CLIPdir-avg | CLIPdir-avg-cos | CLIPdiff-edit | CLIPdiff-noedit |
|-------------------|---------|-------------|-------------|-----------------|---------------|-----------------|
| Tailor3D [31]     | 1.416   | 3.710       | 1.417       | 4.593           | 12.050        | 5.619           |
| MVEdit [8]        | 0.782   | 2.578       | 0.783       | 3.164           | 10.014        | 4.118           |
| Vox-E [36]        | 1.622   | 6.178       | 1.621       | 8.153           | 10.528        | 3.432           |
| PrEditor3D (Ours) | 1.782   | 7.679       | 1.782       | 11.554          | 8.812         | 2.636           |

Table 3. Directional CLIP score metrics [36] for evaluating editing fidelity and prompt consistency. Our method outperforms baselines across all directional CLIP metrics. Metrics are scaled by 100 to ease reading and allow for more precision.

+

| Method            | Editing | Merging | Total  |
|-------------------|---------|---------|--------|
| Tailor3D [31]     | 26 sec  | -       | 26 sec |
| MVEdit [8]        | 6 min   | -       | 6 min  |
| Vox-E [36]        | 60 min  | 15 min  | 75 min |
| PrEditor3D (Ours) | 24 sec  | 50 sec  | 74 sec |

Table 4. Runtime comparison. We measure the runtime of our method and the baseline methods.

+

CLIPdiff-edit is the CLIP score difference between input and

+

output image-text pairs concerning only the edited part of

+

the input and output text prompts. CLIPdiff-noedit is the CLIP

+

score difference between input and output using a fixed text

+

prompt where the edited part of the input text is replaced

+

with a generic word, i.e., “object.” These metrics enforce

+

that the CLIP text matching scores are preserved between

+

the input and edited shapes, both for edited and unedited

+

parts of the text.

+

We report all directional metrics multiplied by 100 for

+

higher precision. We refer readers to our supplementary

+

material for further details about the evaluation metrics.

+

4.2. Results

+

Our method can generate various impressive edited shapes

+

from complex input shapes and prompts. We illustrate a

+

variety of our results in Fig. 5. Our approach flexibly edits

+

various different elements of the 3D objects, for instance re-

+

placing a “sword” of a skull warrior with a “viking axe,” re-

+

sulting in coherent, seamless edits in both texture and geom-

+

etry. Our edits also follow the structure of the input shape

+

when applicable; for instance, when replacing a curvy fork

+

with a worm, the worm maintains the same curved structure

+

as the initial fork. We can also insert new objects, such as

+

a “pillow” or “sunglasses.” Our method even enables both

+

replacement and addition at the same time as in “cat with a

+

tail” example, replacing the chicken with a cat and simulta-

+

neously placing a tail at the back of the car.

+

Given the same prompt, our method can generate differ-

+

ent results with different seeds. In Fig. 6, we show gen-

+

erations of “cat” and “dog” samples with various seeds.

+

The resulting shapes vary in their head, eye, and ear struc-

+

tures with different colors and sizes. We can further con-

+

trol the various aspects of the generated shape through user

+

prompts. This is also shown in Fig. 6, where we adjust the

+

mood and appearance of the generated shape.

+ + + + + + +

[Figure 6 examples: “chicken cat in a racing car” and “chicken dog in a racing car” generated with different seeds; prompt variations “happy ginger cat …”, “sad ginger cat …”, “ginger cat with sunglasses …”.]

+

Figure 6. Multiple generations and detailed control through

+

prompt. Our method can generate different results for the same

+

prompt using different seeds. Moreover, Our method can handle

+

detailed prompts that can modify various aspects of the shape such

+

as appearance and mood. For instance, here we can define the type

+

of the cat (e.g. ginger cat) and the mood (e.g., happy).

+

Comparison with Baselines. Fig. 4 shows a qualitative

+

comparison of our method against several state-of-the-art

+

methods: Tailor3D [31], MVEdit [8] and Vox-E [36]. Our

+

approach shows significant improvements, in both editing

+

quality as well as consistency with the original shape.

+

Similar to our method, Vox-E allows controllable edit-

+

ing through merging but at the expense of an expensive

+

SDS-based optimization that can tend towards more global

+

changes than local ones.

+

Since Tailor3D accepts edited

+

front and back views as input, we ran their method using

+

our multi-view editing results. Tab. 1 shows a comparison

+

using GPTEval3D. While our method is consistently pre-

+

ferred, improvements are not as large since this metric does

+

not measure consistency with the input. A method could

+

globally change the shape and still achieve better results.

+

This is because this metric does not take input shape into ac-

+
+
+ + + + + + + + +

[Figure 7 panels: Input, w/o merging, w/o avg merging, Ours.]

+

Figure 7. Qualitative ablation of our merging algorithm. We

+

can keep the original parts of the input fixed. Here when we insert

+

a cat, the editing breaks the neighboring regions. Thanks to our

+

merging algorithm, we can recover the original parts of the shape.

+

| Method              | Chamfer Distance |
|---------------------|------------------|
| w/o Merging         | 2.95             |
| w/o Average Merging | 2.29             |
| Ours                | 2.28             |

Table 5. Quantitative ablation study of our merging algorithm. We calculate the chamfer distance to the input shape for each ablation. Chamfer Distance value is multiplied by 10³. Our algorithm is effective in keeping the edited shape consistent with the input.

+

count and only considers edited output and the prompt. To

+

complement this metric, we calculated the directional CLIP

+

score and its few variants in Tab. 3; this considers consis-

+

tency with the input, and demonstrates that our approach

+

achieves significant improvements over the baselines.

+

Perceptual Study. We prepare a perceptual study to com-

+

pare our method with three other baselines asking users

+

three different questions: “Select the one that follows the

+

following prompt more closely”, “Select the one with better

+

visual quality”, “Which example better preserves the parts

+

that were not instructed to be edited with the prompt?”.

+

There are 360 questions in total, each answered by 10 dif-

+

ferent participants, totaling 3600 responses. Results are pre-

+

sented in Tab. 2. In all questions, our method is preferred

+

over the baselines.

+

Runtime Analysis. Our method enables fast iteration, tak-

+

ing around 24 seconds to obtain initial multi-view editing

+

results. Merging then takes another 50 seconds to produce

+

a final refined shape. Tab. 4 shows a comparison with base-

+

lines, using a single RTX 3090 for measurements, except for

+

MVDream, which we run on RTX A6000. Tailor3D [31],

+

concurrent to our work, also operates fast, taking 2 seconds

+

for a forward pass using our multi-view editing results as in-

+

put (26 seconds in total). MVEdit [8] does not employ any

+

merging, performing editing in around 6 minutes. Since

+

Vox-E [36] involves a long SDS optimization process, its

+

overall inference can take around an hour.

+

4.3. Ablations

+

Tab. 5 and Fig. 7 ablate our merging approach, measuring

+

the chamfer distance between the edited and input shapes.
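For reference, a minimal JavaScript sketch of a symmetric chamfer distance as used in this ablation; the brute-force nearest-neighbor search and the averaging convention are assumptions for illustration:

```js
// Hypothetical sketch: symmetric chamfer distance between two point sets.
// a, b: arrays of [x, y, z] points sampled from the edited and input meshes
const chamferDistance = (a, b) => {
  const nearestSq = (p, points) => {
    let best = Infinity
    for (const q of points) {
      const d = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 + (p[2] - q[2]) ** 2
      if (d < best) best = d
    }
    return best
  }
  const meanNearest = (from, to) =>
    from.reduce((sum, p) => sum + nearestSq(p, to), 0) / from.length
  // Average both directions so neither missing nor extra geometry is ignored.
  return (meanNearest(a, b) + meanNearest(b, a)) / 2
}
```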

+

Shape Preservation through Merging. Our merging al-

+

gorithm ensures that only the regions described by the user

+

through a mask and prompt change. In this ablation study,

+

we only do multi-view editing, and leave out the merging

+

operation. As shown in Fig. 7, without any merging, regions

+

that are not intended by the user can change. In the “chicken

+

in a racing car” example, when the user replaces the chicken

+

with a cat, some part of the car is also altered since the user

+

mask covers that area. In our merging step, we detect the

+

changed region (“cat”) and erased region (“chicken”) so that

+

we keep the rest of the shape (“car”) fixed.

+

Average Merging. After edited regions are detected, we

+

merge the voxel grids of the input and edited reconstructions

+

to preserve consistency with the input. In contrast, Vox-

+

E [36] uses copy-pasting for merging. That is, they copy

+

the detected part from the edited shape and paste it into the

+

original shape. However, a simple copy-paste approach can

+

create boundary artifacts such as gaps between the edited

+

region and the original shape, as shown in Fig. 7. To fix

+

these boundary problems, we dilate the masks. Within the

+

dilated region, we take the average of the edited shape and

+

the original shape, which provides a smoother transition.

+

Limitations.

+

Although our method can generate high-

+

quality editing results, we are limited by the 256x256 reso-

+

lution of the multi-view diffusion model, MVDream [37].

+

In addition, our method currently focuses on 3D assets

+

that can be rendered from four inward-facing views. How-

+

ever, this assumption cannot effectively capture large-scale

+

scenes, such as indoor rooms where more views within the

+

scene are needed.

+

5. Conclusion

+

We propose a fast and controllable 3D editing method

+

that can handle a wide variety of 3D shapes and editing

+

prompts.

+

We employ the editing strength of powerful

+

multi-view models, lift edits to 3D, and merge edits in 3D

+

in order to ensure unedited regions remain consistent with

+

the input shape. Hence, our method produces high-quality

+

editing results with fast runtime speeds. We believe this

+

shows a significant potential for high-quality, controllable,

+

seamless, and fast 3D editing.

+

Acknowledgments This work is partially done during

+

Ziya’s and Can’s internships at Snap. Matthias Nießner was

+

supported by the ERC Starting Grant Scan2CAD (804724)

+

and Angela Dai was supported by the ERC Starting Grant

+

SpatialSem (101076253).

+
+
+

References

+

[1] Panos Achlioptas, Ian Huang, Minhyuk Sung, Sergey

+

Tulyakov, and Leonidas Guibas.

+

Shapetalk: A language

+

dataset and framework for 3d shape edits and deformations.

+

In CVPR, 2023. 3

+

[2] Antonio Alliegro, Yawar Siddiqui, Tatiana Tommasi, and

+

Matthias Nießner. Polydiff: Generating 3d polygonal meshes

+

with diffusion models.

+

arXiv preprint arXiv:2312.11417,

+

2023. 2

+

[3] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended

+

diffusion for text-driven editing of natural images. In CVPR,

+

pages 18208–18218, 2022. 3

+

[4] Manuel Brack, Felix Friedrich, Katharia Kornmeier, Linoy

+

Tsaban,

+

Patrick Schramowski,

+

Kristian Kersting,

+

and

+

Apolinário Passos. Ledits++: Limitless image editing using

+

text-to-image models. In CVPR, 2024. 3

+

[5] Tim Brooks, Aleksander Holynski, and Alexei A Efros. In-

+

structpix2pix: Learning to follow image editing instructions.

+

In CVPR, 2023. 3

+

[6] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xi-

+

aohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mu-

+

tual self-attention control for consistent image synthesis and

+

editing. In ICCV, 2023. 3

+

[7] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and

+

Daniel Cohen-Or.

+

Attend-and-excite: Attention-based se-

+

mantic guidance for text-to-image diffusion models. ACM

+

TOG, 2023. 3

+

[8] Hansheng Chen, Ruoxi Shi, Yulin Liu, Bokui Shen, Ji-

+

ayuan Gu, Gordon Wetzstein, Hao Su, and Leonidas Guibas.

+

Generic 3d diffusion adapter using controlled multi-view

+

editing. arXiv preprint arXiv:2403.12032, 2024. 2, 3, 5,

+

6, 7, 8

+

[9] Minghao Chen, Junyu Xie, Iro Laina, and Andrea Vedaldi.

+

Shap-editor: Instruction-guided latent 3d editing in seconds.

+

In CVPR, 2024. 2, 3

+

[10] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fan-

+

tasia3d: Disentangling geometry and appearance for high-

+

quality text-to-3d content creation. In ICCV, 2023. 3

+

[11] Dale Decatur, Itai Lang, Kfir Aberman, and Rana Hanocka.

+

3d paintbrush: Local stylization of 3d shapes with cascaded

+

score distillation. In CVPR, 2024. 3

+

[12] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs,

+

Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana

+

Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse:

+

A universe of annotated 3d objects. In CVPR, 2023. 3, 6

+

[13] Shaocong Dong, Lihe Ding, Zhanpeng Huang, Zibin Wang,

+

Tianfan Xue, and Dan Xu. Interactive3d: Create what you

+

want by interactive 3d generation. In CVPR, 2024. 2, 3

+

[14] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kin-

+

man, Ryan Hickman, Krista Reymann, Thomas B McHugh,

+

and Vincent Vanhoucke. Google scanned objects: A high-

+

quality dataset of 3d scanned household items.

+

In ICRA,

+

2022. 6

+

[15] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano,

+

Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-

+

guided domain adaptation of image generators. ACM TOG,

+

2022. 6, 1

+

[16] Ayaan Haque, Matthew Tancik, Alexei A Efros, Aleksander

+

Holynski, and Angjoo Kanazawa. Instruct-nerf2nerf: Edit-

+

ing 3d scenes with instructions. In ICCV, 2023. 2, 3

+

[17] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman,

+

Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image

+

editing with cross attention control. In ICLR, 2022. 2, 3, 4

+

[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu-

+

sion probabilistic models. In NeurIPS, 2020. 3

+

[19] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou,

+

Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao

+

Tan. Lrm: Large reconstruction model for single image to

+

3d. In ICLR, 2024. 3

+

[20] Inbar Huberman-Spiegelglas, Vladimir Kulikov, and Tomer

+

Michaeli. An edit friendly ddpm noise space: Inversion and

+

manipulations. In CVPR, 2024. 2, 3, 4, 1

+

[21] Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun

+

Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg

+

Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with

+

sparse-view generation and large reconstruction model. In

+

ICLR, 2024. 3

+

[22] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa,

+

Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler,

+

Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution

+

text-to-3d content creation. In CVPR, 2023. 2, 3

+

[23] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao

+

Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang,

+

Hang Su, et al.

+

Grounding dino:

+

Marrying dino with

+

grounded pre-training for open-set object detection. arXiv

+

preprint arXiv:2303.05499, 2023. 2, 4, 1

+

[24] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu,

+

Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang,

+

Marc Habermann, Christian Theobalt, et al. Wonder3d: Sin-

+

gle image to 3d using cross-domain diffusion.

+

In CVPR,

+

2023. 3

+

[25] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia-

+

jun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided

+

image synthesis and editing with stochastic differential equa-

+

tions. In ICLR, 2022. 3

+

[26] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik,

+

Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf:

+

Representing scenes as neural radiance fields for view syn-

+

thesis. In ECCV, 2020. 3

+

[27] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and

+

Daniel Cohen-Or. Null-text inversion for editing real images

+

using guided diffusion models. In CVPR, 2023. 3

+

[28] Bharath Raj Nagoor Kani, Hsin-Ying Lee, Sergey Tulyakov,

+

and Shubham Tulsiani. Upfusion: Novel view diffusion from

+

unposed sparse view observations. In ECCV, 2025. 3

+

[29] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav

+

Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and

+

Mark Chen. Glide: Towards photorealistic image generation

+

and editing with text-guided diffusion models. arXiv preprint

+

arXiv:2112.10741, 2021. 3

+

[30] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden-

+

hall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR,

+

2023. 3

+
+
+

[31] Zhangyang Qi, Yunhan Yang, Mengchen Zhang, Long Xing,

+

Xiaoyang Wu, Tong Wu, Dahua Lin, Xihui Liu, Jiaqi Wang,

+

and Hengshuang Zhao. Tailor3d: Customized 3d assets edit-

+

ing and generation with dual-side images.

+

arXiv preprint

+

arXiv:2407.06191, 2024. 2, 3, 5, 6, 7, 8

+

[32] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren,

+

Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Sko-

+

rokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123:

+

One image to high-quality 3d object generation using both

+

2d and 3d diffusion priors. In ICLR, 2024. 3

+

[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya

+

Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,

+

Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn-

+

ing transferable visual models from natural language super-

+

vision. In ICML, 2021. 6, 1

+

[34] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang

+

Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman

+

Rädle, Chloe Rolland, Laura Gustafson, et al.

+

Sam 2:

+

Segment anything in images and videos.

+

arXiv preprint

+

arXiv:2408.00714, 2024. 2, 4, 1

+

[35] Robin Rombach, Andreas Blattmann, Dominik Lorenz,

+

Patrick Esser, and Björn Ommer. High-resolution image syn-

+

thesis with latent diffusion models. In CVPR, 2022. 3

+

[36] Etai Sella, Gal Fiebelman, Peter Hedman, and Hadar

+

Averbuch-Elor. Vox-e: Text-guided voxel editing of 3d ob-

+

jects. In ICCV, 2023. 2, 3, 5, 6, 7, 8, 1

+

[37] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li,

+

and Xiao Yang. Mvdream: Multi-view diffusion for 3d gen-

+

eration. In ICLR, 2024. 2, 4, 8

+

[38] Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana

+

Tommasi, Daniele Sirigatti, Vladislav Rosov, Angela Dai,

+

and Matthias Nießner. Meshgpt: Generating triangle meshes

+

with decoder-only transformers. In CVPR, 2024. 2

+

[39] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois-

+

ing diffusion implicit models. In ICLR, 2020. 3

+

[40] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang,

+

Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian

+

model for high-resolution 3d content creation. arXiv preprint

+

arXiv:2402.05054, 2024. 3

+

[41] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh,

+

and Greg Shakhnarovich. Score jacobian chaining: Lifting

+

pretrained 2d diffusion models for 3d generation. In CVPR,

+

2023. 3

+

[42] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan

+

Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and

+

diverse text-to-3d generation with variational score distilla-

+

tion. In NeurIPS, 2023. 3

+

[43] Zhengyi Wang, Yikai Wang, Yifei Chen, Chendong Xi-

+

ang, Shuo Chen, Dajiang Yu, Chongxuan Li, Hang Su,

+

and Jun Zhu.

+

Crm: Single image to 3d textured mesh

+

with convolutional reconstruction model.

+

arXiv preprint

+

arXiv:2403.05034, 2024. 3

+

[44] Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu,

+

Leonidas Guibas, Dahua Lin, and Gordon Wetzstein. Gpt-4v

+

(ision) is a human-aligned evaluator for text-to-3d genera-

+

tion. In CVPR, 2024. 2, 6

+

[45] Jiale Xu, Xintao Wang, Yan-Pei Cao, Weihao Cheng, Ying

+

Shan, and Shenghua Gao.

+

Instructp2p: Learning to edit

+

3d point clouds with text instructions.

+

arXiv preprint

+

arXiv:2306.07154, 2023. 3

+

[46] Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Ji-

+

ahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein,

+

Zexiang Xu, et al. Dmv3d: Denoising multi-view diffusion

+

using 3d large reconstruction model. In ICLR, 2024. 3

+

[47] Taoran Yi, Jiemin Fang, Junjie Wang, Guanjun Wu, Lingxi

+

Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang

+

Wang. Gaussiandreamer: Fast generation from text to 3d

+

gaussians by bridging 2d and 3d diffusion models. In CVPR,

+

2024. 2

+

[48] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa.

+

pixelNeRF: Neural radiance fields from one or few images.

+

In CVPR, 2021. 3

+

[49] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su.

+

Magicbrush: A manually annotated dataset for instruction-

+

guided image editing. In NeurIPS, 2024. 3

+

[50] Peiye Zhuang, Songfang Han, Chaoyang Wang, Aliak-

+

sandr Siarohin, Jiaxu Zou, Michael Vasilkovsky, Vladislav

+

Shakhrai, Sergey Korolev, Sergey Tulyakov, and Hsin-

+

Ying Lee. Gtr: Improving large 3d reconstruction models

+

through geometry and texture refinement.

+

arXiv preprint

+

arXiv:2406.05649, 2024. 2, 3, 5, 6, 1

+
+
+

6. Appendix

+

We present additional details about PrEditor3D in this ap-

+

pendix. We start by explaining some of the implementa-

+

tion details in Sec. 6.1. In Sec. 6.2, we discuss automatic

+

masking, an alternative to user-brushed masking. Sec. 6.3

+

follows this discussion with the effect of mask granularity

+

on the editing process. Finally, we explain the directional

+

CLIP metrics we used for baseline comparison in Sec. 6.4.

+

6.1. Implementation Details

+

We used the official implementation and checkpoint of MV-

+

Dream as our multi-view diffusion model. It has 256 x 256

+

resolution and it can generate four views by default. In all of

+

our generations, we set the classifier-free guidance scale of

+

the diffusion process to 10. The official DDPM inversion [20] implementation only handles a single image, but we modified it to handle our four-view renderings. The inversion process

+

takes 9 seconds on RTX 3090. With the inverted latents,

+

we ran our inference for 41 steps, which takes around 12

+

seconds on an RTX 3090. For the segmentation, we cal-

+

culate bounding boxes using Grounding DINO [23] for all

+

views and add these as constraints to SAM 2 [34] tracking.

+

That is to help SAM 2 with the segmentation, we constrain

+

each frame separately. For merging and reconstruction, we

+

modify GTR [50], which is a feed-forward reconstruction

+

model. GTR mainly operates on triplanes but just before re-

+

construction, those triplanes are converted into a voxel grid.

+

We manipulated the voxel grid it generated to merge two

+

different shapes.

+

6.2. Automatic Masking

+

In addition to user-brushed masks, we can also gener-

+

ate and operate on automatically generated masks. Even

+

though they limit the editing region compared to user-brushed masks, they can be practically used as a starting point for user-brushed masking.

+

We leverage our segmentation approach to replace masks

+

given by the user. We use an input prompt from the user

+

to detect the target region using Grounding DINO [23] and

+

SAM 2 [34]. This segmentation method gives us a mask

+

restricted only to the sword. As a result, the generation pro-

+

cess cannot go beyond that region. However, when we ac-

+

cept input from user masks, the user can explicitly show their intention with the mask and can generate a “viking axe”, as

+

shown in Fig. 9.

+

We want to reiterate that although the user-brushed

+

masks are too coarse and not 3D-consistent, our method can

+

generate impressive results without modifying the original

+

parts of the shape. That is, a quickly drawn mask is enough

+

for our method to work.

+ +

[Figure 8 panels: Mask and Edited Shape shown at dilation levels -10, 0, +10, +20, +30.]

+

Figure 8. Different granularity of masking. Too fine-grained

+

masks can over-constrain the generation process since they only

+

point to the region to be replaced but do not include the user’s in-

+

tention. More dilation increases flexibility but can also edit more

+

regions than intended (e.g., the region underneath the cat). Nega-

+

tive dilation means erosion.

+

6.3. Mask Granularity

+

We experimented with different granularity levels for the

+

input masks. We started with a mask that we detected au-

+

tomatically using Grounding DINO [23] and SAM 2 [34].

+

As shown in Fig. 8, if we use the original segmentation, the generation is restricted to that region and the model has no room to add “cat” features. That is,

+

it tries to follow the shape of the original chicken. As we

+

add more dilation, it tries to add features like cat ears. This

+

shows the trade-off between loyalty to input and flexibility.

+

Based on this observation, we gave coarse masks as input

+

and allowed the model to edit flexibly. Thanks to our merg-

+

ing approach, we could still combine the edited region with

+

the original shape to keep the rest intact.

+

6.4. Directional CLIP Metrics

+

In Sec. 4.1-4.2, we discuss directional CLIP score met-

+

rics [15, 33, 36] to evaluate 3D editing fidelity, to comple-

+

ment other quantitative metrics that measure the quality of

+

the output shape. We report directional CLIP scores of dif-

+

ferent methods in Tab. 3 of the main paper. In this section,

+

we formally define and discuss the reported metrics.

+

$$\text{CLIP}_{\text{dir}} = \frac{1}{N} \sum_{i=1}^{N} \left\langle F^i_{IE} - F^i_{II},\; F_{TE} - F_{TI} \right\rangle, \tag{4}$$

where $\langle \cdot, \cdot \rangle$ refers to an inner product, $F^i_{IE}$, $F^i_{II}$ are the normalized CLIP image embeddings over rendered images of the input and edited shapes, indexed by $i$, and $F_{TE}$, $F_{TI}$ are the corresponding normalized text embeddings of the edited and input prompts. $i$ indexes a particular frame, while $N$ is the total number of rendered frames. In our directional CLIP evaluations, we use $N = 70$ views rendered over a 360° trajectory, significantly larger than the four input views we use for our method and the baseline methods.

+
+
+

[Figure 9 panels: Automatically Generated Mask vs. User-Brushed Mask.]

+ + + + + + + + + + +

Figure 9. Comparing automatically generated mask to user-

+

generated mask. Users may want to do specific editing such as

+

replacing the “sword” with “a viking axe”. If we only rely on

+

automatic masking, the result may not follow the user’s intention

+

since the automatically generated mask can limit the editing to a

+

certain region. However, when we rely on explicit masking, we

+

can get the specific shape requested by the user.

+

We also introduce additional metrics inspired by CLIPdir,

+

but aim to fix some of its problems. First, we define

+

$$\text{CLIP}_{\text{dir-cos}} = \frac{1}{N} \sum_{i=1}^{N} C\!\left(F^i_{IE} - F^i_{II},\; F_{TE} - F_{TI}\right), \tag{5}$$

where $C(\cdot, \cdot)$ is the cosine distance.

+

We also introduce two modified versions of these met-

+

rics, namely

+

$$\text{CLIP}_{\text{dir-avg}} = \left\langle \frac{1}{N} \sum_{i=1}^{N} \left(F^i_{IE} - F^i_{II}\right),\; F_{TE} - F_{TI} \right\rangle \tag{6}$$

$$\text{CLIP}_{\text{dir-avg-cos}} = C\!\left(\frac{1}{N} \sum_{i=1}^{N} \left(F^i_{IE} - F^i_{II}\right),\; F_{TE} - F_{TI}\right) \tag{7}$$

+

that compute the same metrics over the average image em-

+

beddings instead of averaging scores to ensure further ro-

+

bustness.

+

We also propose two similarity change error metrics, CLIPdiff-edit and CLIPdiff-noedit:

$$\text{CLIP}_{\text{diff-edit}} = \frac{1}{N} \sum_{i=1}^{N} \left| C(F^i_{II}, F_{TW}) - C(F^i_{IE}, F_{TW}) \right|_{\text{rel}} \tag{8}$$

$$\text{CLIP}_{\text{diff-noedit}} = \frac{1}{N} \sum_{i=1}^{N} \left| C(F^i_{II}, F_{TG}) - C(F^i_{IE}, F_{TG}) \right|_{\text{rel}}. \tag{9}$$

Here, $|x - y|_{\text{rel}} = \frac{|x - y|}{\max(x, y)}$, $F_{TW}$ is the text embedding of the edited word or phrase, and $F_{TG}$ represents the “generic” text. For instance, when the prompt “a chicken riding a bike” becomes “cat riding a bike”, $F_{TW}$ embeds the text “cat” and $F_{TG}$ embeds the text “object riding a bike”. By measuring similarity differences of rendered images to $F_{TW}$ and $F_{TG}$, we aim to measure the preservation of the object and context semantics, respectively.
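To make the metric definitions concrete, here is a small JavaScript sketch of CLIPdir and CLIPdir-cos over precomputed, already-normalized CLIP embeddings; the array-of-arrays layout and treating C(·, ·) as cosine similarity are assumptions for illustration:

```js
// Hypothetical sketch of Eq. (4)/(5) over precomputed, normalized CLIP embeddings.
// imagesInput[i], imagesEdited[i]: image embeddings (number[]) per rendered view
// textInput, textEdited: text embeddings (number[]) of the input and edited prompts
const sub = (a, b) => a.map((v, k) => v - b[k])
const dot = (a, b) => a.reduce((s, v, k) => s + v * b[k], 0)
const norm = a => Math.sqrt(dot(a, a))
const cosine = (a, b) => dot(a, b) / (norm(a) * norm(b) || 1)

const clipDir = (imagesInput, imagesEdited, textInput, textEdited, useCosine = false) => {
  const textDelta = sub(textEdited, textInput)
  let total = 0
  for (let i = 0; i < imagesInput.length; i++) {
    const imageDelta = sub(imagesEdited[i], imagesInput[i])
    // Inner product gives CLIPdir; cosine gives CLIPdir-cos.
    total += useCosine ? cosine(imageDelta, textDelta) : dot(imageDelta, textDelta)
  }
  return total / imagesInput.length // average over the N rendered views
}
```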

+
+
+
\ No newline at end of file
diff --git a/packages/metascraper-readability/benchmark/index.js b/packages/metascraper-readability/benchmark/index.js
new file mode 100644
index 000000000..b550a59b3
--- /dev/null
+++ b/packages/metascraper-readability/benchmark/index.js
@@ -0,0 +1,42 @@
+'use strict'
+
+const { readFileSync } = require('fs')
+
+const url = 'https://arxiv.org/pdf/2412.06592'
+const html = readFileSync('./fixture.html', 'utf8')
+
+const jsdom = () => {
+  const { JSDOM, VirtualConsole } = require('jsdom')
+  const dom = new JSDOM(html, { url, virtualConsole: new VirtualConsole() })
+  return dom.window.document
+}
+
+const happydom = () => {
+  const { Window } = require('happy-dom')
+  const window = new Window({ url })
+  const document = window.document
+  document.documentElement.innerHTML = html
+  return document
+}
+
+const { Readability } = require('@mozilla/readability')
+
+const measure = fn => {
+  const now = Date.now()
+  const parsed = new Readability(fn()).parse()
+  return { parsed, duration: Date.now() - now }
+}
+
+const jsdomResult = measure(jsdom)
+const happydomResult = measure(happydom)
+
+const isEqual = (value1, value2) =>
+  JSON.stringify(value1) === JSON.stringify(value2)
+
+if (!isEqual(jsdomResult.parsed, happydomResult.parsed)) {
+  console.error('Results are different')
+  process.exit(1)
+}
+
+console.log(`   jsdom: ${jsdomResult.duration}ms`)
+console.log(`happydom: ${happydomResult.duration}ms`)
diff --git a/packages/metascraper-readability/benchmark/package.json b/packages/metascraper-readability/benchmark/package.json
new file mode 100644
index 000000000..cdd63c1d4
--- /dev/null
+++ b/packages/metascraper-readability/benchmark/package.json
@@ -0,0 +1,9 @@
+{
+  "name": "@metascraper-readability/benchmark",
+  "private": true,
+  "version": "1.0.0",
+  "devDependencies": {
+    "dom-parser": "latest",
+    "happy-dom": "latest"
+  }
+}
diff --git a/packages/metascraper-readability/package.json b/packages/metascraper-readability/package.json
index 0da91c17e..9c1f29bc9 100644
--- a/packages/metascraper-readability/package.json
+++ b/packages/metascraper-readability/package.json
@@ -25,7 +25,7 @@
   "dependencies": {
     "@metascraper/helpers": "workspace:*",
     "@mozilla/readability": "~0.5.0",
-    "jsdom": "~25.0.1"
+    "happy-dom": "~16.5.3"
   },
   "devDependencies": {
     "ava": "5",
diff --git a/packages/metascraper-readability/src/index.js b/packages/metascraper-readability/src/index.js
index 1c39083c0..599c3b020 100644
--- a/packages/metascraper-readability/src/index.js
+++ b/packages/metascraper-readability/src/index.js
@@ -1,9 +1,7 @@
 'use strict'
 
 const { memoizeOne, composeRule } = require('@metascraper/helpers')
-
 const { Readability } = require('@mozilla/readability')
-const { JSDOM, VirtualConsole } = require('jsdom')
 
 const parseReader = reader => {
   try {
@@ -13,15 +11,25 @@ const parseReader = reader => {
   }
 }
 
-const readability = memoizeOne((url, html) => {
-  const dom = new JSDOM(html, { url, virtualConsole: new VirtualConsole() })
-  const reader = new Readability(dom.window.document)
-  return parseReader(reader)
-}, memoizeOne.EqualityFirstArgument)
+const defaultGetDocument = ({ url, html }) => {
+  const { Window } = require('happy-dom')
+  const window = new Window({ url })
+  const document = window.document
+  document.documentElement.innerHTML = html
+  return document
+}
+
+module.exports = ({ getDocument = defaultGetDocument } = {}) => {
+  const readability = memoizeOne((url, html, getDocument) => {
+    const document = getDocument({ url, html })
+    const reader = new Readability(document)
+    return parseReader(reader)
+  }, memoizeOne.EqualityFirstArgument)
 
-const getReadbility = composeRule(($, url) => readability(url, $.html()))
+  const getReadbility = composeRule(($, url) =>
+    readability(url, $.html(), getDocument)
+  )
 
-module.exports = () => {
   return {
     author: getReadbility({ from: 'byline', to: 'author' }),
     description: getReadbility({ from: 'excerpt', to: 'description' }),
diff --git a/packages/metascraper-readability/test/snapshots/index.js.md b/packages/metascraper-readability/test/snapshots/index.js.md
index ff8a6a2a5..8ad5da2e7 100644
--- a/packages/metascraper-readability/test/snapshots/index.js.md
+++ b/packages/metascraper-readability/test/snapshots/index.js.md
@@ -46,8 +46,8 @@ Generated by [AVA](https://avajs.dev).
 
     {
       author: null,
-      description: null,
+      description: 'Virtual Tour of 219 Shale Rd.',
      lang: null,
       publisher: null,
-      title: null,
+      title: '219 Shale Rd - Virtual Tour',
     }
diff --git a/packages/metascraper-readability/test/snapshots/index.js.snap b/packages/metascraper-readability/test/snapshots/index.js.snap
index 1ce165382..6507358ba 100644
Binary files a/packages/metascraper-readability/test/snapshots/index.js.snap and b/packages/metascraper-readability/test/snapshots/index.js.snap differ