diff --git a/packages/metascraper-readability/README.md b/packages/metascraper-readability/README.md
index f9f6af968..c2d4ccfc8 100644
--- a/packages/metascraper-readability/README.md
+++ b/packages/metascraper-readability/README.md
@@ -14,6 +14,19 @@
$ npm install metascraper-readability --save
```
+## API
+
+### metascraper-readability([options])
+
+#### options
+
+##### getDocument
+
+Type: `function`
+Default: [source code](https://github.com/microlinkhq/metascraper/blob/master/packages/metascraper-readability/src/index.js#L14-L20)
+
+The function to be called to serialize the HTML markup into a DOM document.
+
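+Below is a minimal usage sketch. The exact arguments passed to `getDocument` are not spelled out here, so the example assumes it receives the `html` and `url` and returns a DOM `Document`, mirroring the jsdom-based default linked above; check the linked source for the precise signature.
+
+```js
+const { JSDOM } = require('jsdom') // requires the `jsdom` package
+
+const metascraper = require('metascraper')([
+  require('metascraper-readability')({
+    // Assumed signature: ({ html, url }) => Document.
+    // Only override this if you need a specific parser; otherwise the default is used.
+    getDocument: ({ html, url }) => new JSDOM(html, { url }).window.document
+  })
+])
+
+const url = 'https://example.com'
+const html = '<html><body><article><p>Hello world</p></article></body></html>'
+
+metascraper({ html, url }).then(metadata => console.log(metadata))
+```
+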
## License
**metascraper-readability** © [Microlink](https://microlink.io), released under the [MIT](https://github.com/microlinkhq/metascraper/blob/master/LICENSE.md) License.
diff --git a/packages/metascraper-readability/benchmark/fixture.html b/packages/metascraper-readability/benchmark/fixture.html
new file mode 100644
index 000000000..abf16abe3
--- /dev/null
+++ b/packages/metascraper-readability/benchmark/fixture.html
@@ -0,0 +1,1345 @@
+
PrEditor3D: Fast and Precise 3D Shape Editing
+Ziya Erkoç 1
+Can Gümeli 1
+Chaoyang Wang 2
+Matthias Nießner 1
+Angela Dai 1
+Peter Wonka 2,3
+Hsin-Ying Lee 2
+Peiye Zhuang 2
+1 Technical University of Munich
+2 Snap Inc.
+3 King Abdullah University of Science and Technology
+https://ziyaerkoc.com/preditor3d
+(Figure 1 graphic: example edits such as “chicken in a racing car” → “cat”, “viking with a mustache, helmet” → “party hat” / “red pepper mustache”, “cannon in a wooden cart” → “flamethrower” / “trebuchet”, “truck carrying clay” → “clay pizza”, iterative edits (“android” → “android chicken wearing a tie”), multi-region edits (“house in a forest” → “house treasure cave”), and a plot of editing time (1 min to 1 hr) vs. CLIPdir-cos comparing Vox-E, MV-Edit, Tailor3D, and Ours.)
+Figure 1. PrEditor3D is a (top) fast and high-quality editing method that can perform precise and consistent editing only in the intended
+regions, keeping the rest identical. (mid) It can handle diverse editing prompts with any given 3D object. (bottom) Furthermore, it can
+support iterative editing, facilitating artistic workflow, and can also support editing multiple regions in a single run.
+Abstract
+We propose a training-free approach to 3D editing that
+enables the editing of a single shape within a few minutes.
+The edited 3D mesh aligns well with the prompts, and re-
+mains identical for regions that are not intended to be al-
+tered. To this end, we first project the 3D object onto 4-view
+images and perform synchronized multi-view image edit-
+ing along with user-guided text prompts and user-provided
+rough masks. However, the targeted regions to be edited
+are ambiguous due to projection from 3D to 2D. To en-
+sure precise editing only in intended regions, we develop
+a 3D segmentation pipeline that detects edited areas in 3D
+space, followed by a merging algorithm to seamlessly in-
+tegrate edited 3D regions with the original input. Exten-
+sive experiments demonstrate the superiority of our method
+over previous approaches, enabling fast, high-quality edit-
+ing while preserving unintended regions.
+1. Introduction
+Recent 3D diffusion models can generate high-quality as-
+sets that closely align with the text prompts in the form of
+neural fields [22, 37], meshes [2, 38], or Gaussian point
+clouds [47]. Although these methods generate impressive
+results, they lack the essential capability for precise and
+controllable editing of the generated outputs, a critical re-
+quirement for iterative artistic workflows. Effective 3D edit-
+ing demands: (1) it should be fast enough to provide quick
+feedback, ideally comparable to fast 3D generation algo-
+rithms, and (2) it must allow for precise local control, en-
+abling users to keep specific parts of the model unchanged.
+Enabling precise and controllable editing is still an open
+challenge. Several initial approaches have been proposed to
+tackle the challenge of 3D editing [8, 9, 13, 16, 31, 36], pro-
+viding promising results but suffering from slow runtime,
+lack of precise control, and/or lack of 3D consistency and
+quality. Optimization-based techniques like SDS [9, 13, 36]
+or multiview training dataset updates [16] are computation-
+ally expensive, making interactive editing out of reach.
+Additionally, they offer limited control over specific
+parts of the shape, as text prompts alone cannot precisely
+localize regions to be edited [8, 9, 16, 31, 36]. While Vox-
+E [36] and Shap-Editor [9] propose a mechanism to prevent
+original parts of the shape from being altered during editing,
+they do not enable precise editing due to having only text as
+input. Finally, one can observe various visual quality prob-
+lems, such as the Janus problem, blurring, over-saturation,
+and overemphasizing texture changes while leaving the ge-
+ometry intact or degrading.
+To address these challenges, we propose a novel editing
+pipeline for 3D assets that is faster, more precise, and de-
+livers high-quality results (See Fig. 1). As our primary goal
+is faster editing, we propose an editing framework lever-
+aging a pipeline that consists of two components: a multi-
+view diffusion algorithm and a feed-forward mesh recon-
+struction. Multi-view diffusion models can leverage supe-
+rior 2D editing techniques, and the feed-forward mesh re-
+construction bridges the gap between 2D and 3D. For bet-
+ter controllability, we extend multi-view image generation
+to multi-view image editing using 2D masks to constrain
+the edits to user-specified regions. The 2D masks can take
+various forms, including manually selected regions, hand-
+brushed areas, or automatically generated segmentations.
+We adopt DDPM inversion [20] to extract initial noise vec-
+tors from input multi-view images and execute Prompt-to-
+Prompt [17] on a multi-view diffusion model [37]. We use
+2D user-provided masks to blend edited and original views
+during the denoising. However, due to the inherent ambigu-
+ity caused by projection from 3D to 2D, we cannot obtain
+ideal intended regions in 2D regardless of the granularity of
+the masks, as shown in Fig. 2.
+Figure 2. Ambiguous intended regions. The intended region to
+be edited is clear in 3D (e.g., the cat tail). However, after projecting
+to 2D, regardless of the granularity of the user-provided masks, the
+editing will either alter some unintended regions (e.g., the robot
+cat) or be too limited for reasonable editing.
+In particular, the masks are often too rough to precisely
+capture the intended semantic editing regions. The masks
+are either too coarse so the unintended
+regions will be changed, or too fine-grained to allow rea-
+sonable editing. Without additional spatial information in
+3D, multi-view editing approaches cannot fully address this
+challenge. Therefore, simply adopting a feed-forward re-
+construction method [50] to convert edited multi-views into
+a 3D mesh often leads to undesirable results.
+To tackle this issue, we propose using the original 3D in-
+put and 3D segmentation. We first detect the intended edit-
+ing region using Grounding DINO [23] and SAM 2 [34]
+with the user mask and prompts. This gives an initial 2D
+segmentation that we subsequently lift to 3D. For this pur-
+pose, we use color coding in a multi-view to 3D reconstruc-
+tion pipeline, named GTR [50], to end up with a 3D seg-
+mentation that we can use during merging. Specifically, we
+paint 2D segmentations in a specific color (e.g., green), and
+after reconstruction, we can detect which regions are edited
+by querying the color in the 3D field.
+Then, we perform merging to maintain the original parts
+of the shape. We use 3D masks to detect edited/replaced
+parts in GTR’s [50] voxel-feature space. We extract the
+edited part from the new shape and replace it with the old
+part from the original shape. That way, we guarantee that
+the rest of the shape remains identical. We apply a final
+average blending operation so that new parts and the origi-
+nal shape blend smoothly.
+In summary, we make the following contributions:
+• We propose a novel method for diffusion-based 3D ob-
+ject editing that is faster than previous work and enables
+precise, interactive editing.
+• The proposed method consists of user-guided and multi-
+view-synchronized editing and a feed-forward 3D recon-
+struction, enabling fast editing in a feed-forward manner.
+• To enable precise editing only to intended regions, we
+propose a voxel-based 3D segmentation method that uti-
+lizes multi-view segmentation information and propa-
+gates it to 3D, followed by an average blending operation
+to merge the edited and original objects.
+• Our editing method has superior quality to previous work.
+We show significant improvements in GPTEval3D [44],
+directional CLIP metrics, and extensive user studies.
+(Figure 3 graphic: Input 3D → Rendered Multi-View and User-Provided Mask → Multi-View Editing (“chicken in a racing car” → “chicken happy ginger cat ...”) → Grounding + SAM 2 → 3D Reconstruction (Original and Edited Voxel Features) → Merging & Rendering → Edited 3D.)
+Figure 3. Overview of PrEditor3D. Given an input 3D object, we first render its multi-view images from 4 orthogonal views. We then
+obtain editing input from the user, describing in text as well as rough 2D masks the desired edits. We perform synchronized multi-view
+editing based on the text prompts as well as the user-provided masks (Sec. 3.1). Due to the rough masks and the unclear intended regions
+caused by ambiguous 3D-2D projection, we detect the intended regions with Grounding DINO and SAM 2 (Sec. 3.2), where the segmentation
+results are lifted to 3D for the final merging operation (Sec. 3.3).
+2. Related Works
+2D Editing applies global or local modifications to an im-
+age based on user instructions.
+It has gained significant
+attention for enabling a more interactive user experience
+in content creation. To achieve this, existing methods ei-
+ther fine-tune text-to-image models with specialized in-
+structional editing datasets [45, 49], or use training-free ap-
+proaches with inpainting [3, 29] or cross-attention mech-
+anisms [6, 7, 17]. To edit user-provided images, various
+diffusion inversion techniques have been developed. For
+example, SDEdit [25] introduces noise to images to capture
+intermediate steps in the diffusion process, null-text inver-
+sion [27] inverts the deterministic DDIM inversion [39], and
+recent works [4, 20] use DDPM inversion [18] to enhance
+editing capabilities.
+In this work, one step to achieve 3D editing is to per-
+form 2D multi-view image editing. We apply the DDPM
+inversion [18] and Prompt2Prompt [17] to operate within a
+diffusion-based sparse multi-view generation model.
+3D Reconstruction from Sparse Multi-Views refers to the
+task of reconstructing a 3D instance from a limited number
+of multi-view images. To achieve this, Score Distillation
+Sampling (SDS) [30] and its variants [10, 22, 32, 41, 42]
+optimize a 3D scene representation by reconstructing the
+given sparse views and generating novel views through gra-
+dients from large-scale pre-trained text-to-image diffusion
+models [35]. However, these approaches are often com-
+putationally demanding. Beyond per-scene optimization,
+PixelNeRF [48] generates a NeRF representation [26] of
+a scene from sparse-view images by training the model
+across multiple scenes. With recent advancements in large-
+scale 3D datasets such as Objaverse [12], follow-up works
+[19, 21, 24, 28, 40, 43, 46, 50] have improved this feed-
+forward 3D reconstruction approach for object-centric 3D
+assets. In this work, we utilize one recent method, GTR
+[50], to quickly generate 3D meshes from 4 views.
+3D Editing presents additional challenges compared to 2D
+editing due to the need to maintain spatial consistency in
+3D. To address this, some methods train 3D editing models
+using paired 3D datasets [1, 45]; however, these approaches
+are limited by a lack of diverse and complex datasets. Other
+methods adapt 2D editing models, such as InstructPix2Pix
+[5], for the 3D domain [16, 31], by iteratively updating
+multi-view images of a scene. However, without synchro-
+nized multi-view updates, this approach often results in
+flickering or inconsistent views. Alternatively, some meth-
+ods propose a generation-reconstruction loop to modify 3D
+representations using intermediate denoised images [8] or
+SDS gradients [9, 11, 13, 36] in the diffusion process. While
+these methods can achieve 3D consistency, they often strug-
+gle with quality or suffer from high computational costs. In
+our work, we perform multi-view editing and reconstruct
+the edited object in 3D. Beyond that, to preserve unchanged
+regions, we carefully design an approach by detecting the
+edited 3D regions and integrating the intended 3D edited
+regions into the original shapes.
+3. Method
+We aim to achieve fast 3D asset editing in a training-free
+manner, allowing for precise and user-guided edits. Our
+approach achieves this through multi-view image editing
+in 2D, followed by lifting the 2D edits into 3D. This pro-
+cess can be summarized into 3 main steps: (1) synchronized
+sparse multi-view editing in 2D (Sec. 3.1), (2) detecting in-
+tended editing regions across 2D views through the Ground-
+ing DINO [23] and SAM 2 [34] approach (Sec. 3.2), and (3)
+lifting the intended editing regions to 3D and merging the
+edited shape into the original (Sec. 3.3). Our approach is
+illustrated in Fig. 3.
+3.1. Synchronized Sparse Multi-View Editing
+To edit a given 3D object O, we leverage the power of
+2D editing through multi-view diffusion models. We first
+perform synchronized sparse multi-view edits using a pre-
+trained multi-view diffusion model. In practice, we use MV-
+Dream [37], which generates 4 orthogonal views based on a
+text prompt. Note that our approach remains agnostic to the
+specific diffusion-based multi-view generation model used.
+We first render multi-view images from O, then apply
+the DDPM diffusion inversion mechanism [20] to revert the
+multi-view images to their initial noise vectors, denoted as
+xT , where T is the number of diffusion timesteps during
+the denoising process. We will use these vectors xT as the
+initial latent vectors for the editing process in diffusion. We
+denote the text prompt for the input shape and the edited
+shape as yi and ye, respectively. Also, to enable better edit-
+ing control and interaction, we further take user-provided
+masks in 4 views as input. These masks indicate target re-
+gions for the edit, denoted as Muser ∈ R4×H×W , where H
+and W are the height and width of the images. Note that
+our method does not require precise and accurate masks.
+To edit multi-view images, we apply the Prompt-to-
+Prompt [17] approach on the multi-view diffusion model.
+For simplicity, we present the basic operation at each diffu-
+sion timestep and within each attention block. To be spe-
+cific, Prompt-to-Prompt [17] generates an edited latent vec-
+tor x′e by replacing the self- and cross-attention weights of
+the original latent vector, xi, with the edited prompt ye. To
+confine modifications to the desired regions, we blend the
+latent vectors x′e and xi using user-provided masks Muser.
+We denote the final edited latent vector at each step as xe:
+xe ← Muser · x′e + (1 − Muser) · xi.    (1)
+In practice, we downsample the user-provided masks Muser
+to match the feature resolution at each model layer, and the
+edited latent vector, xe, serves as the input to the next model
+layer.
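+As an illustration, Eq. (1) reduces to a per-element linear interpolation between the edited and original latents. The sketch below is ours, not the authors' code; it assumes the latents and the downsampled mask are flattened arrays of equal length.
+```js
+// Illustrative sketch of Eq. (1): mask-guided blending of latent vectors.
+// `editedLatent`, `originalLatent` and `mask` are flattened Float32Arrays of
+// equal length; `mask` holds values in [0, 1], already downsampled to the
+// feature resolution of the current layer.
+function blendLatents (editedLatent, originalLatent, mask) {
+  const blended = new Float32Array(editedLatent.length)
+  for (let i = 0; i < blended.length; i++) {
+    blended[i] = mask[i] * editedLatent[i] + (1 - mask[i]) * originalLatent[i]
+  }
+  return blended
+}
+```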
+Using the inverted noise vector xT ensures that the
+edited results align with the original texture style. How-
+ever, the masks Muser are often imprecise or misaligned
+with the target regions, which can still affect regions that are
+not intended to be altered. Furthermore, occlusions along
+the depth dimension introduce additional challenges in ac-
+curately and semantically localizing the intended editing re-
+gions in 3D. We illustrate the issue in Fig. 2. To address this,
+we apply an automated grounding approach that detects the
+intended editing region in both 2D and 3D.
+Algorithm 1: Merging Voxel Features
+Input: Vi, Ve ∈ R^(A×A×A×F); Mi, Me ∈ R^(A×A×A); d ∈ N; θ ∈ [0, 1]
+Output: Vblend ∈ R^(A×A×A×F)
+1 Vi[Mi] ← ∅
+2 Vi[Me] ← Ve[Me]
+3 N ← Dilation(Me, d)
+4 K ← Me ⊕ N // ⊕ means XOR
+5 Vblend ← Vi
+6 Vblend[K] ← θ ⊙ Vi[K] + (1 − θ) ⊙ Ve[K]
+7 return Vblend
+3.2. Detection of Intended Editing Regions in 2D
+The intended editing regions refer to the specific semantic
+areas that correspond to the editing prompt. For example, in
+Fig. 1, changing “chicken” to “cat” implies that the desired
+editing region pertains only to the areas representing the
+“chicken” and “cat” concepts. Regions (both 2D and 3D)
+that are not semantically related to these concepts should
+remain unchanged.
+To ensure that only intended editing regions are edited
+while allowing rough user-provided masks, we propose to
+detect the intended editing regions by applying Grounding
+DINO [23] and SAM 2 [34] to both the original and the
+edited multi-view images. To begin with, we identify the
+changing concept by comparing the original prompt yi and
+the editing prompt ye. We then localize the changing con-
+cept within the user-provided mask regions Muser and ob-
+tain corresponding bounding boxes in multiple views. For-
+mally, we write the procedure as
+bboxx ← Grounding(x, Muser, yi, ye),    (2)
+where bboxx are the bounding boxes for the changing con-
+cept in the multi-view images x ∈ {xi, xe}.1
+Afterward, we segment and track the changing ele-
+ments across views using the grounding bounding boxes
+bboxx.
+Formally, this procedure can be described as
+SAM(x, bboxx). This process yields the intended editing
+regions in segmentation format for both the original and
+edited multi-views.
+3.3. Lifting and Merging Edits in 3D
+Finally, we lift the 2D edits to 3D and merge the 3D-edited
+regions into the original shape.
+3D Segmentation by Lifting 2D Segmentations. We mark
+the intended editing regions on multi-view images using
+a green color.
+The color-coded multi-view images are
+then reconstructed in 3D using an offline 3D reconstruction
+1We use x to represent both latent vectors and images, without distin-
+guishing between the two.
+(Figure 4 graphic: columns Input, Tailor3D [31], MVEdit [8], Vox-E [36], Ours; example edits include “chicken dog”, “clay pizza”, “Oreo pizza”, “castle skyscraper”, and “mustache red pepper”.)
+Figure 4. Qualitative comparison. Our method performs diverse edits and only alters the intended regions.
+model [50]. This model takes multiple views as input and
+represents the shape as a triplane feature. Two separate de-
+coders—one for Signed Distance Function (SDF) and one
+for color—generate a geometry field and a color field, re-
+spectively. Through this process, we create a 3D segmen-
+tation field from the color-coded multi-view images. For
+each 3D position within the 3D space represented by the
+triplane, we determine whether the position lies within the
+3D intended editing regions based on its color value. That
+is, we apply a decision threshold for the distance between
+each color value and the preset green color to identify tar-
+geted regions. This produces two 3D masks, indicating the
+intended editing regions in 3D for both the original shape
+and the edited shape, denoted as Mi and Me, respectively.
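+As a concrete illustration, the color-to-mask step can be written as a simple threshold on the distance to the marker color. The sketch below is ours; the pure-green marker color follows the description above, while the threshold value is an arbitrary placeholder.
+```js
+// Sketch: derive a 3D editing mask from the decoded colour field.
+// `colors` is a flat Float32Array of length A*A*A*3 with RGB values in [0, 1];
+// the threshold is an illustrative value, not taken from the paper.
+function colorFieldToMask (colors, threshold = 0.3) {
+  const numVoxels = colors.length / 3
+  const mask = new Uint8Array(numVoxels)
+  for (let v = 0; v < numVoxels; v++) {
+    const r = colors[3 * v]
+    const g = colors[3 * v + 1]
+    const b = colors[3 * v + 2]
+    // Euclidean distance to the preset green marker colour (0, 1, 0).
+    const dist = Math.hypot(r, g - 1, b)
+    mask[v] = dist < threshold ? 1 : 0
+  }
+  return mask
+}
+```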
+Merging Edits in 3D. As aforementioned, coarse user-
+provided masks and occlusion issues in 2D projections of-
+ten result in unwanted alterations, compromising the preser-
+vation of unaffected 3D regions. Therefore, the directly re-
+constructed shapes from the edited multi-view images us-
+ing the reconstruction model [50] cannot be the final editing
+output, as illustrated in Fig. 7.
+Using the 3D reconstruction model [50], we extract
+voxel features for both original and edited shapes by inter-
+polating their triplane features, denoted as Vi and Ve for
+the original and edited shapes, respectively. Both Vi, Ve ∈
+RA×A×A×F , where A is the voxel resolution, and F is the
+feature dimension, A = 256 and F = 40 in practice.
+To merge 3D features, we first nullify the original spe-
+cific regions Mi from the original voxel feature Vi, and
+then replace the target edited regions Me with edited fea-
+ture Ve[Me]. We write the above operations as follows:
+Vi[Mi] ← ∅, and Vi[Me] ← Ve[Me].    (3)
+We refer to this approach as a naive copy-paste method.
+While theoretically plausible, we observe that this straight-
+forward approach typically introduces discontinuities at the
+(Figure 5 graphic: edit pairs such as “skull holding a sword” → “... sword viking axe”, “tomato with fork behind its head” → “... fork worm ...”, “cowboy holding a pistol” → “.. with a robot arm ...”, “... robot dog tail” → “robot cat with robot tail”, “hogwarts castle with main tower in the middle” → “main batman tower ...”, “chicken in a racing car” → “chicken cat with a tail in a racing car”, “sofa with no pillow” → “no red pillow”, “pink circular monster” → “... with sunglasses”, “sculpture holding a stone basket” → “... stone fruit”.)
+Figure 5. More editing results from PrEditor3D. Our method can perform a wide range of editing on various shapes.
+Method | Prompt Algn. | 3D Plausibility | Texture | Overall
+Tailor3D [31] | 98% | 99% | 99% | 99%
+MVEdit [8] | 57% | 55% | 55% | 57%
+Vox-E [36] | 53% | 68% | 50% | 55%
+Table 1. Comparison using GPTEval3D [44]. Scores indicate
+the percentage of our method being selected over baselines.
+Method | Prompt Algn. | Visual Quality | Preserving Shape
+Tailor3D [31] | 96% | 97% | 99%
+MVEdit [8] | 68% | 68% | 97%
+Vox-E [36] | 78% | 88% | 89%
+Table 2. User study results comparing our method against
+baselines. The percentage shows the preference for our method.
+3D editing boundaries, as shown in Fig. 7. To address this,
+we propose an averaged merging approach that provides a
+more robust blend of 3D features. In the improved method,
+we dilate the 3D mask Me by a dilation d and then use an
+exclusive or (a.k.a. XOR) operation to select the boundary
+mask regions for smooth blending. Next, we linearly inter-
+polate the two voxel features Vi and Ve within the bound-
+ary regions using a coefficient θ, in practice θ = 0.5. We
+illustrate the merging process in Alg. 1. After merging, we
+generate a textured mesh from the blended voxel feature,
+using the decoders in the 3D reconstruction model [50].
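+The merging procedure of Alg. 1 can be sketched with plain arrays as below. This is our own illustrative rendering, not the released implementation: voxel features are flat arrays of shape A·A·A·F, masks are binary arrays of shape A·A·A, and the dilation radius d and 6-connected neighbourhood are our choices.
+```js
+// Illustrative sketch of Algorithm 1 (merging voxel features).
+// Vi, Ve: flat Float32Arrays of length A*A*A*F (original / edited features).
+// Mi, Me: flat Uint8Arrays of length A*A*A (original / edited 3D masks).
+function dilate (mask, A, d) {
+  let current = mask
+  for (let step = 0; step < d; step++) {
+    const next = Uint8Array.from(current)
+    for (let x = 0; x < A; x++) {
+      for (let y = 0; y < A; y++) {
+        for (let z = 0; z < A; z++) {
+          if (!current[(x * A + y) * A + z]) continue
+          // Mark the 6-connected neighbours of every occupied voxel.
+          if (x > 0) next[((x - 1) * A + y) * A + z] = 1
+          if (x < A - 1) next[((x + 1) * A + y) * A + z] = 1
+          if (y > 0) next[(x * A + y - 1) * A + z] = 1
+          if (y < A - 1) next[(x * A + y + 1) * A + z] = 1
+          if (z > 0) next[(x * A + y) * A + z - 1] = 1
+          if (z < A - 1) next[(x * A + y) * A + z + 1] = 1
+        }
+      }
+    }
+    current = next
+  }
+  return current
+}
+
+function mergeVoxelFeatures (Vi, Ve, Mi, Me, A, F, d = 2, theta = 0.5) {
+  const dilated = dilate(Me, A, d)
+  const Vblend = new Float32Array(Vi.length)
+  for (let v = 0; v < A * A * A; v++) {
+    const boundary = dilated[v] !== Me[v] // K = Me XOR Dilation(Me, d)
+    for (let f = 0; f < F; f++) {
+      const idx = v * F + f
+      let value = Mi[v] ? 0 : Vi[idx]      // Vi[Mi] <- empty
+      if (Me[v]) value = Ve[idx]           // Vi[Me] <- Ve[Me]
+      if (boundary) value = theta * value + (1 - theta) * Ve[idx] // average blend
+      Vblend[idx] = value
+    }
+  }
+  return Vblend
+}
+```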
+4. Experiments
+4.1. Evaluation
+Our evaluation dataset contains 18 unique shapes and 40
+editing prompts. We use shapes from GSO [14] and Obja-
+verse [12]. We evaluate our method based on the quality of
+the editing and consistency with the input shapes.
+We use the GPTEval3D [44] metric to evaluate the
+quality of edited shapes and their alignment with the text
+prompts. GPT-4V is provided with multi-view renderings of
+two methods at a time, and instructed to pick one based on
+text-prompt alignment, 3D plausibility, and texture details.
+There were 120 total questions, each answered 3 times.
+Since the GPTEval3D metric does not consider the in-
+put shape, it cannot measure whether the shape remained
+intact, and whether the edited shape is consistent with the
+high-level style of the input shape. Therefore, we adopt
+the directional CLIP [33] score metric, CLIPdir from pre-
+vious works [15, 36]. CLIPdir evaluates the average dif-
+ference between text feature change direction and image
+feature change direction, where images are multi-view ren-
+derings of input and edited shapes. To ensure our evalua-
+tion is not affected by a particular implementation of this
+metric, we introduce three variants. CLIPdir-cos replaces the
+inner product of the image and text direction vectors with cosine distance,
+while CLIPdir-avg and CLIPdir-avg-cos compute the same met-
+rics by averaging image vectors first rather than scores.
+We finally introduce two additional directional metrics.
+Method | CLIPdir ↑ | CLIPdir-cos ↑ | CLIPdir-avg ↑ | CLIPdir-avg-cos ↑ | CLIPdiff-edit ↓ | CLIPdiff-noedit ↓
+Tailor3D [31] | 1.416 | 3.710 | 1.417 | 4.593 | 12.050 | 5.619
+MVEdit [8] | 0.782 | 2.578 | 0.783 | 3.164 | 10.014 | 4.118
+Vox-E [36] | 1.622 | 6.178 | 1.621 | 8.153 | 10.528 | 3.432
+PrEditor3D (Ours) | 1.782 | 7.679 | 1.782 | 11.554 | 8.812 | 2.636
+Table 3. Directional CLIP score metrics [36] for evaluating editing fidelity and prompt consistency. Our method outperforms baselines
+across all directional CLIP metrics. Metrics are scaled by 100 to ease reading and allow for more precision.
+Method | Editing | Merging | Total
+Tailor3D [31] | 26 sec | - | 26 sec
+MVEdit [8] | 6 min | - | 6 min
+Vox-E [36] | 60 min | 15 min | 75 min
+PrEditor3D (Ours) | 24 sec | 50 sec | 74 sec
+Table 4. Runtime comparison. We measure the runtime of our
+method and the baseline methods.
+CLIPdiff-edit is the CLIP score difference between input and
+output image-text pairs concerning only the edited part of
+the input and output text prompts. CLIPdiff-noedit is the CLIP
+score difference between input and output using a fixed text
+prompt where the edited part of the input text is replaced
+with a generic word, i.e., “object.” These metrics enforce
+that the CLIP text matching scores are preserved between
+the input and edited shapes, both for edited and unedited
+parts of the text.
+We report all directional metrics multiplied by 100 for
+higher precision. We refer readers to our supplementary
+material for further details about the evaluation metrics.
+4.2. Results
+Our method can generate various impressive edited shapes
+from complex input shapes and prompts. We illustrate a
+variety of our results in Fig. 5. Our approach flexibly edits
+various different elements of the 3D objects, for instance re-
+placing a “sword” of a skull warrior with a “viking axe,” re-
+sulting in coherent, seamless edits in both texture and geom-
+etry. Our edits also follow the structure of the input shape
+when applicable; for instance, when replacing a curvy fork
+with a worm, the worm maintains the same curved structure
+as the initial fork. We can also insert new objects, such as
+a “pillow” or “sunglasses.” Our method even enables both
+replacement and addition at the same time as in “cat with a
+tail” example, replacing the chicken with a cat and simulta-
+neously placing a tail at the back of the car.
+Given the same prompt, our method can generate differ-
+ent results with different seeds. In Fig. 6, we show gen-
+erations of “cat” and “dog” samples with various seeds.
+The resulting shapes vary in their head, eye, and ear struc-
+tures with different colors and sizes. We can further con-
+trol the various aspects of the generated shape through user
+prompts. This is also shown in Fig. 6, where we adjust the
+mood and appearance of the generated shape.
+(Figure 6 graphic: seed variations for “chicken cat in a racing car” and “chicken dog in a racing car”, plus prompt-controlled variants “happy ginger cat ...”, “sad ginger cat ...”, “ginger cat with sunglasses ...”.)
+Figure 6. Multiple generations and detailed control through
+prompt. Our method can generate different results for the same
+prompt using different seeds. Moreover, our method can handle
+detailed prompts that can modify various aspects of the shape such
+as appearance and mood. For instance, here we can define the type
+of the cat (e.g. ginger cat) and the mood (e.g., happy).
+Comparison with Baselines. Fig. 4 shows a qualitative
+comparison of our method against several state-of-the-art
+methods: Tailor3D [31], MVEdit [8] and Vox-E [36]. Our
+approach shows significant improvements, in both editing
+quality as well as consistency with the original shape.
+Similar to our method, Vox-E allows controllable edit-
+ing through merging but at the expense of an expensive
+SDS-based optimization that can tend towards more global
+changes than local ones.
+Since Tailor3D accepts edited
+front and back views as input, we ran their method using
+our multi-view editing results. Tab. 1 shows a comparison
+using GPTEval3D. While our method is consistently pre-
+ferred, improvements are not as large since this metric does
+not measure consistency with the input. A method could
+globally change the shape and still achieve better results.
+This is because this metric does not take input shape into ac-
+(Figure 7 graphic: columns Input, w/o merging, w/o avg merging, Ours.)
+Figure 7. Qualitative ablation of our merging algorithm. We
+can keep the original parts of the input fixed. Here when we insert
+a cat, the editing breaks the neighboring regions. Thanks to our
+merging algorithm, we can recover the original parts of the shape.
+Method | Chamfer Distance ↓
+w/o Merging | 2.95
+w/o Average Merging | 2.29
+Ours | 2.28
+Table 5. Quantitative ablation study of our merging algorithm.
+We calculate the chamfer distance to the input shape for each abla-
+tion. Chamfer Distance values are multiplied by 10³. Our algorithm
+is effective in keeping the edited shape consistent with the input.
+count and only considers edited output and the prompt. To
+complement this metric, we calculated the directional CLIP
+score and its few variants in Tab. 3; this considers consis-
+tency with the input, and demonstrates that our approach
+achieves significant improvements over the baselines.
+Perceptual Study. We prepare a perceptual study to com-
+pare our method with three other baselines asking users
+three different questions: “Select the one that follows the
+following prompt more closely”, “Select the one with better
+visual quality”, “Which example better preserves the parts
+that were not instructed to be edited with the prompt?”.
+There are 360 questions in total, each answered by 10 dif-
+ferent participants, totaling 3600 responses. Results are pre-
+sented in Tab. 2. In all questions, our method is preferred
+over the baselines.
+Runtime Analysis. Our method enables fast iteration, tak-
+ing around 24 seconds to obtain initial multi-view editing
+results. Merging then takes another 50 seconds to produce
+a final refined shape. Tab. 4 shows a comparison with base-
+lines, using a single RTX 3090 for measurements, except for
+MVDream, which we run on RTX A6000. Tailor3D [31],
+concurrent to our work, also operates fast, taking 2 seconds
+for a forward pass using our multi-view editing results as in-
+put (26 seconds in total). MVEdit [8] does not employ any
+merging, performing editing in around 6 minutes. Since
+Vox-E [36] involves a long SDS optimization process, its
+overall inference can take around an hour.
+4.3. Ablations
+Tab. 5 and Fig. 7 ablate our merging approach, measuring
+the chamfer distance between the edited and input shapes.
+Shape Preservation through Merging. Our merging al-
+gorithm ensures that only the regions described by the user
+through a mask and prompt change. In this ablation study,
+we only do multi-view editing, and leave out the merging
+operation. As shown in Fig. 7, without any merging, regions
+that are not intended by the user can change. In the “chicken
+in a racing car” example, when the user replaces the chicken
+with a cat, some part of the car is also altered since the user
+mask covers that area. In our merging step, we detect the
+changed region (“cat”) and erased region (“chicken”) so that
+we keep the rest of the shape (“car”) fixed.
+Average Merging. After edited regions are detected, we
+merge the voxel grids of the input and edited reconstructions
+to preserve consistency with the input. In contrast, Vox-
+E [36] uses copy-pasting for merging. That is, they copy
+the detected part from the edited shape and paste it into the
+original shape. However, a simple copy-paste approach can
+create boundary artifacts such as gaps between the edited
+region and the original shape, as shown in Fig. 7. To fix
+these boundary problems, we dilate the masks. Within the
+dilated region, we take the average of the edited shape and
+the original shape, which provides a smoother transition.
+Limitations.
+Although our method can generate high-
+quality editing results, we are limited by the 256x256 reso-
+lution of the multi-view diffusion model, MVDream [37].
+In addition, our method currently focuses on 3D assets
+that can be rendered from four inward-facing views. How-
+ever, this assumption cannot effectively capture large-scale
+scenes, such as indoor rooms where more views within the
+scene are needed.
+5. Conclusion
+We propose a fast and controllable 3D editing method
+that can handle a wide variety of 3D shapes and editing
+prompts.
+We employ the editing strength of powerful
+multi-view models, lift edits to 3D, and merge edits in 3D
+in order to ensure unedited regions remain consistent with
+the input shape. Hence, our method produces high-quality
+editing results with fast runtime speeds. We believe this
+shows a significant potential for high-quality, controllable,
+seamless, and fast 3D editing.
+Acknowledgments This work is partially done during
+Ziya’s and Can’s internships at Snap. Matthias Nießner was
+supported by the ERC Starting Grant Scan2CAD (804724)
+and Angela Dai was supported by the ERC Starting Grant
+SpatialSem (101076253).
+References
+[1] Panos Achlioptas, Ian Huang, Minhyuk Sung, Sergey
+Tulyakov, and Leonidas Guibas.
+Shapetalk: A language
+dataset and framework for 3d shape edits and deformations.
+In CVPR, 2023. 3
+[2] Antonio Alliegro, Yawar Siddiqui, Tatiana Tommasi, and
+Matthias Nießner. Polydiff: Generating 3d polygonal meshes
+with diffusion models.
+arXiv preprint arXiv:2312.11417,
+2023. 2
+[3] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended
+diffusion for text-driven editing of natural images. In CVPR,
+pages 18208–18218, 2022. 3
+[4] Manuel Brack, Felix Friedrich, Katharia Kornmeier, Linoy
+Tsaban,
+Patrick Schramowski,
+Kristian Kersting,
+and
+Apolinário Passos. Ledits++: Limitless image editing using
+text-to-image models. In CVPR, 2024. 3
+[5] Tim Brooks, Aleksander Holynski, and Alexei A Efros. In-
+structpix2pix: Learning to follow image editing instructions.
+In CVPR, 2023. 3
+[6] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xi-
+aohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mu-
+tual self-attention control for consistent image synthesis and
+editing. In ICCV, 2023. 3
+[7] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and
+Daniel Cohen-Or.
+Attend-and-excite: Attention-based se-
+mantic guidance for text-to-image diffusion models. ACM
+TOG, 2023. 3
+[8] Hansheng Chen, Ruoxi Shi, Yulin Liu, Bokui Shen, Ji-
+ayuan Gu, Gordon Wetzstein, Hao Su, and Leonidas Guibas.
+Generic 3d diffusion adapter using controlled multi-view
+editing. arXiv preprint arXiv:2403.12032, 2024. 2, 3, 5,
+6, 7, 8
+[9] Minghao Chen, Junyu Xie, Iro Laina, and Andrea Vedaldi.
+Shap-editor: Instruction-guided latent 3d editing in seconds.
+In CVPR, 2024. 2, 3
+[10] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fan-
+tasia3d: Disentangling geometry and appearance for high-
+quality text-to-3d content creation. In ICCV, 2023. 3
+[11] Dale Decatur, Itai Lang, Kfir Aberman, and Rana Hanocka.
+3d paintbrush: Local stylization of 3d shapes with cascaded
+score distillation. In CVPR, 2024. 3
+[12] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs,
+Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana
+Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse:
+A universe of annotated 3d objects. In CVPR, 2023. 3, 6
+[13] Shaocong Dong, Lihe Ding, Zhanpeng Huang, Zibin Wang,
+Tianfan Xue, and Dan Xu. Interactive3d: Create what you
+want by interactive 3d generation. In CVPR, 2024. 2, 3
+[14] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kin-
+man, Ryan Hickman, Krista Reymann, Thomas B McHugh,
+and Vincent Vanhoucke. Google scanned objects: A high-
+quality dataset of 3d scanned household items.
+In ICRA,
+2022. 6
+[15] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano,
+Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-
+guided domain adaptation of image generators. ACM TOG,
+2022. 6, 1
+[16] Ayaan Haque, Matthew Tancik, Alexei A Efros, Aleksander
+Holynski, and Angjoo Kanazawa. Instruct-nerf2nerf: Edit-
+ing 3d scenes with instructions. In ICCV, 2023. 2, 3
+[17] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman,
+Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image
+editing with cross attention control. In ICLR, 2022. 2, 3, 4
+[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu-
+sion probabilistic models. In NeurIPS, 2020. 3
+[19] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou,
+Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao
+Tan. Lrm: Large reconstruction model for single image to
+3d. In ICLR, 2024. 3
+[20] Inbar Huberman-Spiegelglas, Vladimir Kulikov, and Tomer
+Michaeli. An edit friendly ddpm noise space: Inversion and
+manipulations. In CVPR, 2024. 2, 3, 4, 1
+[21] Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun
+Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg
+Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with
+sparse-view generation and large reconstruction model. In
+ICLR, 2024. 3
+[22] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa,
+Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler,
+Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution
+text-to-3d content creation. In CVPR, 2023. 2, 3
+[23] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao
+Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang,
+Hang Su, et al.
+Grounding dino:
+Marrying dino with
+grounded pre-training for open-set object detection. arXiv
+preprint arXiv:2303.05499, 2023. 2, 4, 1
+[24] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu,
+Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang,
+Marc Habermann, Christian Theobalt, et al. Wonder3d: Sin-
+gle image to 3d using cross-domain diffusion.
+In CVPR,
+2023. 3
+[25] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia-
+jun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided
+image synthesis and editing with stochastic differential equa-
+tions. In ICLR, 2022. 3
+[26] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik,
+Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf:
+Representing scenes as neural radiance fields for view syn-
+thesis. In ECCV, 2020. 3
+[27] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and
+Daniel Cohen-Or. Null-text inversion for editing real images
+using guided diffusion models. In CVPR, 2023. 3
+[28] Bharath Raj Nagoor Kani, Hsin-Ying Lee, Sergey Tulyakov,
+and Shubham Tulsiani. Upfusion: Novel view diffusion from
+unposed sparse view observations. In ECCV, 2025. 3
+[29] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav
+Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and
+Mark Chen. Glide: Towards photorealistic image generation
+and editing with text-guided diffusion models. arXiv preprint
+arXiv:2112.10741, 2021. 3
+[30] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden-
+hall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR,
+2023. 3
+[31] Zhangyang Qi, Yunhan Yang, Mengchen Zhang, Long Xing,
+Xiaoyang Wu, Tong Wu, Dahua Lin, Xihui Liu, Jiaqi Wang,
+and Hengshuang Zhao. Tailor3d: Customized 3d assets edit-
+ing and generation with dual-side images.
+arXiv preprint
+arXiv:2407.06191, 2024. 2, 3, 5, 6, 7, 8
+[32] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren,
+Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Sko-
+rokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123:
+One image to high-quality 3d object generation using both
+2d and 3d diffusion priors. In ICLR, 2024. 3
+[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
+Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
+Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn-
+ing transferable visual models from natural language super-
+vision. In ICML, 2021. 6, 1
+[34] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang
+Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman
+Rädle, Chloe Rolland, Laura Gustafson, et al.
+Sam 2:
+Segment anything in images and videos.
+arXiv preprint
+arXiv:2408.00714, 2024. 2, 4, 1
+[35] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
+Patrick Esser, and Björn Ommer. High-resolution image syn-
+thesis with latent diffusion models. In CVPR, 2022. 3
+[36] Etai Sella, Gal Fiebelman, Peter Hedman, and Hadar
+Averbuch-Elor. Vox-e: Text-guided voxel editing of 3d ob-
+jects. In ICCV, 2023. 2, 3, 5, 6, 7, 8, 1
+[37] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li,
+and Xiao Yang. Mvdream: Multi-view diffusion for 3d gen-
+eration. In ICLR, 2024. 2, 4, 8
+[38] Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana
+Tommasi, Daniele Sirigatti, Vladislav Rosov, Angela Dai,
+and Matthias Nießner. Meshgpt: Generating triangle meshes
+with decoder-only transformers. In CVPR, 2024. 2
+[39] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois-
+ing diffusion implicit models. In ICLR, 2020. 3
+[40] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang,
+Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian
+model for high-resolution 3d content creation. arXiv preprint
+arXiv:2402.05054, 2024. 3
+[41] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh,
+and Greg Shakhnarovich. Score jacobian chaining: Lifting
+pretrained 2d diffusion models for 3d generation. In CVPR,
+2023. 3
+[42] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan
+Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and
+diverse text-to-3d generation with variational score distilla-
+tion. In NeurIPS, 2023. 3
+[43] Zhengyi Wang, Yikai Wang, Yifei Chen, Chendong Xi-
+ang, Shuo Chen, Dajiang Yu, Chongxuan Li, Hang Su,
+and Jun Zhu.
+Crm: Single image to 3d textured mesh
+with convolutional reconstruction model.
+arXiv preprint
+arXiv:2403.05034, 2024. 3
+[44] Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu,
+Leonidas Guibas, Dahua Lin, and Gordon Wetzstein. Gpt-4v
+(ision) is a human-aligned evaluator for text-to-3d genera-
+tion. In CVPR, 2024. 2, 6
+[45] Jiale Xu, Xintao Wang, Yan-Pei Cao, Weihao Cheng, Ying
+Shan, and Shenghua Gao.
+Instructp2p: Learning to edit
+3d point clouds with text instructions.
+arXiv preprint
+arXiv:2306.07154, 2023. 3
+[46] Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Ji-
+ahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein,
+Zexiang Xu, et al. Dmv3d: Denoising multi-view diffusion
+using 3d large reconstruction model. In ICLR, 2024. 3
+[47] Taoran Yi, Jiemin Fang, Junjie Wang, Guanjun Wu, Lingxi
+Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang
+Wang. Gaussiandreamer: Fast generation from text to 3d
+gaussians by bridging 2d and 3d diffusion models. In CVPR,
+2024. 2
+[48] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa.
+pixelNeRF: Neural radiance fields from one or few images.
+In CVPR, 2021. 3
+[49] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su.
+Magicbrush: A manually annotated dataset for instruction-
+guided image editing. In NeurIPS, 2024. 3
+[50] Peiye Zhuang, Songfang Han, Chaoyang Wang, Aliak-
+sandr Siarohin, Jiaxu Zou, Michael Vasilkovsky, Vladislav
+Shakhrai, Sergey Korolev, Sergey Tulyakov, and Hsin-
+Ying Lee. Gtr: Improving large 3d reconstruction models
+through geometry and texture refinement.
+arXiv preprint
+arXiv:2406.05649, 2024. 2, 3, 5, 6, 1
+6. Appendix
+We present additional details about PrEditor3D in this ap-
+pendix. We start by explaining some of the implementa-
+tion details in Sec. 6.1. In Sec. 6.2, we discuss automatic
+masking, an alternative to user-brushed masking. Sec. 6.3
+follows this discussion with the effect of mask granularity
+on the editing process. Finally, we explain the directional
+CLIP metrics we used for baseline comparison in Sec. 6.4.
+6.1. Implementation Details
+We used the official implementation and checkpoint of MV-
+Dream as our multi-view diffusion model. It has 256 x 256
+resolution and it can generate four views by default. In all of
+our generations, we set the classifier-free guidance scale of
+the diffusion process to 10. The official DDPM inversion [20]
+implementation only handles a single image, but we modified
+it to handle our four-view renderings. The inversion process
+takes 9 seconds on RTX 3090. With the inverted latents,
+we ran our inference for 41 steps, which takes around 12
+seconds on an RTX 3090. For the segmentation, we cal-
+culate bounding boxes using Grounding DINO [23] for all
+views and add these as constraints to SAM 2 [34] tracking.
+That is, to help SAM 2 with the segmentation, we constrain
+each frame separately. For merging and reconstruction, we
+modify GTR [50], which is a feed-forward reconstruction
+model. GTR mainly operates on triplanes but just before re-
+construction, those triplanes are converted into a voxel grid.
+We manipulated the voxel grid it generated to merge two
+different shapes.
+6.2. Automatic Masking
+In addition to user-brushed masks, we can also gener-
+ate and operate on automatically generated masks. Even
+though they limit the editing region compared to
+user-brushed masks, they can be practically used as a start-
+ing point for user-brushed masking.
+We leverage our segmentation approach to replace masks
+given by the user. We use an input prompt from the user
+to detect the target region using Grounding DINO [23] and
+SAM 2 [34]. This segmentation method gives us a mask
+restricted only to the sword. As a result, the generation pro-
+cess cannot go beyond that region. However, when we ac-
+cept input from user masks, the user can explicitly show their
+intention with the mask and can generate a "viking axe", as
+shown in Fig. 9.
+We want to reiterate that although the user-brushed
+masks are too coarse and not 3D-consistent, our method can
+generate impressive results without modifying the original
+parts of the shape. That is, a quickly drawn mask is enough
+for our method to work.
+ +0
+ ++10
+ ++20
+ ++30
+ +-10
+ + + + + + + + + + + + + + + + + + + + +Mask
+Edited
+Shape
+Dilation
+Figure 8. Different granularity of masking. Too fine-grained
+masks can over-constrain the generation process since they only
+point to the region to be replaced but do not include the user’s in-
+tention. More dilation increases flexibility but can also edit more
+regions than intended (e.g., the region underneath the cat). Nega-
+tive dilation means erosion.
+6.3. Mask Granularity
+We experimented with different granularity levels for the
+input masks. We started with a mask that we detected au-
+tomatically using Grounding DINO [23] and SAM 2 [34].
+As shown in Fig. 8, if we use the original segmentation,
+then the generation is restricted to that region and
+the model has no room to add "cat" features. That is,
+it tries to follow the shape of the original chicken. As we
+add more dilation, it tries to add features like cat ears. This
+shows the trade-off between loyalty to input and flexibility.
+Based on this observation, we gave coarse masks as input
+and allowed the model to edit flexibly. Thanks to our merg-
+ing approach, we could still combine the edited region with
+the original shape to keep the rest intact.
+6.4. Directional CLIP Metrics
+In Sec. 4.1-4.2, we discuss directional CLIP score met-
+rics [15, 33, 36] to evaluate 3D editing fidelity, to comple-
+ment other quantitative metrics that measure the quality of
+the output shape. We report directional CLIP scores of dif-
+ferent methods in Tab. 3 of the main paper. In this section,
+we formally define and discuss the reported metrics.
+\mathrm{CLIP}_{\text{dir}} = \frac{1}{N} \sum_{i=1}^{N} \left\langle F^i_{IE} - F^i_{II},\; F_{TE} - F_{TI} \right\rangle, \tag{4}
+where ⟨·, ·⟩ refers to an inner product, F^i_{IE} and F^i_{II} are
+the normalized CLIP image embeddings over rendered im-
+ages of the edited and input shapes, indexed by i, and F_{TE}, F_{TI}
+are the corresponding normalized text embeddings of edited
+and input prompts. i indexes a particular frame, while N
+is the total number of rendered frames. In our directional
+CLIP evaluations, we use N = 70 views rendered over
+a 360◦ trajectory, significantly larger than the four input
+views we use for our method and the baseline methods.
+(Figure 9 panels: Automatically Generated Mask vs. User-Brushed Mask.)
+Figure 9. Comparing automatically generated mask to user-
+generated mask. Users may want to do specific editing such as
+replacing the “sword” with “a viking axe”. If we only rely on
+automatic masking, the result may not follow the user’s intention
+since the automatically generated mask can limit the editing to a
+certain region. However, when we rely on explicit masking, we
+can get the specific shape requested by the user.
+We also introduce additional metrics inspired by CLIPdir,
+that aim to fix some of its problems. First, we define
+\mathrm{CLIP}_{\text{dir-cos}} = \frac{1}{N} \sum_{i=1}^{N} C\left(F^i_{IE} - F^i_{II},\; F_{TE} - F_{TI}\right), \tag{5}
+where C(·, ·) is the cosine distance.
+We also introduce two modified versions of these met-
+rics, namely
+\mathrm{CLIP}_{\text{dir-avg}} = \left\langle \frac{1}{N} \sum_{i=1}^{N} \left(F^i_{IE} - F^i_{II}\right),\; F_{TE} - F_{TI} \right\rangle \tag{6}
+\mathrm{CLIP}_{\text{dir-avg-cos}} = C\left( \frac{1}{N} \sum_{i=1}^{N} \left(F^i_{IE} - F^i_{II}\right),\; F_{TE} - F_{TI} \right) \tag{7}
+that compute the same metrics over the average image em-
+beddings instead of averaging scores to ensure further ro-
+bustness.
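+Given precomputed, normalized CLIP embeddings, Eqs. (4)-(7) amount to a few vector operations. The following sketch is ours, for illustration only: it computes CLIPdir and CLIPdir-cos over per-view embeddings; the averaged variants follow by summing the per-view image directions before taking the product. The helper names are not from the paper.
+```js
+// Sketch of the directional CLIP scores (Eqs. 4-5), assuming per-view image
+// embeddings and the two text embeddings are already computed and normalized.
+const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0)
+const sub = (a, b) => a.map((v, i) => v - b[i])
+const cosine = (a, b) => dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)))
+
+// imgEdited / imgInput: arrays of N embedding vectors (one per rendered view).
+// txtEdited / txtInput: single embedding vectors for the two prompts.
+function clipDir (imgEdited, imgInput, txtEdited, txtInput, useCosine = false) {
+  const textDir = sub(txtEdited, txtInput)
+  let total = 0
+  for (let i = 0; i < imgEdited.length; i++) {
+    const imageDir = sub(imgEdited[i], imgInput[i])
+    total += useCosine ? cosine(imageDir, textDir) : dot(imageDir, textDir)
+  }
+  return total / imgEdited.length // CLIPdir (inner product) or CLIPdir-cos
+}
+```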
+We also propose two similarity change error metrics,
+CLIPdiff-edit and CLIPdiff-noedit:
+\mathrm{CLIP}_{\text{diff-edit}} = \frac{1}{N} \sum_{i=1}^{N} \left| C(F^i_{II}, F_{TW}) - C(F^i_{IE}, F_{TW}) \right|_{\text{rel}} \tag{8}
+\mathrm{CLIP}_{\text{diff-noedit}} = \frac{1}{N} \sum_{i=1}^{N} \left| C(F^i_{II}, F_{TG}) - C(F^i_{IE}, F_{TG}) \right|_{\text{rel}}. \tag{9}
+Here, |x − y|_rel = |x − y| / max(x, y), F_{TW} is the text embed-
+ding of the edited word or phrase, and F_{TG} represents the
+"generic" text. For instance, when the prompt "a chicken
+riding a bike" becomes "cat riding a bike", F_{TW} embeds the
+text "cat" and F_{TG} embeds the text "object riding a bike".
+By measuring similarity differences of rendered images to
+F_{TW} and F_{TG}, we aim to measure the preservation of the
+object and context semantics, respectively.
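+As an illustration of Eqs. (8)-(9), these metrics compare how strongly each rendered view matches a reference text before and after editing. The sketch below is ours and self-contained; the function and helper names are not from the paper.
+```js
+// Sketch of CLIPdiff (Eqs. 8-9): average relative change of the CLIP similarity
+// between rendered views and a reference text embedding (the edited word for
+// CLIPdiff-edit, the "generic" prompt for CLIPdiff-noedit).
+const dotP = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0)
+const cos = (a, b) => dotP(a, b) / (Math.sqrt(dotP(a, a)) * Math.sqrt(dotP(b, b)))
+// Relative difference as defined above; assumes max(x, y) is non-zero.
+const relDiff = (x, y) => Math.abs(x - y) / Math.max(x, y)
+
+function clipDiff (imgInput, imgEdited, txtReference) {
+  let total = 0
+  for (let i = 0; i < imgInput.length; i++) {
+    total += relDiff(cos(imgInput[i], txtReference), cos(imgEdited[i], txtReference))
+  }
+  return total / imgInput.length
+}
+```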
+