diff --git a/packages/metascraper-readability/README.md b/packages/metascraper-readability/README.md index f9f6af968..c2d4ccfc8 100644 --- a/packages/metascraper-readability/README.md +++ b/packages/metascraper-readability/README.md @@ -14,6 +14,19 @@ $ npm install metascraper-readability --save ``` +## API + +### metascraper-readability([options]) + +#### options + +##### getDocument + +Type: `function`
+Default: [source code](https://github.com/microlinkhq/metascraper/blob/master/packages/metascraper-readability/src/index.js#L14-L20) + +The function called to parse the HTML into a DOM document. + ## License **metascraper-readability** © [Microlink](https://microlink.io), released under the [MIT](https://github.com/microlinkhq/metascraper/blob/master/LICENSE.md) License.
diff --git a/packages/metascraper-readability/benchmark/fixture.html b/packages/metascraper-readability/benchmark/fixture.html new file mode 100644 index 000000000..abf16abe3 --- /dev/null +++ b/packages/metascraper-readability/benchmark/fixture.html @@ -0,0 +1,1345 @@ + + +2412.06592 + +

PrEditor3D: Fast and Precise 3D Shape Editing


Ziya Erkoç1


Can Gümeli1


Chaoyang Wang2


Matthias Nießner1


Angela Dai1


Peter Wonka2,3


Hsin-Ying Lee2


Peiye Zhuang2


1Technical University of Munich


2Snap Inc


3King Abdullah University of Science and Technology



+ +








+ +

“chicken in a racing car”


“viking with a


mustache, helmet”


“... helmet party hat


“… red pepper


mustache …”


“… racing car jeep


chicken cat …”

+ + +



in a wooden cart”




flamethrower …”

+ +



trebuchet …”

+ + +

“truck carrying clay”


“… clay pizza






1 min


4 min


1 hr










+ + + + + +

“chicken in a


racing car”




cat with a tail …”

+ + + + + + + + +

chicken dog

+ + + + +



android chicken


wearing a tie

+ + +

“house in a forest”


house treasure


cave …”


Figure 1. PrEditor3D is a (top) fast and high-quality editing method that can perform precise and consistent editing only in the intended


regions, keeping the rest identical. (mid) It can handle diverse editing prompts with any given 3D object. (bottom) Furthermore, it can


support iterative editing, facilitating artistic workflow, and can also support editing multiple regions in a single run.




We propose a training-free approach to 3D editing that


enables the editing of a single shape within a few minutes.


The edited 3D mesh aligns well with the prompts, and re-


mains identical for regions that are not intended to be al-


tered. To this end, we first project the 3D object onto 4-view


images and perform synchronized multi-view image edit-


ing along with user-guided text prompts and user-provided


rough masks. However, the targeted regions to be edited


are ambiguous due to projection from 3D to 2D. To en-


sure precise editing only in intended regions, we develop


a 3D segmentation pipeline that detects edited areas in 3D


space, followed by a merging algorithm to seamlessly in-


tegrate edited 3D regions with the original input. Exten-


sive experiments demonstrate the superiority of our method


over previous approaches, enabling fast, high-quality edit-


ing while preserving unintended regions.


arXiv:2412.06592v1 [cs.CV] 9 Dec 2024


1. Introduction


Recent 3D diffusion models can generate high-quality as-


sets that closely align with the text prompts in the form of


neural fields [22, 37], meshes [2, 38], or Gaussian point


clouds [47]. Although these methods generate impressive


results, they lack the essential capability for precise and


controllable editing of the generated outputs, a critical re-


quirement for iterative artistic workflows. Effective 3D edit-


ing demands: (1) it should be fast enough to provide quick


feedback, ideally comparable to fast 3D generation algo-


rithms, and (2) it must allow for precise local control, en-


abling users to keep specific parts of the model unchanged.


Enabling precise and controllable editing is still an open


challenge. Several initial approaches have been proposed to


tackle the challenge of 3D editing [8, 9, 13, 16, 31, 36], pro-


viding promising results but suffering from slow runtime,


lack of precise control, and/or lack of 3D consistency and


quality. Optimization-based techniques like SDS [9, 13, 36]


or multiview training dataset updates [16] are computation-


ally expensive, making interactive editing out of reach.


Additionally, they offer limited control over specific


parts of the shape, as text prompts alone cannot precisely


localize regions to be edited [8, 9, 16, 31, 36]. While Vox-


E [36] and Shap-Editor [9] propose a mechanism to prevent


original parts of the shape from being altered during editing,


they do not enable precise editing due to having only text as


input. Finally, one can observe various visual quality prob-


lems, such as the Janus problem, blurring, over-saturation,


and overemphasizing texture changes while leaving the ge-


ometry intact or degrading.


To address these challenges, we propose a novel editing


pipeline for 3D assets that is faster, more precise, and de-


livers high-quality results (See Fig. 1). As our primary goal


is faster editing, we propose an editing framework lever-


aging a pipeline that consists of two components: a multi-


view diffusion algorithm and a feed-forward mesh recon-


struction. Multi-view diffusion models can leverage supe-


rior 2D editing techniques, and the feed-forward mesh re-


construction bridges the gap between 2D and 3D. For bet-


ter controllability, we extend multi-view image generation


to multi-view image editing using 2D masks to constrain


the edits to user-specified regions. The 2D masks can take


various forms, including manually selected regions, hand-


brushed areas, or automatically generated segmentations.


We adopt DDPM inversion [20] to extract initial noise vec-


tors from input multi-view images and execute Prompt-to-


Prompt [17] on a multi-view diffusion model [37]. We use


2D user-provided masks to blend edited and original views


during the denoising. However, due to the inherent ambigu-


ity caused by projection from 3D to 2D, we cannot obtain


ideal intended regions in 2D regardless of the granularity of


the masks, as shown in Fig. 2. However, the masks are often


too rough to precisely capture the intended semantic editing

+ + + + + + +





Limited editing regions


3D → 2D


User-Provided mask granularity


Alter unintended regions


Figure 2. Ambiguous intended regions. The intended region to


be edited is clear in 3D (e.g. the cat tail). However, after projecting


to 2D, regardless of the granularity of the user-provided masks, the


editing will either alter some unintended regions (e.g. the robot


cat) or be too limited for reasonable editing.


regions. The masks are either too coarse so the unintended


regions will be changed, or too fine-grained to allow rea-


sonable editing. Without additional spatial information in


3D, multi-view editing approaches cannot fully address this


challenge. Therefore, simply adopting a feed-forward re-


construction method [50] to convert edited multi-views into


a 3D mesh often leads to undesirable results.


To tackle this issue, we propose using the original 3D in-


put and 3D segmentation. We first detect the intended edit-


ing region using Grounding DINO [23] and SAM 2 [34]


with the user mask and prompts. This gives an initial 2D


segmentation that we subsequently lift to 3D. For this pur-


pose, we use color coding in a multi-view to 3D reconstruc-


tion pipeline, named GTR [50], to end up with a 3D seg-


mentation that we can use during merging. Specifically, we


paint 2D segmentations in a specific color, (e.g., green) and


after reconstruction, we can detect which regions are edited


by querying the color in the 3D field.


Then, we perform merging to maintain the original parts


of the shape. We use 3D masks to detect edited/replaced


parts in GTR’s [50] voxel-feature space. We extract the


edited part from the new shape and replace it with the old


part from the original shape. That way, we guarantee that


the remaining shape will remain identical. We apply a final


average blending operation so that new parts and the origi-


nal shape blend smoothly.


In summary, we make the following contributions:


• We propose a novel method for diffusion-based 3D ob-


ject editing that is faster than previous work and enables


precise, interactive editing.


• The proposed method consists of user-guided and multi-


view-synchronized editing and a feed-forward 3D recon-


struction, enabling fast editing in a feed-forward manner.


• To enable precise editing only to intended regions, we


propose a voxel-based 3D segmentation method that uti-


lizes multi-view segmentation information and propa-


gates it to 3D, followed by an average blending operation


to merge the edited and original objects.


• Our editing method has superior quality to previous work.


We show significant improvements in GPTEval3D [44],


directional CLIP metrics, and extensive user studies.

+ + + + + + + + + + + + + + +

Original Voxel Feature


Edited Voxel Feature


Edited Multi-View




happy ginger cat ...”




a racing car”

+ + + + + + + + + + + + + + + + + + + + + + +

Rendered Multi-View


User-Provided Mask

+ + +

Grounding + SAM 2

+ +

3D Reconstruction

+ +

Merging & Rendering


Edited 3D


Input 3D

+ +


+ +


+ +


+ +


+ +



Figure 3. Overview of PrEditor3D. Given an input 3D object, we first render its multi-view images from 4 orthogonal views. We then


obtain editing input from the user, describing in text as well as rough 2D masks the desired edits. We perform synchronized multi-view


editing based on the text prompts as well as the user-provided masks (Sec.3.1). Due to the rough masks and the unclear intended regions


caused by ambiguous 3D-2D projection, we detect the intended regions with Grounding Dino and SAM 2 (Sec.3.2), where the segmentation


results are lifted to 3D for the final merging operation (Sec.3.3).


2. Related Works


2D Editing applies global or local modifications to an im-


age based on user instructions.


It has gained significant


attention for enabling a more interactive user experience


in content creation. To achieve this, existing methods ei-


ther fine-tune text-to-image models with specialized in-


structional editing datasets [45, 49], or use training-free ap-


proaches with inpainting [3, 29] or cross-attention mech-


anisms [6, 7, 17]. To edit user-provided images, various


diffusion inversion techniques have been developed. For


example, SDEdit [25] introduces noise to images to capture


intermediate steps in the diffusion process, null-text inver-


sion [27] inverts the deterministic DDIM inversion [39], and


recent works [4, 20] use DDPM inversion [18] to enhance


editing capabilities.


In this work, one step to achieve 3D editing is to per-


form 2D multi-view image editing. We apply the DDPM


inversion [18] and Prompt2Prompt [17] to operate within a


diffusion-based sparse multi-view generation model.


3D Reconstruction from Sparse Multi-Views refers to the


task of reconstructing a 3D instance from a limited number


of multi-view images. To achieve this, Score Distillation


Sampling (SDS) [30] and its variants [10, 22, 32, 41, 42]


optimize a 3D scene representation by reconstructing the


given sparse views and generating novel views through gra-


dients from large-scale pre-trained text-to-image diffusion


models [35]. However, these approaches are often com-


putationally demanding. Beyond per-scene optimization,


PixelNeRF [48] generates a NeRF representation [26] of


a scene from sparse-view images by training the model


across multiple scenes. With recent advancements in large-


scale 3D datasets such as Objaverse [12], follow-up works


[19, 21, 24, 28, 40, 43, 46, 50] have improved this feed-


forward 3D reconstruction approach for object-centric 3D


assets. In this work, we utilize one recent method, GTR


[50], to quickly generate 3D meshes from 4 views.


3D Editing presents additional challenges compared to 2D


editing due to the need to maintain spatial consistency in


3D. To address this, some methods train 3D editing models


using paired 3D datasets [1, 45], however, these approaches


are limited by a lack of diverse and complex datasets. Other


methods adapt 2D editing models, such as InstructPix2Pix


[5], for the 3D domain [16, 31], by iteratively updating


multi-view images of a scene. However, without synchro-


nized multi-view updates, this approach often results in


flickering or inconsistent views. Alternatively, some meth-


ods propose a generation-reconstruction loop to modify 3D


representations using intermediate denoised images [8] or


SDS gradients [9, 11, 13, 36] in the diffusion process. While


these methods can achieve 3D consistency, they often strug-


gle with quality or suffer from high computational costs. In


our work, we perform multi-view editing and reconstruct


the edited object in 3D. Beyond that, to preserve unchanged


regions, we carefully design an approach by detecting the


edited 3D regions and integrating the intended 3D edited


regions into the original shapes.


3. Method


We aim to achieve fast 3D asset editing in a training-free


manner, allowing for precise and user-guided edits. Our


approach achieves this through multi-view image editing


in 2D, followed by lifting the 2D edits into 3D. This pro-


cess can be summarized into 3 main steps: (1) synchronized


sparse multi-view editing in 2D (Sec.3.1), (2) detecting in-


tended editing regions across 2D views through the Ground-


ing Dino [23] and SAM 2 [34] approach (Sec.3.2), and (3)


lifting the intended editing regions to 3D and merging the


edited shape into the original (Sec.3.3). Our approach is


illustrated in Fig. 3.


3.1. Synchronized Sparse Multi-View Editing


To edit a given 3D object O, we leverage the power of


2D editing through multi-view diffusion models. We first


perform synchronized sparse multi-view edits using a pre-


trained multi-view diffusion model. In practice, we use MV-


Dream [37], which generates 4 orthogonal views based on a


text prompt. Note that our approach remains agnostic to the


specific diffusion-based multi-view generation model used.


We first render multi-view images from O, then apply


the DDPM diffusion inversion mechanism [20] to revert the


multi-view images to their initial noise vectors, denoted as


xT , where T is the number of diffusion timesteps during


the denoising process. We will use these vectors xT as the


initial latent vectors for the editing process in diffusion. We


denote the text prompt for the input shape and the edited


shape as yi and ye, respectively. Also, to enable better edit-


ing control and interaction, we further take user-provided


masks in 4 views as input. These masks indicate target regions


for the edit, denoted as $M_{user} \in \mathbb{R}^{4 \times H \times W}$, where $H$


and $W$ are the height and width of the images. Note that


our method does not require precise and accurate masks.


To edit multi-view images, we apply the Prompt-to-


Prompt [17] approach on the multi-view diffusion model.


For simplicity, we present the basic operation at each diffusion timestep and within each attention block. To be specific, Prompt-to-Prompt [17] generates an edited latent vector $x_e^{*}$ by replacing the self- and cross-attention weights of the original latent vector, $x_i$, with the edited prompt $y_e$. To confine modifications to the desired regions, we blend the latent vectors $x_e^{*}$ and $x_i$ using the user-provided masks $M_{user}$. We denote the final edited latent vector at each step as $x_e$:

$$x_e = M_{user} \cdot x_e^{*} + (1 - M_{user}) \cdot x_i.$$

In practice, we downsample the user-provided masks $M_{user}$ to match the feature resolution at each model layer, and the edited latent vector, $x_e$, serves as the input to the next model layer.




Using the inverted noise vector xT ensures that the


edited results align with the original texture style. How-


ever, the masks Muser are often imprecise or misaligned


with the target regions, which can still affect regions that are


not intended to be altered. Furthermore, occlusions along


the depth dimension introduce additional challenges in ac-


curately and semantically localizing the intended editing re-


gions in 3D. We illustrate the issue in Fig. 2. To address this,


we apply an automated grounding approach that detects the


intended editing region in both 2D and 3D.
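The mask-guided blend of edited and original latents described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: latents and masks are plain nested lists, the helper names `blend_latents` and `downsample_mask` are hypothetical, and nearest-neighbour downsampling is assumed because the text does not say which resampling scheme is used.

```python
def blend_latents(x_edit, x_orig, mask):
    """Blend element-wise: mask * x_edit + (1 - mask) * x_orig."""
    return [
        [m * e + (1.0 - m) * o for e, o, m in zip(row_e, row_o, row_m)]
        for row_e, row_o, row_m in zip(x_edit, x_orig, mask)
    ]


def downsample_mask(mask, factor):
    """Nearest-neighbour downsampling (an assumption): keep every factor-th pixel."""
    return [row[::factor] for row in mask[::factor]]


# Toy 2x2 "latents": the original is all zeros, the edited one is all ones.
x_orig = [[0.0, 0.0], [0.0, 0.0]]
x_edit = [[1.0, 1.0], [1.0, 1.0]]

# A 4x4 user mask marking the top-left quadrant as editable.
user_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

small_mask = downsample_mask(user_mask, 2)  # 4x4 -> 2x2 to match the latent size
blended = blend_latents(x_edit, x_orig, small_mask)
```

Only the entry covered by the mask takes the edited value; every other entry keeps the original latent, which is what confines the edit to the user-specified region.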


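Algorithm 1: Merging Voxel Features

Input: $V_i, V_e \in \mathbb{R}^{A \times A \times A \times F}$; $M_i, M_e \in \mathbb{R}^{A \times A \times A}$; $d \in \mathbb{N}$; $\theta \in [0, 1]$

Output: $V_{blend} \in \mathbb{R}^{A \times A \times A \times F}$

1 $V_i[M_i] \leftarrow \emptyset$

2 $V_i[M_e] \leftarrow V_e[M_e]$

3 $N \leftarrow \text{Dilation}(M_e, d)$

4 $K \leftarrow M_e \oplus N$ // $\oplus$ means XOR

5 $V_{blend} \leftarrow V_i$

6 $V_{blend}[K] \leftarrow \theta \, V_i[K] + (1 - \theta) \, V_e[K]$

7 return $V_{blend}$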


3.2. Detection of Intended Editing Regions in 2D


The intended editing regions refer to the specific semantic


areas that correspond to the editing prompt. For example, in


Fig. 1, changing “chicken” to “cat” implies that the desired


editing region pertains only to the areas representing the


“chicken” and “cat” concepts. Regions (both 2D and 3D)


that are not semantically related to these concepts should


remain unchanged.


To ensure that only intended editing regions are edited


while allowing rough user-provided masks, we propose to


detect the intended editing regions by applying Grounding


Dino [23] and SAM 2 [34] to both the original and the


edited multi-view images. To begin with, we identify the


changing concept by comparing the original prompt yi and


the editing prompt ye. We then localize the changing con-


cept within the user-provided mask regions Muser and ob-


tain corresponding bounding boxes in multiple views. For-


mally, we write the procedure as


$$\text{bbox}_x = \text{Grounding}(x, M_{user}, y_i, y_e),$$

where $\text{bbox}_x$ are the bounding boxes for the changing concept in the multi-view images $x \in \{x_i, x_e\}$.¹


Afterward, we segment and track the changing ele-


ments across views using the grounding bounding boxes




Formally, this procedure can be described as


$\text{SAM}(x, \text{bbox}_x)$. This process yields the intended editing


regions in segmentation format for both the original and


edited multi-views.
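The text does not say how the changing concept is extracted from the two prompts. One simple possibility, sketched here purely as an assumption, is a word-level diff between the original and the edited prompt. The helper name `changed_concepts` is hypothetical; `difflib.SequenceMatcher` is the Python standard-library class.

```python
import difflib


def changed_concepts(prompt_in, prompt_out):
    """Return (removed_words, added_words) from a word-level diff of two prompts."""
    a, b = prompt_in.split(), prompt_out.split()
    removed, added = [], []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(a[i1:i2])  # words present only in the original prompt
        if tag in ("replace", "insert"):
            added.extend(b[j1:j2])    # words present only in the edited prompt
    return removed, added


removed, added = changed_concepts(
    "chicken in a racing car",
    "cat in a racing car",
)
```

In this sketch the removed words would be grounded in the original views and the added words in the edited views, mirroring the two-sided detection described above.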


3.3. Lifting and Merging Edits in 3D


Finally, we lift the 2D edits to 3D and merge the 3D-edited


regions into the original shape.


3D Segmentation by Lifting 2D Segmentations. We mark


the intended editing regions on multi-view images using


a green color.


The color-coded multi-view images are


then reconstructed in 3D using an offline 3D reconstruction


1We use x to represent both latent vectors and images, without distin-


guishing between the two.




Tailor3D [31]


MVEdit [8]


Vox-E [36]



+ + + + + + + + + + + + + + + + + + + + + + + + + +

chicken dog


clay pizza


Oreo pizza


castle skyscraper … mustache red pepper


Figure 4. Qualitative comparison. Our method can perform diverse editing samples and only edit the intended regions.


model [50]. This model takes multiple views as input and


represents the shape as a triplane feature. Two separate de-


coders—one for Signed Distance Function (SDF) and one


for color— generate a geometry field and a color field, re-


spectively. Through this process, we create a 3D segmen-


tation field from the color-coded multi-view images. For


each 3D position within the 3D space represented by the


triplane, we determine whether the position lies within the


3D intended editing regions based on its color value. That


is, we apply a decision threshold for the distance between


each color value and the preset green color to identify tar-


geted regions. This produces two 3D masks, indicating the


intended editing regions in 3D for both the original shape


and the edited shape, denoted as Mi and Me, respectively.
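The color-threshold test described above can be illustrated as follows. This is a sketch under assumptions: the color field is represented as a dictionary from voxel coordinates to RGB values in [0, 1], the distance is Euclidean, and the threshold value 0.3 is arbitrary. The helper name `segment_by_color` is hypothetical.

```python
import math

GREEN = (0.0, 1.0, 0.0)  # the preset marker color (assumed pure green)


def segment_by_color(color_field, target=GREEN, threshold=0.3):
    """Return the positions whose color lies within `threshold` of `target`."""
    return {
        pos
        for pos, rgb in color_field.items()
        if math.dist(rgb, target) < threshold
    }


# Toy color field queried at four 3D positions.
color_field = {
    (0, 0, 0): (0.05, 0.95, 0.02),  # close to green -> inside the edit region
    (0, 0, 1): (0.80, 0.10, 0.10),  # red-ish
    (0, 1, 0): (0.10, 0.85, 0.10),  # slightly washed-out green
    (1, 0, 0): (0.50, 0.50, 0.50),  # gray
}

mask = segment_by_color(color_field)
```

Running the same test on the reconstructions of the original and the edited color-coded views would yield the two 3D masks used in the merging step.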


Merging Edits in 3D. As aforementioned, coarse user-


provided masks and occlusion issues in 2D projections of-


ten result in unwanted alterations, compromising the preser-


vation of unaffected 3D regions. Therefore, the directly re-


constructed shapes from the edited multi-view images us-


ing the reconstruction model [50] cannot be the final editing


output, as illustrated in Fig. 7.


Using the 3D reconstruction model [50], we extract


voxel features for both original and edited shapes by inter-


polating their triplane features, denoted as Vi and Ve for


the original and edited shapes, respectively. Both $V_i, V_e \in \mathbb{R}^{A \times A \times A \times F}$, where $A$ is the voxel resolution and $F$ is the feature dimension; in practice, $A = 256$ and $F = 40$.


To merge 3D features, we first nullify the original spe-


cific regions Mi from the original voxel feature Vi, and


then replace the target edited regions Me with edited fea-


ture Ve[Me]. We write the above operations as follows:


$V_i[M_i] \leftarrow \emptyset$, and $V_i[M_e] \leftarrow V_e[M_e]$.




We refer to this approach as a naive copy-paste method.


While theoretically plausible, we observe that this straight-


forward approach typically introduces discontinuities at the

+ + +

“skull holding


a sword”


“… sword


viking axe”

+ + +

“tomato with fork


behind its head”


“… fork worm …”

+ + +



holding a pistol”


.. with a robot arm ...

+ + +

“… robot dog tail”


“robot cat with


robot tail”

+ + +

“hogwarts castle with


main tower in the middle”


“main batman


tower …”

+ + +

“chicken in a


racing car”


chicken cat with a


tail in a racing car”

+ + +

“sofa with no pillow”


no red pillow”

+ + +

“pink circular




“… with sunglasses

+ + +

“sculpture holding


a stone basket”


“… stone fruit


Figure 5. More editing results from PrEditor3D. Our method can perform a wide range of editing on various shapes.




Prompt Align.


3D Plausibility






Tailor3D [31]










MVEdit [8]










Vox-E [36]










Table 1. Comparison using GPTEval3D [44]. Scores indicate


the percentage of our method being selected over baselines.


Prompt Align.


Visual Quality


Preserving Shape


Tailor3D [31]








MVEdit [8]








Vox-E [36]








Table 2.


User study results comparing our method against


baselines. The percentage shows the preference for our method.


3D editing boundaries, as shown in Fig. 7. To address this,


we propose an averaged merging approach that provides a


more robust blend of 3D features. In the improved method,


we dilate the 3D mask Me by a dilation d and then use an


exclusive or (a.k.a. XOR) operation to select the boundary


mask regions for smooth blending. Next, we linearly inter-


polate the two voxel features Vi and Ve within the bound-


ary regions using a coefficient θ, in practice θ = 0.5. We


illustrate the merging process in Alg. 1. After merging, we


generate a textured mesh from the blended voxel feature,


using the decoders in the 3D reconstruction model [50].
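The averaged merging procedure can be sketched on a tiny grid. This is an illustrative sketch, not the authors' code, and it makes several assumptions: voxel features are scalars stored in a dictionary keyed by coordinates, masks are sets of coordinates, "nullifying" a voxel means setting its feature to 0.0, and dilation uses 6-connectivity. The names `dilate` and `merge_voxels` are hypothetical.

```python
from itertools import product

NEIGHBOR_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def dilate(mask, steps, size):
    """Grow a set of voxel coordinates by `steps` rounds of 6-connected dilation."""
    grown = set(mask)
    for _ in range(steps):
        ring = set()
        for x, y, z in grown:
            for dx, dy, dz in NEIGHBOR_OFFSETS:
                p = (x + dx, y + dy, z + dz)
                if all(0 <= c < size for c in p):  # stay inside the grid
                    ring.add(p)
        grown |= ring
    return grown


def merge_voxels(v_orig, v_edit, m_orig, m_edit, d, theta, size, empty=0.0):
    """Copy-paste the edited region, then blend a boundary band of width d."""
    v = dict(v_orig)                      # work on a copy of the original features
    for p in m_orig:
        v[p] = empty                      # nullify the region being replaced
    for p in m_edit:
        v[p] = v_edit[p]                  # paste the edited region
    boundary = dilate(m_edit, d, size) ^ set(m_edit)  # XOR keeps only the band
    blended = dict(v)
    for p in boundary:
        blended[p] = theta * v[p] + (1.0 - theta) * v_edit[p]
    return blended


coords = list(product(range(3), repeat=3))
v_orig = {p: 1.0 for p in coords}         # original features are all 1.0
v_edit = {p: 3.0 for p in coords}         # edited features are all 3.0
merged = merge_voxels(v_orig, v_edit, {(0, 0, 0)}, {(1, 1, 1)}, d=1, theta=0.5, size=3)
```

With `theta = 0.5`, voxels in the band around the pasted region end up halfway between the original and edited features, which is the smoothing effect the averaged merge is meant to provide.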


4. Experiments


4.1. Evaluation


Our evaluation dataset contains 18 unique shapes and 40


editing prompts. We use shapes from GSO [14] and Obja-


verse [12]. We evaluate our method based on the quality of


the editing and consistency with the input shapes.


We use the GPTEval3D [44] metric to evaluate the


quality of edited shapes and their alignment with the text


prompts. GPT-4V is provided with multi-view renderings of


two methods at a time, and instructed to pick one based on


text-prompt alignment, 3D plausibility, and texture details.


There were 120 total questions, each answered 3 times.


Since the GPTEval3D metric does not consider the in-


put shape, it cannot measure whether the shape remained


intact, and whether the edited shape is consistent with the


high-level style of the input shape. Therefore, we adopt


the directional CLIP [33] score metric, CLIPdir from pre-


vious works [15, 36]. CLIPdir evaluates the average dif-


ference between text feature change direction and image


feature change direction, where images are multi-view ren-


derings of input and edited shapes. To ensure our evalua-


tion is not affected by a particular implementation of this


metric, we introduce three variants. CLIPdir-cos replaces the


text-image direction vector difference with cosine distance,


while CLIPdir-avg and CLIPdir-avg-cos compute the same met-


rics by averaging image vectors first rather than scores.
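To make the difference between per-view averaging and averaging image vectors first concrete, here is a small sketch. The exact metric definitions are in the supplementary material, so everything below is an assumption: the embeddings are toy 2D vectors rather than CLIP features, the score is a cosine similarity between the text and image change directions, and the function names are hypothetical.

```python
import math


def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def dir_cos_per_view(text_in, text_out, views_in, views_out):
    """Score each view separately, then average the scores."""
    t = sub(text_out, text_in)
    scores = [cosine(sub(vo, vi), t) for vi, vo in zip(views_in, views_out)]
    return sum(scores) / len(scores)


def dir_cos_avg_first(text_in, text_out, views_in, views_out):
    """Average the image vectors over views first, then score once."""
    t = sub(text_out, text_in)
    return cosine(sub(mean_vector(views_out), mean_vector(views_in)), t)


text_in, text_out = [1.0, 0.0], [0.0, 1.0]
views_in = [[1.0, 0.0], [1.0, 0.0]]
views_out = [[0.0, 1.0], [1.0, 1.0]]

per_view = dir_cos_per_view(text_in, text_out, views_in, views_out)
avg_first = dir_cos_avg_first(text_in, text_out, views_in, views_out)
```

The two variants generally disagree, as they do on this toy input, which is one reason for reporting several of them instead of relying on a single implementation.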


We finally introduce two additional directional metrics.
















Tailor3D [31]














MVEdit [8]














Vox-E [36]














PrEditor3D (Ours)














Table 3. Directional CLIP score metrics [36] for evaluating editing fidelity and prompt consistency. Our method outperforms baselines


across all directional CLIP metrics. Metrics are scaled by 100 to ease reading and allow for more precision.










Method              Editing   Merging   Total
Tailor3D [31]       26 sec    -         26 sec
MVEdit [8]          6 min     -         6 min
Vox-E [36]          60 min    15 min    75 min
PrEditor3D (Ours)   24 sec    50 sec    74 sec


Table 4. Runtime comparison. We measure the runtime of our


baseline methods.


CLIPdiff-edit is the CLIP score difference between input and


output image-text pairs concerning only the edited part of


the input and output text prompts. CLIPdiff-noedit is the CLIP


score difference between input and output using a fixed text


prompt where the edited part of the input text is replaced


with a generic word, i.e., “object.” These metrics enforce


that the CLIP text matching scores are preserved between


the input and edited shapes, both for edited and unedited


parts of the text.


We report all directional metrics multiplied by 100 for


higher precision. We refer readers to our supplementary


material for further details about the evaluation metrics.


4.2. Results


Our method can generate various impressive edited shapes


from complex input shapes and prompts. We illustrate a


variety of our results in Fig. 5. Our approach flexibly edits


various different elements of the 3D objects, for instance re-


placing a “sword” of a skull warrior with a “viking axe,” re-


sulting in coherent, seamless edits in both texture and geom-


etry. Our edits also follow the structure of the input shape


when applicable; for instance, when replacing a curvy fork


with a worm, the worm maintains the same curved structure


as the initial fork. We can also insert new objects, such as


a “pillow” or “sunglasses.” Our method even enables both


replacement and addition at the same time, as in the “cat with a


tail” example, replacing the chicken with a cat and simultaneously


placing a tail at the back of the car.


Given the same prompt, our method can generate differ-


ent results with different seeds. In Fig. 6, we show gen-


erations of “cat” and “dog” samples with various seeds.


The resulting shapes vary in their head, eye, and ear struc-


tures with different colors and sizes. We can further con-


trol the various aspects of the generated shape through user


prompts. This is also shown in Fig. 6, where we adjust the


mood and appearance of the generated shape.

+ + + + + + +

“chicken cat in a racing car”


“chicken dog in a racing car”

+ + + +



ginger cat …”




ginger cat …”


ginger cat


with sunglasses …”


Figure 6. Multiple generations and detailed control through


prompt. Our method can generate different results for the same


prompt using different seeds. Moreover, our method can handle


detailed prompts that can modify various aspects of the shape such


as appearance and mood. For instance, here we can define the type


of the cat (e.g. ginger cat) and the mood (e.g., happy).


Comparison with Baselines. Fig. 4 shows a qualitative


comparison of our method against several state-of-the-art


methods: Tailor3D [31], MVEdit [8] and Vox-E [36]. Our


approach shows significant improvements, in both editing


quality as well as consistency with the original shape.


Similar to our method, Vox-E allows controllable edit-


ing through merging but at the expense of an expensive


SDS-based optimization that can tend towards more global


changes than local ones.


Since Tailor3D accepts edited


front and back views as input, we ran their method using


our multi-view editing results. Tab. 1 shows a comparison


using GPTEval3D. While our method is consistently pre-


ferred, improvements are not as large since this metric does


not measure consistency with the input. A method could


globally change the shape and still achieve better results.


This is because this metric does not take input shape into ac-

+ + + + + + + + +



w/o merging


w/o avg merging




Figure 7. Qualitative ablation of our merging algorithm. We


can keep the original parts of the input fixed. Here when we insert


a cat, the editing breaks the neighboring regions. Thanks to our


merging algorithm, we can recover the original parts of the shape.




Chamfer Distance


w/o Merging




w/o Average Merging








Table 5. Quantitative ablation study of our merging algorithm.


We calculate the chamfer distance to the input shape for each ablation.


The Chamfer Distance value is multiplied by $10^3$. Our algorithm


is effective in keeping the edited shape consistent with the input.


count and only considers edited output and the prompt. To


complement this metric, we calculated the directional CLIP


score and its few variants in Tab. 3; this considers consis-


tency with the input, and demonstrates that our approach


achieves significant improvements over the baselines.


Perceptual Study. We prepare a perceptual study to com-


pare our method with three other baselines asking users


three different questions: “Select the one that follows the


following prompt more closely”, “Select the one with better


visual quality”, “Which example better preserves the parts


that were not instructed to be edited with the prompt?”.


There are 360 questions in total, each answered by 10 dif-


ferent participants, totaling 3600 responses. Results are pre-


sented in Tab. 2. In all questions, our method is preferred


over the baselines.


Runtime Analysis. Our method enables fast iteration, tak-


ing around 24 seconds to obtain initial multi-view editing


results. Merging then takes another 50 seconds to produce


a final refined shape. Tab. 4 shows a comparison with base-


lines, using a single RTX 3090 for measurements, except for


MVDream, which we run on RTX A6000. Tailor3D [31],


concurrent to our work, also operates fast, taking 2 seconds


for a forward pass using our multi-view editing results as in-


put (26 seconds in total). MVEdit [8] does not employ any


merging, performing editing in around 6 minutes. Since


Vox-E [36] involves a long SDS optimization process, its


overall inference can take around an hour.


4.3. Ablations


Tab. 5 and Fig. 7 ablate our merging approach, measuring


the chamfer distance between the edited and input shapes.
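For readers unfamiliar with the metric, a brute-force Chamfer distance between two point sets can be written as below. Conventions differ between papers (squared or unsquared distances, sum or mean over the two directions), and the text does not state which one is used, so this sketch assumes the mean squared nearest-neighbour distance summed over both directions. The function name `chamfer_distance` is hypothetical.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance using mean squared nearest-neighbour distances."""

    def one_way(src, dst):
        # For every source point, take the squared distance to its closest target point.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


input_shape = [(0, 0, 0), (1, 0, 0)]
edited_shape = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]  # one extra point added by the edit

cd = chamfer_distance(input_shape, edited_shape)
cd_scaled = cd * 1e3  # scaled for readability, as is common when values are small
```

A lower value means the edited shape stays closer to the input, so a merge that preserves untouched regions should reduce this number relative to the unmerged reconstruction.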


Shape Preservation through Merging. Our merging al-


gorithm ensures that only the regions described by the user


through a mask and prompt change. In this ablation study,


we only do multi-view editing, and leave out the merging


operation. As shown in Fig. 7, without any merging, regions


that are not intended by the user can change. In the “chicken


in a racing car” example, when the user replaces the chicken


with a cat, some part of the car is also altered since the user


mask covers that area. In our merging step, we detect the


changed region (“cat”) and erased region (“chicken”) so that


we keep the rest of the shape (“car”) fixed.


Average Merging. After edited regions are detected, we


merge the voxel grids of the input and edited reconstructions


to preserve consistency with the input. In contrast, Vox-


E [36] uses copy-pasting for merging. That is, they copy


the detected part from the edited shape and paste it into the


original shape. However, a simple copy-paste approach can


create boundary artifacts such as gaps between the edited


region and the original shape, as shown in Fig. 7. To fix


these boundary problems, we dilate the masks. Within the


dilated region, we take the average of the edited shape and


the original shape, which provides a smoother transition.
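The averaging step described above can be sketched on a small dense voxel grid. This is a minimal sketch under stated assumptions, not the paper's implementation: grids are flat arrays of n·n·n scalar values, dilation uses a 6-connected neighbourhood, the edit is copied inside the detected mask, and only the band added by dilation is averaged. The names `voxelIndex`, `dilateOnce` and `mergeGrids` are hypothetical, and handling of the erased region is left out.

```javascript
// Flat index of voxel (x, y, z) in an n*n*n grid.
const voxelIndex = (x, y, z, n) => (z * n + y) * n + x

// One step of binary dilation with a 6-connected neighbourhood.
function dilateOnce (mask, n) {
  const out = mask.slice()
  const steps = [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]
  for (let z = 0; z < n; z++) {
    for (let y = 0; y < n; y++) {
      for (let x = 0; x < n; x++) {
        if (!mask[voxelIndex(x, y, z, n)]) continue
        for (const [dx, dy, dz] of steps) {
          const nx = x + dx
          const ny = y + dy
          const nz = z + dz
          if (nx < 0 || ny < 0 || nz < 0 || nx >= n || ny >= n || nz >= n) continue
          out[voxelIndex(nx, ny, nz, n)] = 1
        }
      }
    }
  }
  return out
}

// Merge the edited grid into the original one.
function mergeGrids (original, edited, editMask, n, dilationSteps = 1) {
  let dilated = editMask
  for (let s = 0; s < dilationSteps; s++) dilated = dilateOnce(dilated, n)
  return original.map((value, i) => {
    if (editMask[i]) return edited[i] // inside the detected edit: take the edit
    if (dilated[i]) return 0.5 * (value + edited[i]) // boundary band: average
    return value // everywhere else: keep the input shape
  })
}
```

Compared with a plain copy-paste, the averaged band gives neighbouring voxels intermediate values, which is the smoother transition the paragraph refers to.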




Limitations. Although our method can generate high-quality editing results, we are limited by the 256x256 resolution of the multi-view diffusion model, MVDream [37].


In addition, our method currently focuses on 3D assets


that can be rendered from four inward-facing views. How-


ever, this assumption cannot effectively capture large-scale


scenes, such as indoor rooms where more views within the


scene are needed.


5. Conclusion


We propose a fast and controllable 3D editing method


that can handle a wide variety of 3D shapes and editing




We employ the editing strength of powerful


multi-view models, lift edits to 3D, and merge edits in 3D


in order to ensure unedited regions remain consistent with


the input shape. Hence, our method produces high-quality


editing results with fast runtime speeds. We believe this


shows a significant potential for high-quality, controllable,


seamless, and fast 3D editing.


Acknowledgments. This work was partially done during


Ziya’s and Can’s internships at Snap. Matthias Nießner was


supported by the ERC Starting Grant Scan2CAD (804724)


and Angela Dai was supported by the ERC Starting Grant


SpatialSem (101076253).




References


[1] Panos Achlioptas, Ian Huang, Minhyuk Sung, Sergey Tulyakov, and Leonidas Guibas. Shapetalk: A language dataset and framework for 3d shape edits and deformations. In CVPR, 2023. 3


[2] Antonio Alliegro, Yawar Siddiqui, Tatiana Tommasi, and


Matthias Nießner. Polydiff: Generating 3d polygonal meshes


with diffusion models.


arXiv preprint arXiv:2312.11417,


2023. 2


[3] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended


diffusion for text-driven editing of natural images. In CVPR,


pages 18208–18218, 2022. 3


[4] Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, and Apolinário Passos. Ledits++: Limitless image editing using text-to-image models. In CVPR, 2024. 3


[5] Tim Brooks, Aleksander Holynski, and Alexei A Efros. In-


structpix2pix: Learning to follow image editing instructions.


In CVPR, 2023. 3


[6] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xi-


aohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mu-


tual self-attention control for consistent image synthesis and


editing. In ICCV, 2023. 3


[7] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and


Daniel Cohen-Or.


Attend-and-excite: Attention-based se-


mantic guidance for text-to-image diffusion models. ACM


TOG, 2023. 3


[8] Hansheng Chen, Ruoxi Shi, Yulin Liu, Bokui Shen, Ji-


ayuan Gu, Gordon Wetzstein, Hao Su, and Leonidas Guibas.


Generic 3d diffusion adapter using controlled multi-view


editing. arXiv preprint arXiv:2403.12032, 2024. 2, 3, 5,


6, 7, 8


[9] Minghao Chen, Junyu Xie, Iro Laina, and Andrea Vedaldi.


Shap-editor: Instruction-guided latent 3d editing in seconds.


In CVPR, 2024. 2, 3


[10] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fan-


tasia3d: Disentangling geometry and appearance for high-


quality text-to-3d content creation. In ICCV, 2023. 3


[11] Dale Decatur, Itai Lang, Kfir Aberman, and Rana Hanocka.


3d paintbrush: Local stylization of 3d shapes with cascaded


score distillation. In CVPR, 2024. 3


[12] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs,


Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana


Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse:


A universe of annotated 3d objects. In CVPR, 2023. 3, 6


[13] Shaocong Dong, Lihe Ding, Zhanpeng Huang, Zibin Wang,


Tianfan Xue, and Dan Xu. Interactive3d: Create what you


want by interactive 3d generation. In CVPR, 2024. 2, 3


[14] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kin-


man, Ryan Hickman, Krista Reymann, Thomas B McHugh,


and Vincent Vanhoucke. Google scanned objects: A high-


quality dataset of 3d scanned household items.




2022. 6


[15] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano,


Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-


guided domain adaptation of image generators. ACM TOG,


2022. 6, 1


[16] Ayaan Haque, Matthew Tancik, Alexei A Efros, Aleksander


Holynski, and Angjoo Kanazawa. Instruct-nerf2nerf: Edit-


ing 3d scenes with instructions. In ICCV, 2023. 2, 3


[17] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman,


Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image


editing with cross attention control. In ICLR, 2022. 2, 3, 4


[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu-


sion probabilistic models. In NeurIPS, 2020. 3


[19] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou,


Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao


Tan. Lrm: Large reconstruction model for single image to


3d. In ICLR, 2024. 3


[20] Inbar Huberman-Spiegelglas, Vladimir Kulikov, and Tomer


Michaeli. An edit friendly ddpm noise space: Inversion and


manipulations. In CVPR, 2024. 2, 3, 4, 1


[21] Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun


Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg


Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with


sparse-view generation and large reconstruction model. In


ICLR, 2024. 3


[22] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa,


Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler,


Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution


text-to-3d content creation. In CVPR, 2023. 2, 3


[23] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao


Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang,


Hang Su, et al.


Grounding dino:


Marrying dino with


grounded pre-training for open-set object detection. arXiv


preprint arXiv:2303.05499, 2023. 2, 4, 1


[24] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu,


Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang,


Marc Habermann, Christian Theobalt, et al. Wonder3d: Sin-


gle image to 3d using cross-domain diffusion.




2023. 3


[25] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia-


jun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided


image synthesis and editing with stochastic differential equa-


tions. In ICLR, 2022. 3


[26] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik,


Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf:


Representing scenes as neural radiance fields for view syn-


thesis. In ECCV, 2020. 3


[27] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and


Daniel Cohen-Or. Null-text inversion for editing real images


using guided diffusion models. In CVPR, 2023. 3


[28] Bharath Raj Nagoor Kani, Hsin-Ying Lee, Sergey Tulyakov,


and Shubham Tulsiani. Upfusion: Novel view diffusion from


unposed sparse view observations. In ECCV, 2025. 3


[29] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav


Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and


Mark Chen. Glide: Towards photorealistic image generation


and editing with text-guided diffusion models. arXiv preprint


arXiv:2112.10741, 2021. 3


[30] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden-


hall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR,


2023. 3


[31] Zhangyang Qi, Yunhan Yang, Mengchen Zhang, Long Xing,


Xiaoyang Wu, Tong Wu, Dahua Lin, Xihui Liu, Jiaqi Wang,


and Hengshuang Zhao. Tailor3d: Customized 3d assets edit-


ing and generation with dual-side images.


arXiv preprint


arXiv:2407.06191, 2024. 2, 3, 5, 6, 7, 8


[32] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren,


Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Sko-


rokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123:


One image to high-quality 3d object generation using both


2d and 3d diffusion priors. In ICLR, 2024. 3


[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya


Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,


Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn-


ing transferable visual models from natural language super-


vision. In ICML, 2021. 6, 1


[34] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. 2, 4, 1


[35] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 3


[36] Etai Sella, Gal Fiebelman, Peter Hedman, and Hadar


Averbuch-Elor. Vox-e: Text-guided voxel editing of 3d ob-


jects. In ICCV, 2023. 2, 3, 5, 6, 7, 8, 1


[37] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li,


and Xiao Yang. Mvdream: Multi-view diffusion for 3d gen-


eration. In ICLR, 2024. 2, 4, 8


[38] Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana


Tommasi, Daniele Sirigatti, Vladislav Rosov, Angela Dai,


and Matthias Nießner. Meshgpt: Generating triangle meshes


with decoder-only transformers. In CVPR, 2024. 2


[39] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois-


ing diffusion implicit models. In ICLR, 2020. 3


[40] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang,


Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian


model for high-resolution 3d content creation. arXiv preprint


arXiv:2402.05054, 2024. 3


[41] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh,


and Greg Shakhnarovich. Score jacobian chaining: Lifting


pretrained 2d diffusion models for 3d generation. In CVPR,


2023. 3


[42] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan


Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and


diverse text-to-3d generation with variational score distilla-


tion. In NeurIPS, 2023. 3


[43] Zhengyi Wang, Yikai Wang, Yifei Chen, Chendong Xi-


ang, Shuo Chen, Dajiang Yu, Chongxuan Li, Hang Su,


and Jun Zhu.


Crm: Single image to 3d textured mesh


with convolutional reconstruction model.


arXiv preprint


arXiv:2403.05034, 2024. 3


[44] Tong Wu, Guandao Yang, Zhibing Li, Kai Zhang, Ziwei Liu,


Leonidas Guibas, Dahua Lin, and Gordon Wetzstein. Gpt-4v


(ision) is a human-aligned evaluator for text-to-3d genera-


tion. In CVPR, 2024. 2, 6


[45] Jiale Xu, Xintao Wang, Yan-Pei Cao, Weihao Cheng, Ying


Shan, and Shenghua Gao.


Instructp2p: Learning to edit


3d point clouds with text instructions.


arXiv preprint


arXiv:2306.07154, 2023. 3


[46] Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Ji-


ahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein,


Zexiang Xu, et al. Dmv3d: Denoising multi-view diffusion


using 3d large reconstruction model. In ICLR, 2024. 3


[47] Taoran Yi, Jiemin Fang, Junjie Wang, Guanjun Wu, Lingxi


Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang


Wang. Gaussiandreamer: Fast generation from text to 3d


gaussians by bridging 2d and 3d diffusion models. In CVPR,


2024. 2


[48] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa.


pixelNeRF: Neural radiance fields from one or few images.


In CVPR, 2021. 3


[49] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su.


Magicbrush: A manually annotated dataset for instruction-


guided image editing. In NeurIPS, 2024. 3


[50] Peiye Zhuang, Songfang Han, Chaoyang Wang, Aliak-


sandr Siarohin, Jiaxu Zou, Michael Vasilkovsky, Vladislav


Shakhrai, Sergey Korolev, Sergey Tulyakov, and Hsin-


Ying Lee. Gtr: Improving large 3d reconstruction models


through geometry and texture refinement.


arXiv preprint


arXiv:2406.05649, 2024. 2, 3, 5, 6, 1


6. Appendix


We present additional details about PrEditor3D in this ap-


pendix. We start by explaining some of the implementa-


tion details in Sec. 6.1. In Sec. 6.2, we discuss automatic


masking, an alternative to user-brushed masking. Sec. 6.3


follows this discussion with the effect of mask granularity


on the editing process. Finally, we explain the directional


CLIP metrics we used for baseline comparison in Sec. 6.4.


6.1. Implementation Details


We used the official implementation and checkpoint of MV-


Dream as our multi-view diffusion model. It has 256 x 256


resolution and it can generate four views by default. In all of


our generations, we set the classifier-free guidance scale of


the diffusion process to 10. The official DDPM inversion [20] implementation handles only a single image, so we modified it to handle our four-view renderings. The inversion process takes 9 seconds on an RTX 3090. With the inverted latents,


we ran our inference for 41 steps, which takes around 12


seconds on an RTX 3090. For the segmentation, we cal-


culate bounding boxes using Grounding DINO [23] for all


views and add these as constraints to SAM 2 [34] tracking.


That is, to help SAM 2 with the segmentation, we constrain


each frame separately. For merging and reconstruction, we


modify GTR [50], which is a feed-forward reconstruction


model. GTR mainly operates on triplanes but just before re-


construction, those triplanes are converted into a voxel grid.


We manipulated the voxel grid it generated to merge two


different shapes.


6.2. Automatic Masking


In addition to user-brushed masks, we can also generate and operate on automatically generated masks. Although they restrict the editing region more than user-brushed masks do, they are practical as a starting point for user-brushed masking.


We leverage our segmentation approach to replace masks


given by the user. We use an input prompt from the user


to detect the target region using Grounding DINO [23] and


SAM 2 [34]. In the example of Fig. 9, this segmentation gives us a mask restricted to the sword, so the generation process cannot go beyond that region. However, when we accept user-brushed masks, the user can explicitly express their intention with the mask and generate a “viking axe”, as shown in Fig. 9.


We want to reiterate that although the user-brushed masks are coarse and not 3D-consistent, our method can


generate impressive results without modifying the original


parts of the shape. That is, a quickly drawn mask is enough


for our method to work.










Figure 8. Different granularity of masking. Too fine-grained


masks can over-constrain the generation process since they only


point to the region to be replaced but do not include the user’s in-


tention. More dilation increases flexibility but can also edit more


regions than intended (e.g., the region underneath the cat). Nega-


tive dilation means erosion.


6.3. Mask Granularity


We experimented with different granularity levels for the


input masks. We started with a mask that we detected au-


tomatically using Grounding DINO [23] and SAM 2 [34].


As shown in Fig. 8, if we use the original segmentation, the generation is restricted to that region and the model has no room to add “cat” features. That is, it tries to follow the shape of the original chicken. As we add more dilation, it tries to add features like cat ears. This shows the trade-off between fidelity to the input and flexibility.


Based on this observation, we gave coarse masks as input


and allowed the model to edit flexibly. Thanks to our merg-


ing approach, we could still combine the edited region with


the original shape to keep the rest intact.
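The signed dilation used for the granularity experiment can be sketched on a 2D binary mask. This is only an illustration: it assumes a 4-connected structuring element and treats pixels outside the image as background, neither of which the text specifies, and `morphOnce` and `adjustMask` are hypothetical names.

```javascript
// One morphological step on a 2D mask (array of rows of 0/1 values).
// grow = true dilates, grow = false erodes.
function morphOnce (mask, grow) {
  const height = mask.length
  const width = mask[0].length
  const offsets = [[0, 0], [1, 0], [-1, 0], [0, 1], [0, -1]]
  return mask.map((row, y) =>
    row.map((_, x) => {
      const values = offsets.map(([dx, dy]) => {
        const nx = x + dx
        const ny = y + dy
        // Pixels outside the image count as background.
        return ny >= 0 && ny < height && nx >= 0 && nx < width ? mask[ny][nx] : 0
      })
      return grow ? Math.max(...values) : Math.min(...values)
    })
  )
}

// Positive amounts dilate, negative amounts erode, zero returns the mask as is.
function adjustMask (mask, amount) {
  let out = mask
  for (let s = 0; s < Math.abs(amount); s++) out = morphOnce(out, amount > 0)
  return out
}
```

Calling `adjustMask(mask, 2)` would loosen an automatically detected mask by two pixels, while `adjustMask(mask, -1)` tightens it, matching the "negative dilation means erosion" convention of Fig. 8.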


6.4. Directional CLIP Metrics


In Sec. 4.1-4.2, we discuss directional CLIP score met-


rics [15, 33, 36] to evaluate 3D editing fidelity, to comple-


ment other quantitative metrics that measure the quality of


the output shape. We report directional CLIP scores of dif-


ferent methods in Tab. 3 of the main paper. In this section,


we formally define and discuss the reported metrics.


$$\mathrm{CLIP}_{dir} = \frac{1}{N} \sum_{i=1}^{N} \langle F^{i}_{IE} - F^{i}_{II},\; F_{TE} - F_{TI} \rangle,$$

where $\langle \cdot, \cdot \rangle$ refers to an inner product, $F^{i}_{IE}$ and $F^{i}_{II}$ are the normalized CLIP image embeddings of rendered images of the edited and input shapes, and $F_{TE}$, $F_{TI}$ are the corresponding normalized text embeddings of the edited and input prompts. $i$ indexes a particular frame, while $N$ is the total number of rendered frames. In our directional CLIP evaluations, we use $N = 70$ views rendered over a 360° trajectory, significantly more than the four input views we use for our method and the baseline methods.
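To make the definition concrete, here is a small sketch of the directional score as read from the text: the mean over views of the inner product between the change in image embedding (edited minus input) and the change in text embedding (edited minus input). Embeddings are plain number arrays assumed to be normalized already, and the helper names are hypothetical.

```javascript
const dot = (a, b) => a.reduce((sum, v, k) => sum + v * b[k], 0)
const subtract = (a, b) => a.map((v, k) => v - b[k])

// editedViews[i] and inputViews[i] are the image embeddings of view i.
function clipDir (editedViews, inputViews, editedText, inputText) {
  const textDirection = subtract(editedText, inputText)
  const scores = editedViews.map((edited, i) =>
    dot(subtract(edited, inputViews[i]), textDirection)
  )
  return scores.reduce((sum, v) => sum + v, 0) / scores.length
}
```

A view whose rendering does not change contributes zero, so the score rewards edits that move the image embedding in the same direction as the prompt change.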


[Figure 9 panels: Automatically Generated Mask; User-Brushed Mask]

Figure 9. Comparing automatically generated mask to user-


generated mask. Users may want to do specific editing such as


replacing the “sword” with “a viking axe”. If we only rely on


automatic masking, the result may not follow the user’s intention


since the automatically generated mask can limit the editing to a


certain region. However, when we rely on explicit masking, we


can get the specific shape requested by the user.


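We also introduce additional metrics that are inspired by CLIPdir but aim to fix some of its problems. First, we define

$$\mathrm{CLIP}_{dir\text{-}cos} = \frac{1}{N} \sum_{i=1}^{N} C(F^{i}_{IE} - F^{i}_{II},\; F_{TE} - F_{TI}),$$

where $C(\cdot, \cdot)$ is the cosine distance.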


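We also introduce two modified versions of these metrics, namely

$$\mathrm{CLIP}_{dir\text{-}avg} = \Big\langle \frac{1}{N} \sum_{i=1}^{N} (F^{i}_{IE} - F^{i}_{II}),\; F_{TE} - F_{TI} \Big\rangle,$$

$$\mathrm{CLIP}_{dir\text{-}avg\text{-}cos} = C\Big( \frac{1}{N} \sum_{i=1}^{N} (F^{i}_{IE} - F^{i}_{II}),\; F_{TE} - F_{TI} \Big),$$

that compute the same metrics over the average image embeddings instead of averaging scores to ensure further robustness.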
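The difference between averaging per-view scores and scoring the averaged embedding can be sketched for the cosine variants as follows. This assumes C is computed as cosine similarity between the two direction vectors (the text calls it cosine distance), and the function names are hypothetical.

```javascript
const dotProduct = (a, b) => a.reduce((sum, v, k) => sum + v * b[k], 0)
const difference = (a, b) => a.map((v, k) => v - b[k])
// Cosine similarity; undefined (NaN) if either vector is all zeros.
const cosine = (a, b) =>
  dotProduct(a, b) / Math.sqrt(dotProduct(a, a) * dotProduct(b, b))

// Mean of per-view cosine scores.
function clipDirCos (editedViews, inputViews, editedText, inputText) {
  const textDirection = difference(editedText, inputText)
  const scores = editedViews.map((edited, i) =>
    cosine(difference(edited, inputViews[i]), textDirection)
  )
  return scores.reduce((sum, v) => sum + v, 0) / scores.length
}

// Cosine score of the view-averaged image direction.
function clipDirAvgCos (editedViews, inputViews, editedText, inputText) {
  const textDirection = difference(editedText, inputText)
  const mean = new Array(textDirection.length).fill(0)
  editedViews.forEach((edited, i) => {
    difference(edited, inputViews[i]).forEach((v, k) => {
      mean[k] += v / editedViews.length
    })
  })
  return cosine(mean, textDirection)
}
```

The two functions generally return different values because the cosine normalizes each vector before comparing, so views with a very small image change weigh as much as any other view in `clipDirCos` but contribute little to the average used by `clipDirAvgCos`.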




We also propose two similarity change error metrics, CLIPdiff-edit and CLIPdiff-noedit:

$$\mathrm{CLIP}_{diff\text{-}edit} = \frac{1}{N} \sum_{i=1}^{N} \big| C(F^{i}_{II}, F_{TW}) - C(F^{i}_{IE}, F_{TW}) \big|_{rel},$$

$$\mathrm{CLIP}_{diff\text{-}noedit} = \frac{1}{N} \sum_{i=1}^{N} \big| C(F^{i}_{II}, F_{TG}) - C(F^{i}_{IE}, F_{TG}) \big|_{rel}.$$

Here, $|x - y|_{rel} = \frac{|x - y|}{\max(x, y)}$, $F_{TW}$ is the text embedding of the edited word or phrase, and $F_{TG}$ represents the “generic” text. For instance, when the prompt “a chicken riding a bike” becomes “cat riding a bike”, $F_{TW}$ embeds the text “cat” and $F_{TG}$ embeds the text “object riding a bike”. By measuring similarity differences of rendered images to $F_{TW}$ and $F_{TG}$, we aim to measure the preservation of the object and context semantics, respectively.
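The similarity change error can be sketched in the same style. Two assumptions are worth flagging: C is taken to be cosine similarity between an image embedding and a text embedding, and the relative difference is read as |x - y| / max(x, y), with the numerator inferred because it is missing from the extracted text. The names below are hypothetical.

```javascript
const inner = (a, b) => a.reduce((sum, v, k) => sum + v * b[k], 0)
const similarity = (a, b) => inner(a, b) / Math.sqrt(inner(a, a) * inner(b, b))

// Assumed form of the relative difference; ill-defined if max(x, y) <= 0.
const relativeDifference = (x, y) => Math.abs(x - y) / Math.max(x, y)

// Mean relative change in image-text similarity between input and edited views.
function clipDiff (inputViews, editedViews, textEmbedding) {
  const errors = inputViews.map((input, i) =>
    relativeDifference(
      similarity(input, textEmbedding),
      similarity(editedViews[i], textEmbedding)
    )
  )
  return errors.reduce((sum, v) => sum + v, 0) / errors.length
}
```

Passing the embedding of the edited word gives the edit variant and passing the embedding of the generic prompt gives the no-edit variant, so one function covers both metrics.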

\ No newline at end of file
diff --git a/packages/metascraper-readability/benchmark/index.js b/packages/metascraper-readability/benchmark/index.js
new file mode 100644
index 000000000..b550a59b3
--- /dev/null
+++ b/packages/metascraper-readability/benchmark/index.js
@@ -0,0 +1,42 @@
+'use strict'
+
+const { readFileSync } = require('fs')
+
+const url = 'https://arxiv.org/pdf/2412.06592'
+const html = readFileSync('./fixture.html', 'utf8')
+
+const jsdom = () => {
+  const { JSDOM, VirtualConsole } = require('jsdom')
+  const dom = new JSDOM(html, { url, virtualConsole: new VirtualConsole() })
+  return dom.window.document
+}
+
+const happydom = () => {
+  const { Window } = require('happy-dom')
+  const window = new Window({ url })
+  const document = window.document
+  document.documentElement.innerHTML = html
+  return document
+}
+
+const { Readability } = require('@mozilla/readability')
+
+const measure = fn => {
+  const now = Date.now()
+  const parsed = new Readability(fn()).parse()
+  return { parsed, duration: Date.now() - now }
+}
+
+const jsdomResult = measure(jsdom)
+const happydomResult = measure(happydom)
+
+const isEqual = (value1, value2) =>
+  JSON.stringify(value1) === JSON.stringify(value2)
+
+if (!isEqual(jsdomResult.parsed, happydomResult.parsed)) {
+  console.error('Results are different')
+  process.exit(1)
+}
+
+console.log(` jsdom: ${jsdomResult.duration}ms`)
+console.log(`happydom: ${happydomResult.duration}ms`)
diff --git a/packages/metascraper-readability/benchmark/package.json b/packages/metascraper-readability/benchmark/package.json
new file mode 100644
index 000000000..cdd63c1d4
--- /dev/null
+++ b/packages/metascraper-readability/benchmark/package.json
@@ -0,0 +1,9 @@
+{
+  "name": "@metascraper-readability/benchmark",
+  "private": true,
+  "version": "1.0.0",
+  "devDependencies": {
+    "dom-parser": "latest",
+    "happy-dom": "latest"
+  }
+}
diff --git a/packages/metascraper-readability/package.json b/packages/metascraper-readability/package.json
index 0da91c17e..9c1f29bc9 100644
--- a/packages/metascraper-readability/package.json
+++ b/packages/metascraper-readability/package.json
@@ -25,7 +25,7 @@
   "dependencies": {
     "@metascraper/helpers": "workspace:*",
     "@mozilla/readability": "~0.5.0",
-    "jsdom": "~25.0.1"
+    "happy-dom": "~16.5.3"
   },
   "devDependencies": {
     "ava": "5",
diff --git a/packages/metascraper-readability/src/index.js b/packages/metascraper-readability/src/index.js
index 1c39083c0..599c3b020 100644
--- a/packages/metascraper-readability/src/index.js
+++ b/packages/metascraper-readability/src/index.js
@@ -1,9 +1,7 @@
 'use strict'

 const { memoizeOne, composeRule } = require('@metascraper/helpers')
-
 const { Readability } = require('@mozilla/readability')
-const { JSDOM, VirtualConsole } = require('jsdom')

 const parseReader = reader => {
   try {
@@ -13,15 +11,25 @@ const parseReader = reader => {
   }
 }

-const readability = memoizeOne((url, html) => {
-  const dom = new JSDOM(html, { url, virtualConsole: new VirtualConsole() })
-  const reader = new Readability(dom.window.document)
-  return parseReader(reader)
-}, memoizeOne.EqualityFirstArgument)
+const defaultGetDocument = ({ url, html }) => {
+  const { Window } = require('happy-dom')
+  const window = new Window({ url })
+  const document = window.document
+  document.documentElement.innerHTML = html
+  return document
+}
+
+module.exports = ({ getDocument = defaultGetDocument } = {}) => {
+  const readability = memoizeOne((url, html, getDocument) => {
+    const document = getDocument({ url, html })
+    const reader = new Readability(document)
+    return parseReader(reader)
+  }, memoizeOne.EqualityFirstArgument)

-const getReadbility = composeRule(($, url) => readability(url, $.html()))
+  const getReadbility = composeRule(($, url) =>
+    readability(url, $.html(), getDocument)
+  )

-module.exports = () => {
   return {
     author: getReadbility({ from: 'byline', to: 'author' }),
     description: getReadbility({ from: 'excerpt', to: 'description' }),
diff --git a/packages/metascraper-readability/test/snapshots/index.js.md b/packages/metascraper-readability/test/snapshots/index.js.md
index ff8a6a2a5..8ad5da2e7 100644
--- a/packages/metascraper-readability/test/snapshots/index.js.md
+++ b/packages/metascraper-readability/test/snapshots/index.js.md
@@ -46,8 +46,8 @@ Generated by [AVA](https://avajs.dev).

     {
       author: null,
-      description: null,
+      description: 'Virtual Tour of 219 Shale Rd.',
       lang: null,
       publisher: null,
-      title: null,
+      title: '219 Shale Rd - Virtual Tour',
     }
diff --git a/packages/metascraper-readability/test/snapshots/index.js.snap b/packages/metascraper-readability/test/snapshots/index.js.snap
index 1ce165382..6507358ba 100644
Binary files a/packages/metascraper-readability/test/snapshots/index.js.snap and b/packages/metascraper-readability/test/snapshots/index.js.snap differ