Possible to get HTML for winning content? #47

kevzettler · 2017-03-21T02:35:05Z

Is there any way to get the HTML for the winning block content? I'd also like to keep `img`, `code`, and `pre` elements, and preserve formatting with `p` and heading tags where possible.


matt-peters · 2017-03-28T00:24:22Z

Unfortunately, not easily. You can get the lxml etree object for the start tag of each block and use it to reconstruct the HTML, but that takes some work. If you'd like to try implementing it, you can start by passing blocks=True to analyze when extracting the content. This returns a list of block objects for the extracted content, and block.features['block_start_element'] contains the lxml object for each block.

Something like:

blocks = content_extractor.analyze(html, blocks=True)
start_elements = [block.features['block_start_element'] for block in blocks]

rferreiraperez · 2018-01-10T09:12:01Z

Would it be possible to keep at least the line breaks in the resulting text?

MSusik · 2018-06-29T13:01:35Z

@rferreiraperez A workaround is shown in #22.

EDIT:

In the current version, you can get the blocks with: dragnet.extract_content_and_comments(site, as_blocks=True)

lukaspistelak · 2020-03-30T15:05:58Z

blocks = content_extractor.analyze(html, blocks=True)
start_elements = [block.features['block_start_element'] for block in blocks]

AttributeError: 'Extractor' object has no attribute 'analyze'
