Possible to get HTML for winning content? #47
Comments
Unfortunately, not in an easy manner. You can get the lxml etree object for the start tag of each block and then use it to reconstruct the HTML, but that's not easy to do. If you'd like to try implementing it, you can start with something like:
blocks = content_extractor.analyze(html, blocks=True)
start_elements = [block.features['block_start_element'] for block in blocks]
Would it be possible to keep at least the line breaks in the result text?
@rferreiraperez A workaround has been shown in #22.
EDIT: In the current version, you can get blocks with:
AttributeError: 'Extractor' object has no attribute 'analyze'
Is there any method to get the HTML for the winning block content? I'd also like to get img, code, and pre elements, and preserve formatting with p and heading tags where possible.