File Metadata error when parsing HDoujin Downloader's info.json files inside zip files #40

Open
Dystasia opened this issue Sep 13, 2020 · 6 comments


@Dystasia

File Metadata parser fails for info.json files generated from HDoujin Downloader when inside zip files. Same info.json when extracted parses with no issues whatsoever.

Here is the plugin.log:

Sep-09 00:16:49--INFO pluginctx.file-metadata.main: Attempting with DataType.eze
Sep-09 00:16:49--WARNING pluginctx.file-metadata.extractors.common: An error occured while trying to parse file into a dict
Sep-09 00:16:49--INFO pluginctx.file-metadata.main: Skipping DataType.eze
Sep-09 00:16:49--INFO pluginctx.file-metadata.main: Attempting with DataType.hdoujin
Sep-09 00:16:49--WARNING pluginctx.file-metadata.extractors.common: An error occured while trying to parse file into a dict
Sep-09 00:16:49--INFO pluginctx.file-metadata.main: Skipping DataType.hdoujin

Let me know if you need an example, but really this is happening with all my files.

@Dystasia
Author

Actually, it is not all of them. I am trying to identify the differences but I am guessing it has something to do with the structure of some info files.

@Dystasia
Author

Dystasia commented Sep 14, 2020

OK, I found the issue. It has something to do with special characters when zipped. This JSON works when unzipped but not when zipped:

@zatsuna

zatsuna commented Sep 14, 2020

@Dystasia
I only have zip and rar files. I did some testing and here's what I found out.
The File Metadata plugin finds and successfully adds tags but only if the folder is unzipped. I don't have any unzipped galleries, so I didn't notice this before. I have many .zip galleries and none works with File Metadata.
It worked fine with .zip galleries in HPX from a year before.

Also, I don't get duplicate galleries with unzipped folders when scanning for new galleries. If galleries are zipped, I always get duplicates of every gallery regardless of "Scan only for new galleries" option being selected. Every scan adds another duplicate.

These two issues are probably related to each other as they both are solved by unzipping.

@Dystasia
Author

Dystasia commented Sep 17, 2020

Just an update of how I attempted to fix this.

First, the exception actually thrown when trying to parse is:
'charmap' codec can't decode byte 0x9d in position 314: character maps to <undefined>

This probably means the file is being read without utf-8 encoding.
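The error above can be reproduced outside HPX. This is a minimal sketch, assuming the fallback codec is cp1252 (a common Windows default; the thread does not confirm which codec HPX ends up using). The heart character U+2764 encodes to the UTF-8 bytes E2 9D A4, and 0x9D has no mapping in cp1252, so decoding those bytes with the wrong codec fails in the same way:

```python
# UTF-8 bytes for a short string ending in a heart character (U+2764).
raw = "stimulation. ❤".encode("utf-8")

try:
    # Assumption: cp1252 stands in for whatever non-UTF-8 default is in effect.
    raw.decode("cp1252")
    error = None
except UnicodeDecodeError as exc:
    # The second byte of the heart (0x9D) is undefined in cp1252.
    error = exc

print(error)
```

Characters whose UTF-8 bytes all happen to be defined in cp1252 decode without an exception but come out as mojibake, which may be why only some info.json files trigger the failure.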

The reading and parsing of the file is happening in:

    with fs.open("r", **kw) as f:
        d = json.load(f)

even though the encoding seems to get set at:

    if not fs.inside_archive:
        kw['encoding'] = 'utf-8'

this doesn't seem to work for compressed info.json files. When I remove the if condition, I get the exception:
open() got an unexpected keyword argument 'encoding'

I can't see the contents of hpx.command.CoreFS, even though the documentation states it is a file handler/wrapper, so I'm stuck: I don't know the interface of this class or how to force the encoding another way.
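One way to force the encoding without an `encoding` keyword is to open the member as bytes and wrap the stream. The sketch below uses Python's standard `zipfile` and `io.TextIOWrapper` as a stand-in, since the interface of hpx.command.CoreFS is not known here; `load_json_from_zip` is a hypothetical helper, not part of the plugin:

```python
import io
import json
import zipfile


def load_json_from_zip(archive, member):
    """Read a JSON member from a zip archive, decoding it as UTF-8."""
    with zipfile.ZipFile(archive) as zf:
        # ZipFile.open() returns a binary stream and takes no encoding argument.
        with zf.open(member) as raw:
            # Wrap the binary stream so decoding does not depend on the
            # platform's default encoding.
            with io.TextIOWrapper(raw, encoding="utf-8") as text:
                return json.load(text)


# Build a small archive in memory to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    # writestr() encodes str data as UTF-8 before storing it.
    zf.writestr(
        "info.json",
        json.dumps({"description": "stimulation. ❤"}, ensure_ascii=False),
    )
buffer.seek(0)

result = load_json_from_zip(buffer, "info.json")
print(result)
```

Whether the same wrapping can be applied to the object returned by `fs.open` depends on whether CoreFS exposes a binary mode, which would need confirming.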

@twiddli do you have any input? Is this something that needs to be fixed in hpx core instead of the plugin?

@twiddli
Member

twiddli commented Sep 17, 2020

Hello, thank you guys for the troubleshooting. This is such a weird issue as I still can't repro it yet.
Creating a zip file with an info.json with the contents:

{ "manga_info": { "title": "Bad Girl", "original_title": "", "author": [], "artist": [ "INAGO" ], "circle": [], "scanlator": [], "translator": [], "publisher": "FAKKU", "description": "It’s because I’m a good student…that I need some stimulation. ❤", "status": "", "chapters": "N/A", "pages": 20, "tags": { "Misc": [ "Schoolgirl Outfit", "Creampie", "Deepthroat", "Exhibitionism", "Glasses", "Hentai", "Humiliation", "Loli", "Masturbation", "Teacher", "Toys", "Uncensored", "X-Ray" ] }, "type": "", "language": [ "English" ], "released": "", "reading_direction": "", "characters": [], "series": "", "parody": [ "Original Work" ], "url": "https://hentainexus.com/read/6019" } } 

works totally fine. I even put the character ❤ in the filename for good measure and got no issues.

Can you check if the file is utf-8 encoded?
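A quick way to run that check is to read the raw bytes and attempt a strict UTF-8 decode. This is a minimal sketch; `is_valid_utf8` is a hypothetical helper name:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same kind of text saved with two different encodings.
utf8_bytes = "It’s ❤".encode("utf-8")
cp1252_bytes = "It’s".encode("cp1252")  # the apostrophe becomes the single byte 0x92

print(is_valid_utf8(utf8_bytes), is_valid_utf8(cp1252_bytes))
```

For a file on disk, the bytes can be obtained by opening it in binary mode (`"rb"`) and passing the result of `read()` to the helper.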

Also, for more insight on what's happening on that line of code, it checks if the file is inside the archive and omits specifying the encoding because the archive handler from the std lib doesn't accept an encoding parameter when opening files from inside the archive. I think this is because it is assumed the encoding is utf-8.
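To illustrate that behaviour, here is a small sketch assuming the archive handler is Python's `zipfile` module (the thread does not confirm which module HPX uses). In `zipfile`, `ZipFile.open()` rejects an `encoding` keyword and hands back raw bytes, so no decoding happens at that layer; `json.loads` accepts bytes directly and decodes them as UTF-8, UTF-16 or UTF-32:

```python
import io
import json
import zipfile

# Create an in-memory archive holding a UTF-8 encoded info.json.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("info.json", '{"description": "❤"}')

with zipfile.ZipFile(buffer) as zf:
    try:
        # ZipFile.open() has no encoding parameter, so this raises TypeError.
        zf.open("info.json", "r", encoding="utf-8")
        rejected = False
    except TypeError:
        rejected = True

    # Without the keyword, the member is returned as a binary stream.
    with zf.open("info.json") as member:
        payload = member.read()

# json.loads() detects the Unicode encoding of bytes input on its own.
parsed = json.loads(payload)
print(rejected, type(payload).__name__, parsed)
```

If a text wrapper is placed around such a stream without an explicit encoding, Python falls back to the locale's preferred encoding, which would be consistent with the 'charmap' error reported earlier in the thread.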

When I save the info.json file inside the archive with an encoding other than utf-8, I get this error: 'CP_UTF8' codec can't decode bytes in position 0--1: No mapping for the Unicode character exists in the target code page. This suggests that it expects utf-8 for all text files.

@zatsuna

zatsuna commented Sep 18, 2020

All my files generated by E-Hentai Downloader have a UTF-8 info.txt.

Sample info file:
info.txt
