Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeEncodeError in mergerfs.balance #106

Closed
cryzed opened this issue Jul 17, 2020 · 5 comments
Closed

UnicodeEncodeError in mergerfs.balance #106

cryzed opened this issue Jul 17, 2020 · 5 comments

Comments

@cryzed
Copy link

cryzed commented Jul 17, 2020

Hey, I have seen that this was an issue before and that you tried to solve it by replacing sys.stdout and sys.stderr with a TextIOWrapper that hopefully just surrogateescapes encoding errors... but this didn't work for me for some files, causing me to just wrap the print lines in try-except UnicodeEncodeError blocks.

When I took a closer look, I realized there's something strange going on. You shouldn't need the TextIOWrapper workaround at all. Every path that find_a_file (os.walk) returns should be a regular string (Unicode codepoints), i.e. it was already successfully decoded by Python at this point. If you then try to re-encode them through the TextIOWrapper instance that uses locale.getpreferredencoding(False) as its encoding, the encoding might actually fail, because that's not necessarily an encoding that the filesystem itself uses (I think). What is strange however is, that my result for locale.getprefferencoding(False) definitely returns UTF-8, so I'm confused how the encoding error could have arisen in the first place (I definitely don't have filenames/paths that fall outside the UTF-8 table). It feels like something is messing with the set locale while the script is running... rsync maybe? I saw in its source code that it likes to set LC_TYPE to an empty string, but I'm not sure if this would result in a different locale.getprefferencoding(False) output.

My suggestion would be to just force UTF-8 encoding in the TextIOWrapper instances, which IMHO should be able to encode every string you throw at it into bytes -- even if the output in the console might look a bit funny then (because of a locale mismatch), it's better than the script crashing. Maybe this isn't a good idea, I'm not sure. I unfortunately can't test this super well, because I'm already "past" the files that caused errors, and didn't take note of their names. Maybe you have some input on this?

@trapexit
Copy link
Owner

Funny this should come up for you as someone else ran into similar with my other tool trapexit/scorch#38

And yeah... I addressed it by forcing utf8 in the textiowrapper. It appears to work in their case as far as I think I reproduced it. Been waiting for confirmation before adding it to all my python tools.

@cryzed
Copy link
Author

cryzed commented Jul 17, 2020

Yes it seems really likely now that something is messing with the locale settings during the runtime of mergerfs.balance, causing locale.getprefferencoding(False) to return "ASCII" instead of the actually configured locale. A possibly workaround might be to store the output of locale.getprefferencoding(False) before you call into other code or run any subprocess and then use the output value from that for TextIOWrappers encoding argument... or maybe just forcing UTF-8 is smarter. You could also not use surrogateescape and use backslashreplace instead. It's a bit confusing what's happening here.

@trapexit
Copy link
Owner

I'm not sure why I went with surrogateescape over backslashreplace. Might have just been what I found while searching.

@trapexit
Copy link
Owner

@cryzed
Copy link
Author

cryzed commented Jul 17, 2020

Thanks, this seems to work -- but as I said, I'm currently not re-balancing the same files I was re-balancing earlier. I'll keep an eye on it for now.

@cryzed cryzed closed this as completed Jul 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants