This repository has been archived by the owner on Oct 30, 2020. It is now read-only.
Trying to index a doc whose docid contains a "high ascii" or Unicode character above 127 causes the following exception in restapi:
17669 05/02-00.50.12 RPC:ERRO Unexpected failure to run send_batch, reconnecting once @rpc.py:87
Traceback (most recent call last):
  File "../api/rpc.py", line 77, in wrap
    return att(*args, **kwargs)
  File "../gen-py/flaptor/indextank/rpc/LogWriter.py", line 39, in send_batch
    self.send_send_batch(batch)
  File "../gen-py/flaptor/indextank/rpc/LogWriter.py", line 46, in send_send_batch
    args.write(self._oprot)
  File "../gen-py/flaptor/indextank/rpc/LogWriter.py", line 139, in write
    self.batch.write(oprot)
  File "../gen-py/flaptor/indextank/rpc/ttypes.py", line 1679, in write
    iter138.write(oprot)
  File "../gen-py/flaptor/indextank/rpc/ttypes.py", line 1441, in write
    oprot.writeString(self.docid)
  File "../api/thrift/protocol/TBinaryProtocol.py", line 123, in writeString
    self.trans.write(str)
  File "../api/thrift/transport/TTransport.py", line 164, in write
    self.__wbuf.write(buf)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 0: ordinal not in range(128)
I think it's ok to reject docids with non-latin1 or non-ascii characters, but I think it should return an HTTP 400 instead of 503 "service unavailable". (Or maybe docids are supposed to accept non-ascii characters?)
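To make the 400-vs-503 suggestion concrete, here is a minimal sketch of an up-front docid check. It assumes the policy is "ASCII only", which is just one of the options floated above, and `check_docid` is a hypothetical helper, not something that exists in restapi:

```python
def check_docid(docid):
    """Return (http_status, message) for a candidate docid.

    Assumption: docids must be pure ASCII. This is a hypothetical
    policy for illustration, not confirmed IndexTank behavior.
    """
    try:
        docid.encode("ascii")
    except UnicodeError as exc:
        # Client error: report the offending position instead of
        # letting the failure surface later as a 503.
        return 400, "docid contains a non-ASCII character at position %d" % exc.start
    return 200, "ok"


print(check_docid(u"\xe2bc"))
print(check_docid(u"doc-1"))
```

Rejecting the request before it reaches the RPC layer would keep a bad docid from ever being handed to `send_batch`.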
Also, this seems to be related but I'm not sure yet: when indexing in batches when this happened, it seemed to cause some problem with the LogWriter, with the following stack trace:
ERROR [pool-1-thread-32] org.apache.thrift.server.TThreadPoolServer - [Error occurred during processing of message.] 2012-02-04 10:27:15,724
java.lang.IllegalStateException: Can't insert records to the live log without defining the index code
    at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
    at com.flaptor.indextank.storage.RawLog.write(RawLog.java:61)
    at com.flaptor.indextank.storage.LogWriterServer.send_batch(LogWriterServer.java:87)
    at com.flaptor.indextank.rpc.LogWriter$Processor$send_batch.process(LogWriter.java:214)
    at com.flaptor.indextank.rpc.LogWriter$Processor.process(LogWriter.java:193)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Finally, after this all happened, the LogWriter (slave) was taking all the CPU when no docs were being written, like it was in a spin loop. I did a kill -3 to get a thread stack dump, and one or two threads were RUNNABLE at this line:
at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:129)
at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
at com.flaptor.indextank.rpc.LogRecord.read(LogRecord.java:900)
...
I can create a separate issue for the LogWriter stuff if you want. But I'm not sure exactly what reproduces it yet.
Let me know if I can provide any more details.
I'm not a Python expert, but I dug around and noticed that Thrift uses StringIO, and I found this in the Python docs:
The StringIO object can accept either Unicode or 8-bit strings, but mixing the two may take some care. If both are used, 8-bit strings that cannot be interpreted as 7-bit ASCII (that use the 8th bit) will cause a UnicodeError to be raised when getvalue() is called.
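One way to avoid mixing the two string types would be to encode every text value to bytes before it reaches the buffer. A minimal sketch, assuming UTF-8 is acceptable on the wire. `to_wire_bytes` is a hypothetical helper, not part of Thrift or IndexTank, and `io.BytesIO` stands in for the transport's write buffer:

```python
import io


def to_wire_bytes(value, encoding="utf-8"):
    """Return value as bytes: text is encoded, bytes pass through unchanged."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# Stand-in for the transport's internal write buffer (an assumption).
buf = io.BytesIO()
buf.write(to_wire_bytes(u"\xe2-doc"))  # text containing a character above 127
buf.write(to_wire_bytes(b"plain"))     # already bytes, written as-is
payload = buf.getvalue()
print(repr(payload))
```

With everything normalized to bytes, the buffer never has to coerce a Unicode string through the ASCII codec.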
To reproduce, do this in python:
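A minimal sketch of the failure mode (not the original snippet). It assumes the root cause is a Unicode docid being coerced to ASCII on its way into a byte buffer. The explicit `encode("ascii")` stands in for the implicit coercion Python 2 performs, and the docid value is made up:

```python
import io

# Hypothetical docid whose first character is above 127.
docid = u"\xe2bc"


def write_docid_as_ascii(buf, value):
    """Mimic the implicit ASCII coercion applied when a unicode
    string is written to a byte-oriented buffer in Python 2."""
    buf.write(value.encode("ascii"))


buf = io.BytesIO()
error = None
try:
    write_docid_as_ascii(buf, docid)
except UnicodeEncodeError as exc:
    error = exc

print(repr(error))
```

The exception names the `ascii` codec and position 0, matching the last line of the restapi traceback.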