This repository has been archived by the owner on Oct 30, 2020. It is now read-only.

indexing docid with non-ascii characters causes 503 error #7

Open
clamprecht opened this issue Feb 5, 2012 · 3 comments

@clamprecht
Contributor

Trying to index a doc whose docid contains a "high ascii" or Unicode character above 127 causes the following exception in restapi:

17669 05/02-00.50.12      RPC:ERRO Unexpected failure to run send_batch, reconnecting once @rpc.py:87
Traceback (most recent call last):
  File "../api/rpc.py", line 77, in wrap
    return att(*args, **kwargs)
  File "../gen-py/flaptor/indextank/rpc/LogWriter.py", line 39, in send_batch
    self.send_send_batch(batch)
  File "../gen-py/flaptor/indextank/rpc/LogWriter.py", line 46, in send_send_batch
    args.write(self._oprot)
  File "../gen-py/flaptor/indextank/rpc/LogWriter.py", line 139, in write
    self.batch.write(oprot)
  File "../gen-py/flaptor/indextank/rpc/ttypes.py", line 1679, in write
    iter138.write(oprot)
  File "../gen-py/flaptor/indextank/rpc/ttypes.py", line 1441, in write
    oprot.writeString(self.docid)
  File "../api/thrift/protocol/TBinaryProtocol.py", line 123, in writeString
    self.trans.write(str)
  File "../api/thrift/transport/TTransport.py", line 164, in write
    self.__wbuf.write(buf)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 0: ordinal not in range(128)

To reproduce, do this in python:

from indextank.client import ApiClient
c = ApiClient('<YOUR_API_URL>')
idx = c.create_index('testascii')
idx.add_document("â", {"text": "a"})

I think it's ok to reject docids with non-latin1 or non-ascii characters, but I think it should return an HTTP 400 instead of 503 "service unavailable". (Or maybe docids are supposed to accept non-ascii characters?)
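If rejecting such docids were the chosen behavior, the check could happen before the Thrift call so the client gets a 400 rather than a 503. The following is only a sketch of that idea in Python 3 syntax: `check_docid`, its `allow_non_ascii` flag, and the messages are hypothetical names for illustration, not the actual restapi code.

```python
def check_docid(docid, allow_non_ascii=False):
    """Return an (http_status, message) pair for a docid.

    Hypothetical helper for illustration; not part of restapi.py.
    """
    # Reject anything that is not a non-empty text string.
    if not isinstance(docid, str) or not docid:
        return 400, "docid must be a non-empty string"
    if not allow_non_ascii:
        try:
            # Raises UnicodeEncodeError for any character above 127.
            docid.encode("ascii")
        except UnicodeEncodeError:
            return 400, "docid contains non-ASCII characters"
    return 200, "ok"


status, message = check_docid("â")
```

With `allow_non_ascii=True` the same docid passes, which would correspond to the other option of accepting non-ASCII docids.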

Also, this seems related, but I'm not sure yet: when this happened during batch indexing, it seemed to cause a problem with the LogWriter, with the following stack trace:

ERROR [pool-1-thread-32] org.apache.thrift.server.TThreadPoolServer - [Error occurred during processing of message.] 2012-02-04 10:27:15,724
java.lang.IllegalStateException: Can't insert records to the live log without defining the index code
        at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
        at com.flaptor.indextank.storage.RawLog.write(RawLog.java:61)
        at com.flaptor.indextank.storage.LogWriterServer.send_batch(LogWriterServer.java:87)
        at com.flaptor.indextank.rpc.LogWriter$Processor$send_batch.process(LogWriter.java:214)
        at com.flaptor.indextank.rpc.LogWriter$Processor.process(LogWriter.java:193)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

Finally, after this all happened, the LogWriter (slave) was taking all the CPU when no docs were being written, like it was in a spin loop. I did a kill -3 to get a thread stack dump, and one or two threads were RUNNABLE at this line:

at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:129)
at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
at com.flaptor.indextank.rpc.LogRecord.read(LogRecord.java:900)
...

I can create a separate issue for the LogWriter stuff if you want. But I'm not sure exactly what reproduces it yet.

Let me know if I can provide any more details.

@dbuthay
Contributor

dbuthay commented Feb 6, 2012

We should support unicode docids. Actually, see __validate_docid in api/restapi.py.

Check the code at https://github.com/linkedin/indextank-service/blob/master/api/restapi.py#L45

So it seems the code that sends the update to the LogStorage does not support non-ASCII docids.

@clamprecht
Contributor Author

I'm not a Python expert, but I dug around and noticed that thrift uses StringIO. I found this in the Python docs:

The StringIO object can accept either Unicode or 8-bit strings, but mixing the two may take some care. If both are used, 8-bit strings that cannot be interpreted as 7-bit ASCII (that use the 8th bit) will cause a UnicodeError to be raised when getvalue() is called.

from http://docs.python.org/library/stringio.html

Maybe this is what's happening (and why it's happening in the middle of a thrift call)?
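If that is the cause, one possible workaround is to encode the docid to UTF-8 bytes before it reaches the transport buffer, so no implicit ASCII conversion is attempted. Below is a small sketch of the idea, written in Python 3 syntax even though the trace above comes from Python 2. `write_string` is a hypothetical stand-in for the protocol's string writer (a 4-byte big-endian length followed by the bytes), not the Thrift library's own code.

```python
import io


def write_string(buf, value):
    """Write a length-prefixed string to a binary buffer.

    Hypothetical stand-in for a binary protocol's string writer.
    Text is encoded as UTF-8 first, so the buffer only ever sees bytes.
    """
    if isinstance(value, str):
        value = value.encode("utf-8")
    buf.write(len(value).to_bytes(4, "big"))  # 4-byte big-endian length
    buf.write(value)


buf = io.BytesIO()
write_string(buf, "â")
payload = buf.getvalue()  # 4 length bytes followed by the 2 UTF-8 bytes of "â"
```

By contrast, `"â".encode("ascii")` raises the same kind of UnicodeEncodeError shown in the traceback. In Python 2 the analogous step would presumably be calling `docid.encode('utf-8')` on unicode objects before they are handed to Thrift, but that is untested here.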

@clamprecht
Contributor Author

It also seems that when batch indexing, a single document causing this issue in the batch can cause the whole batch to fail and return a 503.
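One way to avoid losing the whole batch would be to check each document up front and send only the ones that pass, reporting the rest individually. This is a rough sketch in Python 3 syntax. `partition_batch`, `ascii_only`, and the shape of the per-document result are assumptions for illustration, not the existing batch API.

```python
def ascii_only(docid):
    """Example policy: accept only ASCII text docids (an assumption)."""
    return isinstance(docid, str) and docid.isascii()


def partition_batch(docs, is_valid):
    """Split a batch into documents safe to send and per-document results.

    Hypothetical helper; `docs` is a list of dicts with a "docid" key.
    """
    accepted = []
    results = []
    for doc in docs:
        docid = doc.get("docid")
        ok = is_valid(docid)
        results.append({"docid": docid, "added": ok})
        if ok:
            accepted.append(doc)
    return accepted, results


batch = [{"docid": "a1", "fields": {"text": "a"}},
         {"docid": "â", "fields": {"text": "a"}}]
accepted, results = partition_batch(batch, ascii_only)
```

Only `accepted` would then go to the LogWriter, and `results` lets the caller see which documents were skipped.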
