Queue fetcher updates #266

philbudne · 2024-03-15T04:49:27Z

Take another page from scrapy scheduling internals

Override _on_input_message, which runs in the Pika thread when a new
message is received from RabbitMQ.  Rather than just queuing the
message to the internal work queue (for consumption by worker
threads), decode it and work out when it could next be started, using
Slot.issue_interval (calculated from avg_seconds for request
completion) to maintain the "next_issue" time.  If the delay is less
than the fast delay queue time (set with --busy-delay-minutes), use
the Pika connection "call_later" method to delay putting the message
on the work queue until it's ripe to be issued.  If the delay is
longer than busy-delay-minutes, requeue the message to the -fast
queue.

This GREATLY reduces use of the -fast queue (lower CPU load) AND means
that requests can be started as soon as possible, without waiting for
the message to come around through RabbitMQ (better throughput).
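
The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the Slot class here is a hypothetical stand-in holding only the two fields named in the description, and the choices to issue immediately when no delay is needed and to leave next_issue untouched on requeue are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """Hypothetical per-site state; only the fields named in the PR text."""

    issue_interval: float    # seconds between request starts
    next_issue: float = 0.0  # monotonic time at which the next request may start


def schedule(slot: Slot, now: float, busy_delay_seconds: float) -> tuple[str, float]:
    """Decide how to handle a newly received message.

    Returns ("issue", 0.0), ("delay", seconds) or ("requeue", seconds).
    """
    start = max(now, slot.next_issue)
    delay = start - now
    if delay > busy_delay_seconds:
        # Too far in the future: hand the message back to the -fast queue.
        # Assumption: the slot's reservation is left unchanged in this case.
        return ("requeue", delay)
    # Reserve this start time so the next message is spaced after it.
    slot.next_issue = start + slot.issue_interval
    if delay > 0:
        # In the Pika thread this is where call_later would be used to
        # put the message on the work queue after `delay` seconds.
        return ("delay", delay)
    return ("issue", 0.0)
```

With an issue_interval of 2 seconds, two messages arriving at the same instant would be issued immediately and after 2 seconds respectively, with neither taking a trip through the -fast queue.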

Also: default worker count to the number of available CPU cores.
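
The description does not say how the core count is obtained. One possible way to compute such a default, shown purely as an assumption (default_worker_count is a hypothetical helper, not a name from this PR):

```python
import os


def default_worker_count() -> int:
    """Best-effort count of CPU cores usable by this process (at least 1)."""
    try:
        # Linux: respects CPU affinity masks (e.g. set with taskset).
        return max(1, len(os.sched_getaffinity(0)))
    except AttributeError:
        # Platforms without sched_getaffinity (e.g. macOS, Windows).
        return os.cpu_count() or 1
```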

kilemensi · 2024-03-19T05:45:04Z

indexer/worker.py

@@ -565,8 +563,17 @@ def _on_message(
        """
        im = InputMessage(chan, method, properties, body, time.monotonic())
        msglogger.debug("on_message tag #%s", method.delivery_tag)
+        self._on_input_message(im)
+
+    def _on_input_message(self, im: InputMessage) -> None:


If it's designed to be override-able, shouldn't it be named on_input_message (without the _ prefix)?

It's still an interface that MOST applications should never use/override, and perhaps with time I'll come up with a clean way to abstract away the nastiness!
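
For readers unfamiliar with the convention under discussion: a single leading underscore marks a method as internal, yet subclasses can still override it. A minimal sketch of the hook arrangement in the diff, using hypothetical stand-in classes (Worker, DelayingWorker) and plain bytes rather than the real InputMessage:

```python
class Worker:
    """Hypothetical stand-in for the base worker class."""

    def __init__(self) -> None:
        self.work_queue: list[bytes] = []

    def _on_message(self, body: bytes) -> None:
        # Entry point called by the messaging layer; delegates to the hook.
        self._on_input_message(body)

    def _on_input_message(self, body: bytes) -> None:
        # Default behavior: queue the message for worker threads.
        self.work_queue.append(body)


class DelayingWorker(Worker):
    """Hypothetical subclass that intercepts messages before queuing."""

    def __init__(self) -> None:
        super().__init__()
        self.held: list[bytes] = []

    def _on_input_message(self, body: bytes) -> None:
        if body.startswith(b"later:"):
            # Hold the message back instead of queuing it right away.
            self.held.append(body)
        else:
            super()._on_input_message(body)
```

Keeping the underscore signals that the hook is for the rare subclass that needs it, which matches the reply above that most applications should never override it.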

kilemensi · 2024-03-19T06:12:47Z

indexer/workers/fetcher/rss-queuer.py

+        self.link: str | None = ""
+        self.domain: str | None = ""
+        self.pub_date: str | None = ""
+        self.title: str | None = ""


Any reason it's preferred to initialise these to "" instead of None like the rest of the fields?

No ready answer! The ones with None values are ones that were added recently, and therefore REALLY optional. I originally didn't have the ones above as "optional" (see line 61 in red), but the change in XML parser might have forced my hand, and I didn't think too hard about how it looked....

kilemensi · 2024-03-19T06:27:33Z

indexer/workers/fetcher/rss-queuer.py

+            # mypy reval_type(rss) in "with s.rss_entry() as rss" gives Any!!
+            rss = s.rss_entry()
+            with rss:


Yeah, weird edge cases... not sure if it works, but PEP 526 says we can (should?) annotate the variable before using it.

    rss: RSSEntry
    with s.rss_entry() as rss:
        ...

I haven't been pre-declaring variables that don't have an initial value, unless the initial assignment doesn't paint a complete picture like:

    var: int | None
    if condition:
        var = 1
    else:
        var = None

mypy does a good job inferring the types, and I'd rather not have to add clutter (unless I'm overruled)!

One thing in PEP 526 that caught my eye was

PEP 484 explicitly states that type comments are intended to help with type inference in complex cases, and this PEP does not change this intention.

which aligns with my attitude that declaring a variable's type is there for when it's needed.

Re-reading my original comment, I see it can be misunderstood.

not sure if it works, but PEP 526 says we can (should?) annotate the variable before using it if inference fails.

Added the "if inference fails" part. I assumed this could be inferred (pun intended) from the context, i.e. the # mypy reval_type(rss) in "with s.rss_entry() as rss" gives Any!! comment line.

Otherwise yes, there shouldn't be any need for explicit type hinting.

kilemensi · 2024-03-19T06:30:07Z

pyproject.toml

@@ -10,6 +10,7 @@ dependencies = [
  "boto3 ~= 1.28.44",
  "docker ~= 6.1.0",
  "elasticsearch ~= 8.12.0",
+  # lxml installed by some other package?


💯 I think both mediacloud-metadata (via trafilatura) and scrapy require lxml.

kilemensi · 2024-03-19T11:42:24Z

indexer/workers/fetcher/tqfetcher.py

+            target_concurrency=self.args.target_concurrency,
+            max_delay_seconds=self.busy_delay_seconds,
+            conn_retry_seconds=self.args.conn_retry_minutes * 60,
+            min_interval_seconds=MIN_INTERVAL_SECONDS,


Shouldn't we use the value passed in via the --min-interval-seconds arg?

Good catch!!! Thanks!!!!
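
The kind of slip caught here can be illustrated with a small sketch. Everything below is hypothetical apart from the option name and the constant's name: the default value of 5.0 is invented, and parse_args and constructor_kwargs are illustrative helpers, not functions from tqfetcher.py.

```python
import argparse

MIN_INTERVAL_SECONDS = 5.0  # hypothetical default; the real value is not shown


def parse_args(argv: list[str]) -> argparse.Namespace:
    ap = argparse.ArgumentParser()
    ap.add_argument(
        "--min-interval-seconds",
        type=float,
        default=MIN_INTERVAL_SECONDS,  # the constant is only the default
        help="minimum interval, in seconds",
    )
    return ap.parse_args(argv)


def constructor_kwargs(args: argparse.Namespace) -> dict[str, float]:
    # Pass the parsed value, not the module constant, so that a value
    # given on the command line actually takes effect.
    return {"min_interval_seconds": args.min_interval_seconds}
```

Passing the constant directly would silently ignore whatever the operator supplied on the command line, while still behaving correctly whenever the default is used, which is why this sort of bug tends to surface only in review.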

philbudne · 2024-03-20T17:13:46Z

Ah, I understand now.... Good catch!!

I noted the "Any" issue in February as #233, which we put off as "long-term".

It looks like there are 29 uses of Story sub-object "getter" methods in "with" statements (plus a comment in indexer/workers/fetcher/rss-queuer.py that notes the problem!):

    # mypy reval_type(rss) in "with s.rss_entry() as rss" gives Any!!

And the explicit hint _does_ seem to solve the problem:

    (venv) ***@***.***:~/story-indexer$ cat a.py
    from indexer.story import BaseStory, RSSEntry

    def foo(s: BaseStory) -> None:
        reveal_type(s.rss_entry())
        with s.rss_entry() as r:
            reveal_type(r)

        r2: RSSEntry
        with s.rss_entry() as r2:
            reveal_type(r2)

        r3 = s.rss_entry()
        reveal_type(r3)
        with r3:
            reveal_type(r3)
    (venv) ***@***.***:~/story-indexer$ mypy a.py
    ...
    a.py:4: note: Revealed type is "indexer.story.RSSEntry"
    a.py:6: note: Revealed type is "Any"
    a.py:10: note: Revealed type is "indexer.story.RSSEntry"
    a.py:13: note: Revealed type is "indexer.story.RSSEntry"
    a.py:15: note: Revealed type is "indexer.story.RSSEntry"

I'd like to understand the failure before applying a work-around, and I'd prefer the assignment (since it avoids needing to explicitly type the variable), plus a comment like "mypy gets with ..... wrong", to a variable declaration that looks extraneous.

The problem is easy to reproduce with a simple class under mypy (but pytype gets it right), so I've asked about it on a python/typing chat.
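
A self-contained way to see the same symptom, offered only as a sketch: the Entry class below is a hypothetical stand-in, and giving __enter__ a return type of Any is one way to make mypy report Any for the "as" target. Whether that is the actual cause in indexer.story is an assumption; the thread leaves the cause open.

```python
from typing import Any


class Entry:
    """Hypothetical stand-in for a Story sub-object such as RSSEntry."""

    def __enter__(self) -> Any:  # assumption: a loosely typed __enter__
        return self

    def __exit__(self, *exc: object) -> None:
        return None


def make_entry() -> Entry:
    return Entry()


def demo() -> list[str]:
    seen: list[str] = []

    # mypy takes the type of r1 from __enter__, so here it is Any.
    with make_entry() as r1:
        seen.append(type(r1).__name__)

    # Work-around 1: declare the variable's type before the with statement.
    r2: Entry
    with make_entry() as r2:
        seen.append(type(r2).__name__)

    # Work-around 2: assign first, then enter; r3 keeps its inferred type.
    r3 = make_entry()
    with r3:
        seen.append(type(r3).__name__)

    return seen
```

At runtime the three forms behave identically; the difference is only in what a type checker can infer, which is why both work-arounds are purely cosmetic changes to the calling code.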

Phil Budne added 3 commits March 13, 2024 22:13

Convert indexer/workers/fetcher/rss-queuer.py to lxml iterparse

6c2a0d7

xml.sax parser cannot ignore non-conforming XML (control characters in titles)
* added lxml-stubs package
* ran make upgrade

indexer/workers/fetcher/tqfetcher.py: fix post redirect logging/tests

866e889

philbudne requested review from pgulley and kilemensi March 15, 2024 04:49

Phil Budne added 2 commits March 15, 2024 23:05

Comments and parameter tuning

16f3b92

slots gauges

6b5f124

pgulley approved these changes Mar 18, 2024

kilemensi approved these changes Mar 19, 2024

workers/fetcher/tqfetcher.py: use args.min_interval_seconds! caught i…

fa55202

…n review by @kilemensi

philbudne merged commit a7a8a8e into mediacloud:main Mar 20, 2024

philbudne deleted the qfetch-later branch March 20, 2024 20:42
