Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OH 4.3.0/4.3.1 is never getting past startlevel 30 for some users #4547

Open
rkoshak opened this issue Jan 8, 2025 · 15 comments
Open

OH 4.3.0/4.3.1 is never getting past startlevel 30 for some users #4547

rkoshak opened this issue Jan 8, 2025 · 15 comments
Labels
bug An unexpected problem or unintended behavior of the Core

Comments

@rkoshak
Copy link

rkoshak commented Jan 8, 2025

Expected Behavior

When OH starts the rule engine starts and begins executing rules.

Current Behavior

It's been reported at least three times now on the forum:

The first link has the most thorough investigation. I'm not positive that it's the same root problem for all three as they differ in details.

In all three cases the problem appeared when upgrading to 4.3.1, but backporting to 4.3.0 does not fix the problem. So it might be a regression from before 4.3.0 release.

File based Rules DSL rules seem to be involved in all three cases.

Possible Solution

Steps to Reproduce (for Bugs)

This is trickly because clearly it's not happening for everyone.

Context

Users who face this have no rules running.

Mitigation is to go back to 4.2.

Your Environment

  • Version used: (e.g., openHAB and add-on versions)
  • Environment name and version (e.g. Chrome 111, Java 17, Node.js 18.15, ...):
  • Operating System and version (desktop or mobile, Windows 11, Raspbian Bullseye, ...):
@rkoshak rkoshak added the bug An unexpected problem or unintended behavior of the Core label Jan 8, 2025
@openhab-bot
Copy link
Collaborator

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/rule-engine-does-not-start-after-openhab-restart-due-to-high-amount-of-rules/161440/7

@openhab-bot
Copy link
Collaborator

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/rule-cron-blocking-openhab/161469/42

@openhab-bot
Copy link
Collaborator

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/rule-engine-not-running-although-bundle-active/161517/10

@jlaur
Copy link
Contributor

jlaur commented Jan 8, 2025

I was able to reproduce the invalid cron expression issue by adding this rule in a file:

rule "Test"
when
    Time cron "0 0  5 31 2 ? ?"
    //         s m hh DD MM WW YYYY
then
    // Nothing
end

The rule will stay in state INITIALIZING and it's impossible to shut down openHAB. When already having this rule while starting openHAB, no rules will be processed after trying to load this rule and the system is stuck at run level 20.

@stefan-hoehn
Copy link
Contributor

@mstormi
Copy link

mstormi commented Jan 10, 2025

I'm inclined to believe it's related to memory usage.
I have boxes with both, 64bit and 32bit OS/Java, and the problem shows up only on 64bit Raspis. They show an exceptional increase in mem usage like 4G vs 1.1G for about the same OH config on a 32bit OS.
Someone else (see links in post #1) ran on x86 but he also only had spent 4G on his VM.

Please let's collect information here to help with diagnosing.

Anyone to have the problem that file based DSL rules do not execute, please:
Name your HW, OS variant (64 bit ???) and give the output of java -version. Have a look at your java process size and swap size.
If possible, downgrade to 4.2.3 or better any 4.3 snapshot that your system used to work with and let us know the latest version you don't have a problem with.

Check bundle:list to see if there's any bundle in non-Active state.
Set log:set debug org.openhab.core.automation it will show you when the rules engine is not started (for me, the rules engine bundle is active so it's not easy to spot that it does not work).

If you know of any specific log setting to help with further diagnosis or a log message that should appear but does not, let us know.

@kstuken
Copy link

kstuken commented Jan 10, 2025

Have started experiencing something quite similar after upgrading from 4.2 to 4.3.
Running on Openhabian 64 bit, Raspberry 4 with 4GB, system is running only Openhab.
It is a fairly large system, 1100 items, 135 things, lots of JS rules. All files based. No DSL rules, though. (Actually, just noticed that there was still one DSL rule I had not commented out (which I did now); but that one should never have triggered)
While I have been on Openhab since 2.x, I installed from scratch when migrating from 3.x to 4.2, and then brought everything back manually, ie there should be no ballast / pollution from old backups.

For me the issue occurs occasionally. I shut down OH via shell script every night to do some backup and maintenance, and then restart again after a few minutes. This has worked well for many years.
Starting with the upgrade to 4.3 this now fails occasionally. On the occasions it fails the system will get to runlevel 40, but then the rule engine will just not start, and runlevel will not go higher anymore. JS rules are loaded after the rule engine started, so none of the JS rules get loaded.
System otherwise runs as expected, but no rules. This has occured perhaps six times since I installed OH 4.3 on Dec 19, so every fourth or fifth system restart. So far, sysctl restart openhab has reliably fixed the issue in those cases.
There is nothing obvious in the logs that distinguish a failed start and a successful start, other than the lack of rules engine started and everything that comes after it.
I have now turned on debug logging for automation, as suggested by Markus.

Output of java --version:

openjdk 17.0.13 2024-10-15
OpenJDK Runtime Environment (build 17.0.13+11-Debian-2deb12u1)
OpenJDK 64-Bit Server VM (build 17.0.13+11-Debian-2deb12u1, mixed mode, sharing)

Let me know in case you would want further info on the system.

@morph166955
Copy link

I'm inclined to believe it's related to memory usage.
I have boxes with both, 64bit and 32bit OS/Java, and the problem shows up only on 64bit Raspis. They show an exceptional increase in mem usage like 4G vs 1.1G for about the same OH config on a 32bit OS.
Someone else (see links in post #1) ran on x86 but he also only had spent 4G on his VM.

Out of curiosity, are you running bindings that use jupnp? I have an utterly enormous amount of memory usage because of my upnp environment. If I disable those bindings, memory usage goes way down. I ended up moving my OH off to a higher end Intel NUC11 to make sure I had plenty of memory to run everything.

@mstormi
Copy link

mstormi commented Jan 11, 2025

no jupnp, not that I'm aware

@mstormi
Copy link

mstormi commented Jan 11, 2025

FWIW I now have a 4.3.1 instance that says it's at runlevel 100 - but log output says the engine doesn't run
After one restart, it now works fine.

@rkoshak pls change this issue's title as the runlevel IS 100

@rkoshak
Copy link
Author

rkoshak commented Jan 11, 2025

I'm not sure that there isn't two if not three different issues going on here.

One issue has been identified and I think it's been fixed already. Giving Quartz a bogus cron expression leads to an infinite loop.

However, others are being stuck at start level 30 and now you are reporting getting past that, and no cron expressions are involved.

The only real commonality I can see is everyone with the problem is using file based Rules DSL and a lot of RAM (though is it an unusual amount of RAM? 🤷

I don't know if it still makes sense to have these all combined in one issue, or if it makes sense to split them up as they might not be related problems after all.

@mstormi
Copy link

mstormi commented Jan 11, 2025

Are we sure that in the original we couldn't get past runlevel 30 or did you only indirectly assume that because no rule was executing ?

On RAM, I agree. I have 64bit installations that work (with more or less same config) and are of equal size (4+ G) which is huge but apparently not unusual and not all in itself a symptom or problem.

@rkoshak
Copy link
Author

rkoshak commented Jan 12, 2025

Are we sure that in the original we couldn't get past runlevel 30 or did you only indirectly assume that because no rule was executing ?

Two of the three threads I linked to above confirmed that runlevel was stuck at 30 before I created the issue. Your thread actually popped up in the middle of my creating the issue and I didn't ask for confirmation there.

Checking the runlevel was the one of the first things I asked about on the other threads though. Your report is the first that shows a problem with OH making it past runlevel 30 so that kind of indicates it might be a different issue with the same symptom.

We know that one issue was the cron expression issue for sure. Those experiencing the problem without a bogus cron trigger remain explanation and mitigation.

Over all, it would be very useful if those who have cron triggered rules and are able can test with the snapshots (I know that's a bigger lift now with OH 5 than usual) and confirm their problem is fixed. Then maybe we can tell if those who got stuck at start level 30 was because of the infinite loop and separate those from users experiencing the same symptom for different reasons.

@holgerfriedrich
Copy link
Member

holgerfriedrich commented Jan 12, 2025

@rkoshak All bugfixes should be backported to 4.3.x branch, and will thus be included in our 4.3.2 patch release (expected later today). This should be easier to test for users who have not yet moved to 64-bit/Java 21.

@rkoshak
Copy link
Author

rkoshak commented Jan 12, 2025

I was hoping it was going to be back ported and I didn't know 4.3.2 was coming out so soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An unexpected problem or unintended behavior of the Core
Projects
None yet
Development

No branches or pull requests

8 participants