Travis CI linux-aarch64, linux-ppc64le jobs failing #185

Open
mfansler opened this issue Aug 1, 2024 · 44 comments

Comments

@mfansler
Member

mfansler commented Aug 1, 2024

Travis CI linux-aarch64 and linux-ppc64le jobs continue to have high failure rates (10%-25%). This manifests either as infinite queuing or premature cancellation. Restarting the individual build jobs is often sufficient; however, maintainers may also consider moving builds to Azure following #185 (comment).

@mfansler
Member Author

mfansler commented Aug 1, 2024

I have been in touch with Travis support via email but no resolution yet.

@h-vetinari
Member

I've seen the issue on the gtest feedstock as well, independently of R. In any case, I'm not 100% sure this qualifies as "major". According to the status page we build around 100-150x more on azure than on travis, so <0.5% of our builds are affected[1], and it's possible (at least in principle) to switch them to azure (either emulated or cross-compiled).

I know this is splitting hairs a bit, so no need to change anything per se (I was thinking along the lines of avoiding a "boy who cried wolf" situation where people eventually don't take our status seriously, but one time isn't going to do that).

That said, thanks a lot for trying to get to the bottom of this @mfansler! 🙏

Footnotes

  1. halved from 1% because aarch builds are still working

@jakirkham
Member

Was debating between "degraded" and "major outage". Ok with using "degraded" instead.

That said, this appears to be affecting all(?) native linux_ppc64le builds and the R migration. So it seemed worthy of it in this case

@jakirkham
Member

Mervin, have you heard anything from Travis CI?

FWIW it seems Travis users outside conda-forge have the same issue. So it is not just us

@mfansler
Member Author

mfansler commented Aug 2, 2024

No word since when I created this. I just sent a ping to see if they have any updates.

@h-vetinari
Member

... conda-forge/conda-forge.github.io#1521 ... 🙄

@h-vetinari
Member

It's been more than a week. Any affected feedstocks should consider either of the following changes in conda-forge.yml:

  • moving to cross-compilation (might need recipe changes)
    build_platform:
      linux_ppc64le: linux_64
  • or emulation (much slower, but shouldn't need changes)
    provider:
      linux_ppc64le: azure

@dhirschfeld
Member

Do you know of any example PR where a recipe was moved to using cross-compilation for linux_ppc64le?

@h-vetinari
Member

You mean for R or in general?

@hmaarrfk

hmaarrfk commented Aug 9, 2024

xref: conda/conda-build#5349 (just linking here since I tried to move a package out of PPC64le and hit this)

@h-vetinari
Member

> xref: conda/conda-build#5349 (just linking here since I tried to move a package out of PPC64le and hit this)

That should be a very rare case though. Cross-compilation and noarch: python aren't often mixed, because if an output is actually noarch, it suffices to build it just once (e.g. on linux-64).

@dhirschfeld
Member

> You mean for R or in general?

In general. The actual feedstock where I'm hitting this is a go recipe.

@mfansler
Member Author

mfansler commented Aug 9, 2024

Don't know how much of this will be helpful for other contexts, but here's an example for conversion to cross-compilation on an R feedstock: conda-forge/r-phylobase-feedstock#10

Our recipe (meta.yaml) updates include these changes to the build: requirements:

  • adding cross-r-base
  • adding all required r-* dependencies
  • all the above go under a # [build_platform != target_platform] selector
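
As a rough illustration of the three bullets above, the requirements section of a cross-compiled R recipe might look like the sketch below. The package name `r-examplepkg` is a placeholder and is not taken from the linked PR; the `{{ r_base }}` pin and the compiler line are assumptions based on common conda-forge recipe conventions, so check the linked PR for the actual changes.

```yaml
# meta.yaml (sketch only; r-examplepkg is a hypothetical dependency)
requirements:
  build:
    # These entries are only needed when the build machine differs from the target
    - cross-r-base {{ r_base }}  # [build_platform != target_platform]
    - r-examplepkg               # [build_platform != target_platform]
    - {{ compiler('c') }}
  host:
    - r-base
    - r-examplepkg
  run:
    - r-base
    - r-examplepkg
```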

For conda-forge.yml, we use (as already mentioned):

build_platform:
  linux_ppc64le: linux_64
test: native_and_emulated

NB: I usually switch linux_aarch64 to cross-compile as well. If one works they usually both work and the cross-compilation has negligible time difference.
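
Assuming the same `build_platform` key pattern applies to both architectures, switching both to cross-compilation would look roughly like this in conda-forge.yml:

```yaml
# conda-forge.yml (sketch): cross-compile both non-x86 Linux targets from linux_64
build_platform:
  linux_aarch64: linux_64
  linux_ppc64le: linux_64
# Run tests natively where possible, and under emulation when cross-compiling
test: native_and_emulated
```

The feedstock needs to be rerendered after editing this file so that the CI configuration is regenerated.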

It is not infrequent that we also need to patch the source's build scripts. Since CRAN builds everything natively, our upstreams do not always consider cross-compilation, e.g., they use autoconf scripts that include run tests, which cannot execute when cross-compiling. Often it can be easiest to simply skip such configure scripts and directly provide pre-determined compilation flags.

@mfansler
Member Author

mfansler commented Aug 9, 2024

> It's been more than a week. Any affected feedstocks should consider either of the following changes in conda-forge.yml:
>
>   • moving to cross-compilation (might need recipe changes)
>     build_platform:
>       linux_ppc64le: linux_64
>   • or emulation (much slower, but shouldn't need changes)
>     provider:
>       linux_ppc64le: azure

Just want to clarify the explicit combinations here:

| build_platform | provider | CI - Build Mode |
| --- | --- | --- |
| linux_ppc64le | default | Travis CI - native |
| linux_ppc64le | azure | Azure - emulate (slow!) |
| linux_64 | default | Azure - cross-compile (+ emulated tests) |
| linux_64 | azure | Azure - cross-compile (+ emulated tests) |

@minrk
Member

minrk commented Aug 9, 2024

I can't find any real competitors to Travis for IBM architectures. But I did find that OSU's Open Source Lab hosts (IBM sponsored) Jenkins instances for ppc and s390x for open source. I'm guessing they are not really prepared to handle conda-forge's scale, but it might be worth a contact in any case.

@beckermr
Member

beckermr commented Aug 9, 2024

Thanks Min. We have had access to those for a long time now. Agreed they are not really for our scale.

@jakirkham
Member

Has there been any word from Travis CI on this issue?

@mfansler
Member Author

Nothing through my email. I am also unable to view the ticket they created (it always asks for "Sign-in", then dumps me on the Dashboard). Maybe someone from Core should take over.

@jakirkham
Member

Thanks Mervin! 🙏

Have we seen any Travis CI builds run on linux-ppc64le (including non-R ones)?

@jaimergp
Member

Should we edit the title and issue description to reflect this new information?

@mfansler
Member Author

@jaimergp there are still other linux-aarch64 jobs passing - I'm not convinced that wasn't a sporadic failure. But if non-R feedstocks are seeing consistent failures, the issue description could be generalized.

@mfansler
Member Author

Travis CI reports to have resolved the issue and I have confirmed with several jobs that linux-ppc64le runs are indeed running normally again.

@jaimergp
Member

Sounds like we can close this soon, then? Let's keep it open for a few more hours just in case, but will close by EOD if we can confirm it's working.

@jaimergp
Member

Checked https://app.travis-ci.com/github/conda-forge and there are several feedstocks with passing builds for both PPC and ARM from a few hours ago (e.g. https://app.travis-ci.com/github/conda-forge/databricks-cli-feedstock/builds/272058545?serverType=git). I'll close. Thanks for keeping an eye on this @mfansler!

@jakirkham reopened this on Aug 27, 2024
@jakirkham
Member

Glad this is improving! 🥳

That said, did just see a new instance of this

So doesn't seem like this is fully resolved yet

@h-vetinari
Member

Looking at the travis dashboard, this still seems to be happening to ~50% of PPC jobs (which just get cancelled).

@h-vetinari
Member

At least that aspect can be cured by restarting the job though.

@mfansler
Member Author

Yeah, looks like it's back to the previous baseline with something like 10%-25% sporadic failure.

@jaimergp
Member

jaimergp commented Oct 2, 2024

A month has passed and this incident is still open with no foreseeable solution. Are we still observing the 10-25% sporadic failure rate? If that's the case, is it worth studying the feasibility of disabling that platform on Travis CI and letting people cross-compile or emulate?

@jakirkham
Member

Was about to ask the same thing. Did see a build here. It stalled out in the midst of the build, which is a different issue than this one, but it is an issue that we have seen with Travis CI before

@mfansler
Member Author

mfansler commented Oct 2, 2024

The R migration was bottlenecked on xorg-* migrations for a few weeks (now resolved), so there hasn't been a steady stream of run data for me to estimate recent Travis CI failure rates. However, I'm fine with having this closed, as the acute issue that prompted this report appears to be over. Should the broader discussion of dropping Travis CI be moved to a dedicated Issue? (where?)

@mfansler changed the title from "Travis CI linux-ppc64le jobs failing" to "Travis CI linux-aarch64, linux-ppc64le jobs failing" on Oct 28, 2024
@mfansler
Member Author

Having more data these last few weeks, I still see high failure rates (~10%-25%). I've updated the OP notification to reflect that more specifically - rather than the acute ppc64le issue - and added notice to consider moving to Azure.

@jaimergp
Member

jaimergp commented Jan 3, 2025

@martin-g

martin-g commented Jan 3, 2025

Maybe it is just me but TravisCI almost always fails for the feedstocks I care about ...
Usually I use cross-compilation, but a few times it did not work because the software uses configure tests/checks: the code snippet is compiled for the target_platform but then executed on the build_platform, and the error is something like:

checking for gethostname in -lnsl... no
checking for gethostbyname in -lnsl... no
checking for accept in -lsocket... no
configure: error: cannot run test program while cross compiling
See `config.log' for more details.
Traceback (most recent call last):

Full logs: https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=1131643&view=logs&j=d6b58996-039f-5e48-56bf-c3a016e5cd7f&t=dd6a23d9-2356-5264-64bf-579875a5d8d6
conda-forge/ossuuid-feedstock#9 - the OSX ARM64 build. The aarch64/ppc64le builds fail the same way, so I dropped cross-compilation for them, but now TravisCI is not available at all:

@mfansler
Member Author

mfansler commented Jan 3, 2025

@martin-g note that Linux ARM builds always have the option of emulation on Azure - it's just much slower (e.g., ~5X longer last I checked), but it typically gets around having to modify things. #185 (comment)
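
For reference, the emulation setting for Linux ARM would presumably mirror the ppc64le one quoted earlier in this thread. A minimal sketch, assuming the `provider` key accepts `linux_aarch64` in the same way:

```yaml
# conda-forge.yml (sketch): build linux_aarch64 under emulation on Azure
provider:
  linux_aarch64: azure
```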

I think the default macos-latest runners on Azure are macos-14 running on Apple Silicon - we're just not using them yet. Not sure where the discussion stands on rolling that out here. Possibly mixed up with the question of when CF stops supporting osx-64.

@martin-g

martin-g commented Jan 3, 2025

Thanks! I was not aware of the emulation possibility!
I will try it for conda-forge/ossuuid-feedstock#9 ! Slow is still better than nothing!

@theAeon

theAeon commented Jan 4, 2025

cross-referencing this with conda-forge/conda-forge-pinning-feedstock#6595
