Travis is dead #1521

Open
h-vetinari opened this issue Oct 12, 2021 · 20 comments

Comments

@h-vetinari
Member

Aside from the previous instabilities and random timeouts, it now seems to fail even trivial (& fundamental) tasks like downloading binaries consistently (4x in a row over the course of 24h in conda-forge/openblas-feedstock#122, each time across all 4 jobs in the build)

Downloading https://github.com/xianyi/OpenBLAS/archive/v0.3.18.tar.gz
Error: HTTP 000 CONNECTION FAILED for url <https://github.com/xianyi/OpenBLAS/archive/v0.3.18.tar.gz>

I don't know when this will resolve itself, but just thought I'd open an issue to discuss the situation (e.g. move all PPC builds to azure?).

@jakirkham
Member

There was at least one recent change for Travis ( conda-forge/conda-forge-ci-setup-feedstock#167 ). Maybe more are needed? Honestly would bring this up on the gitter channel

@jakirkham
Member

Hmm...it seems that there are different builds that did work recently. Is it possible that something else is going on like that link is bad or Travis has trouble with GitHub URLs specifically or something is wonky around redirects?
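One way to narrow down the questions above (bad link, trouble with GitHub URLs, or something around redirects) would be to walk the redirect chain one hop at a time and record where it stops. The following is only a sketch: the helper names `probe` and `trace_redirects` are made up for illustration and are not part of conda, conda-build, or the Travis tooling.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def probe(url, timeout=10.0):
    """Send one HEAD request; return (status, location).

    status is None when no HTTP response arrived at all.
    """
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, None
    except urllib.error.HTTPError as err:
        # Unfollowed redirects and other HTTP errors end up here.
        return err.code, err.headers.get("Location")
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None, None


def trace_redirects(url, probe_fn=probe, max_hops=5):
    """Follow redirects manually and return a list of (url, status) hops."""
    hops = []
    for _ in range(max_hops):
        status, location = probe_fn(url)
        hops.append((url, status))
        if status is None or location is None:
            break
        url = urljoin(url, location)
    return hops
```

Running `trace_redirects` on the failing tarball URL from inside and outside the build container would show whether the first hop already gets no response (status `None`) or whether the failure only appears after a redirect.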

@h-vetinari
Member Author

To be honest, after a year of flakiness, I have very limited patience left for debugging Travis... For the feedstocks I maintain more or less by myself, I'll see if emulation through azure works, otherwise I'll start dropping PPC.

@h-vetinari
Member Author

h-vetinari commented Oct 13, 2021

Especially given the minuscule fraction that PPC builds represent of the whole ecosystem (taking scipy as a more or less representative example, it was less than 1 in 1000 recently).

@h-vetinari
Member Author

I'm wondering if github is blocking travis from downloading sources. I can't believe that this is so consistently failing (all jobs I've seen recently, and every run therein) just due to flaky networks.

@jakirkham
Member

jakirkham commented Oct 15, 2021

Yeah this is what I'm wondering as well. It could be some GH rate limiting for example

@carlodri
Contributor

@mbargull
Member

For the rust feedstock @pkgw found ( conda-forge/rust-feedstock@0b3add7 , conda-forge/rust-feedstock#94 (comment) ) that --network=host for the build container on Travis seems to resolve the issue.
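As a rough illustration of the idea, a build wrapper could add the flag only when it detects that it is running on Travis. This is a sketch under assumptions, not the change made in the rust feedstock or in conda-smithy: the function name `docker_run_args` and the image name are hypothetical, and detection relies on the `TRAVIS=true` environment variable that Travis sets for its builds.

```python
import os


def docker_run_args(image, env=None):
    """Build a `docker run` argument list, using host networking only on Travis."""
    env = os.environ if env is None else env
    args = ["docker", "run", "--rm"]
    # Assumption: TRAVIS=true marks a Travis build; anywhere else
    # (including local builds) the default container network is kept.
    if env.get("TRAVIS") == "true":
        args.append("--network=host")
    args.append(image)
    return args


if __name__ == "__main__":
    # "example/build-image:latest" is a placeholder, not a real conda-forge image.
    print(" ".join(docker_run_args("example/build-image:latest")))
```

Keeping the condition in one place makes it easy to find and remove the workaround once it is no longer needed.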

pkgw added a commit to pkgw/conda-smithy that referenced this issue Oct 22, 2021
…flags

In at least some cases, it looks like this helps with various network
connection timeouts and failures, as per
conda-forge/conda-forge.github.io#1521 and
conda-forge/rust-feedstock#94.
@jakirkham
Member

Marcel added a PR ( conda-forge/conda-smithy#1520 ) to conda-smithy, which includes a fix similar to the one used for Rust

@h-vetinari
Member Author

> ... that --network=host for the build container on Travis seems to resolve the issue.

That is a really big hammer and has substantial security implications. It's absolutely not recommended by any security-minded person in the container space (notwithstanding the internet being full of blog posts that "just want to get it working"). See e.g. point 19 here

@jakirkham
Member

jakirkham commented Oct 22, 2021

We are open to better suggestions 🙂

Edit: More discussion on this starting here ( conda-forge/conda-smithy#1520 (comment) )

@h-vetinari
Member Author

> Edit: More discussion on this starting here ( conda-forge/conda-smithy#1520 (comment) )

I prefer to answer here, because the PR is closed already, and it fits this discussion better.

> @jakirkham: As noted in the other issue, we are open to better suggestions 🙂
>
> That said, I would be curious to know what specifically we are concerned about. Could you please outline a scenario where something specific happens that we would like to protect against?
>
> For example, if the concern is that people can break out of the container into the Travis VM, I don't think we are that worried about that kind of issue (it is, after all, still on a VM on Travis). We don't pass this flag when building locally, so there's no risk of that there.

Yeah, I realise that there's a nesting of VMs/containers happening that mitigates the amount of damage that can realistically be done, but just in and of itself --net=host is a red flag. Now that it's merged (and "working"), the most likely outcome is that it'll be forgotten, and even if other constraints currently preclude further problems, those constraints can evolve over time. It might never become a problem, but it's a hack with huge tech & security debt, and no prospect of it being fixed.

> We are open to better suggestions 🙂

Travis should fix whatever regression on their side caused things that worked before (without --net=host) to stop working. Barring that (because Travis as a company really looks like it's dying), conda-forge should remove Travis as a CI option and move all PPC builds to azure.

@h-vetinari
Member Author

(another comment from the other thread)

> @pkgw: Yeah. Because this option is only kicking in on the Travis builds which are already occurring inside the containment provided by Travis, I struggle to see how setting this opens us up to any new risks. Rather than breaking out of our container, I think the worry would be someone else influencing the network settings to affect what happens inside the container, but we already run that risk, since the first thing we do is pull down the Docker image that guides the whole build process. (Or to put it another way, if that was a workable attack vector, literally everyone using Travis would be at risk.)

Funny you should say that ("literally everyone using Travis would be at risk.")... I usually don't rag on other people's projects, but travis has been shit-tier garbage in recent times, and their response to catastrophic security issues (forcing conda-forge/core to rotate secrets on all affected feedstocks) was so spectacularly bad that they should really be dropped like a hot potato.

@jakirkham
Member

> We are open to better suggestions 🙂
>
> Travis should fix whatever regression on their side caused things that worked before (without --net=host) to stop working. Barring that (because Travis as a company really looks like it's dying), conda-forge should remove Travis as a CI option and move all PPC builds to azure.

Great, I've tried to summarize that in this comment. Please feel free to add anything else there.

@pkgw
Contributor

pkgw commented Oct 22, 2021

Fully agree that (1) we should move away from Travis swiftly and (2) Docker net=host is to be treated with great care at a minimum and avoided when possible. I do, however, continue to believe that in this particular application it is not opening up any new risks.

@mbargull
Member

Due to the cloud.drone.io breakage, Travis is now also used for linux-aarch64 if one uses a native build platform in conda-forge.yml. (That is just until we have a usable alternative service running for ARM builds again.)

Now, apparently for that platform there is some issue (misconfiguration?) with seccomp, which leads to non-working sudo in the containers (used for sudo yum install for yum_requirements.txt) 😒. We encountered this on conda-forge/python-feedstock#523 and conda-forge/python-feedstock@0f13bea shows that with seccomp=unconfined that issue is "resolved".
Meaning, one could dig into/fix the seccomp profile on Travis' ARM workers to add a proper fix.
But the python-feedstock build was killed after a few minutes either way because "The job exceeded the maximum log length, and has been terminated." (i.e., conda-build's $PREFIX/$BUILD_PREFIX/$SRC_DIR replacement would need some fixes or be more aggressive -- running on stderr too and supporting outputs/script too). So we'd have to turn multiple knobs for the (hopefully) temporary usage of Travis for ARM builds.
My recommendation for now is to use the emulated builds on Azure wherever possible.
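To make the log-length point concrete, here is a minimal sketch of the kind of path shortening described above. It is not conda-build's implementation: the function name `shorten_paths` and the example paths are invented for illustration, and a real filter would also have to be wired into both stdout and stderr.

```python
def shorten_paths(line, replacements):
    """Replace long absolute paths in a log line with short placeholders."""
    # Longer paths go first so that a directory nested inside another one
    # (e.g. a build prefix below the source directory) gets its own placeholder.
    ordered = sorted(replacements.items(), key=lambda kv: len(kv[0]), reverse=True)
    for path, placeholder in ordered:
        if path:  # an empty key would match everywhere, so skip it
            line = line.replace(path, placeholder)
    return line


if __name__ == "__main__":
    # Hypothetical paths, chosen only to show the nesting case.
    paths = {
        "/home/conda/work": "$SRC_DIR",
        "/home/conda/work/_build_env": "$BUILD_PREFIX",
    }
    print(shorten_paths("cc -I/home/conda/work/_build_env/include /home/conda/work/a.c", paths))
```

Applying the same function to every line of both output streams is what "more aggressive" replacement would amount to in this sketch.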

@h-vetinari
Member Author

h-vetinari commented Jan 8, 2025

I've seen various feedstocks where jobs on travis silently aren't being started at all anymore, giving the impression the CI is green while a chunk of the matrix was never built. This is arguably even worse than failing jobs (conda-forge/status#185 didn't get fixed for months either); cf. also conda-forge/conda-forge-pinning-feedstock#6595.

All in all:

Image
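The "green but never built" problem described in this comment comes from treating a missing status as if it were a pass. A small sketch of the stricter check, with hypothetical function names and illustrative job names (this is not how conda-forge's status tooling works):

```python
def unstarted_jobs(expected, reported):
    """Return expected job names that have no reported status, in order."""
    seen = set(reported)
    return [name for name in expected if name not in seen]


def matrix_is_green(expected, reported):
    """Green only if every expected job reported success.

    A job that never started has no entry in `reported` and counts as a failure.
    """
    return all(reported.get(name) == "success" for name in expected)


if __name__ == "__main__":
    expected = ["linux_64", "linux_aarch64", "linux_ppc64le"]
    reported = {"linux_64": "success", "linux_aarch64": "success"}
    print(unstarted_jobs(expected, reported))
    print(matrix_is_green(expected, reported))
```

The key design choice is that the list of expected jobs comes from the build matrix itself rather than from whatever the CI provider happens to report.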

@beckermr
Member

beckermr commented Jan 8, 2025

Yeah I'm in agreement here. Travis seems to have almost completely rotted away. I'm sick of looking at the status issue. We should remove it totally and push any jobs that cannot cross-compile into emulation, possibly on the long cpu queue on the gpu server if they need more than six hours.

6 participants