Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add extra safeguards for basic auth #765

Open
goeffthomas opened this issue Nov 20, 2024 · 0 comments
Open

Add extra safeguards for basic auth #765

goeffthomas opened this issue Nov 20, 2024 · 0 comments

Comments

@goeffthomas
Copy link
Contributor

As part of #245, basic auth support was added for the fetching of raw data files. During a review of using mlcroissant for data loading functionality, we realized that there's a data exfiltration vulnerability. The exploit could look something like this:

  • Bad actor hosts a properly formatted Croissant JSON-LD. Maybe from a reputable-seeming location like a public GCS/S3 URL.
  • The distribution property in the JSON-LD contains contentUrl values that point to a server under their control.
  • Bad actor publishes an example notebook of how to use Croissant to load data. It would:
    • Describe how to set the CROISSANT_BASIC_AUTH_* values to access a private dataset on kaggle.com, where victims would provide their own username and API key.
    • Use some means to load the private Kaggle data--either via the load script or by doing something like mlc.Dataset(private_kaggle_dataset_url).records(record_set=name_of_tabular_file)--which would use the Kaggle credentials in the requests for data. Note that this step isn't necessary for exfiltration, but establishes a reason for why the victim's credentials would be needed in the first place.
    • Without clearing environment variables, then load data from the from the bad actor's Croissant. Because of the presence of the same values in the environment variables, the victim's credentials would be sent to the server under the attacker's control. The attacker could store those credentials and use them at a later time. If the attacker's server returns data that allows the notebook to complete as expected, this could go undetected for a while, allowing the attacker to amass credentials for several victims.
    • Kaggle is the example here, but this could be the case for other data providers that support basic auth as a means for retrieving data behind ACLs.

Solutions:

  • We could introduce a 3rd new environment variable to serve as a safeguard. Something like CROISSANT_BASIC_AUTH_HOST could be used to ensure that requests for data are only permitted to use the CROISSANT_BASIC_AUTH_(USERNAME|PASSWORD) for URLs that match on that domain.
    • This is maybe preferred since we already have environment-based authentication and the implementation is simple
  • Authentication values could be accepted at the point of loading data so that each instantiation/invocation supplies the values rather than reading from the environment.
    • Aside from being more complicated, this approach is also slightly riskier: it would require changing more of the code (higher chance of regression) and because of the top-down approach of passing in values, there's a higher risk of missing a path/branch.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants
@goeffthomas and others