Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Features related to API and Suggestion #124

Open
Mattral opened this issue Jan 12, 2025 · 1 comment
Open

Features related to API and Suggestion #124

Mattral opened this issue Jan 12, 2025 · 1 comment
Labels
discussion Community talk on a given topic

Comments

@Mattral
Copy link

Mattral commented Jan 12, 2025

Feature Request: API Enhancements for GitHub Content Integration

Overview

May I propose the addition of several new features to the API that will improve its usability and flexibility when working with GitHub content. These features aim to enhance how users can access, process, and fine-tune their data for use with Large Language Models (LLMs). Below is a summary of the proposed features and corresponding example endpoints.

Features

1. OAuth Integration

Allow users to authenticate via GitHub OAuth, enabling the app to access private repositories or perform actions on behalf of users securely.

  • Expected Behavior: Users should be able to authenticate using GitHub OAuth tokens, allowing the API to access repositories (public and private) and interact with them based on the permissions granted by the user.

  • Example Flow:

    • User authenticates via OAuth.
    • Access to specific repositories or actions on behalf of the user is granted.

2. Specific File Retrieval

Provide functionality for users to retrieve specific files within a repository (e.g., markdown, code, or README files). Users should be able to specify the path to the desired file.

  • Example Endpoint:
    GET /repositories/{owner}/{repo}/files/{path}
    This endpoint allows users to retrieve a specific file by its path within the repository.

  • Expected Behavior:

    • Users can request specific files by providing the file path (e.g., /docs/README.md).
    • The returned file can be in its original format (markdown, text, etc.).

3. Branch or Commit Specific Retrieval

Allow users to specify branches or commit SHAs to extract content from, enabling precise data extraction based on the version or state of the repository.

  • Example Endpoint:
    GET /repositories/{owner}/{repo}/commits/{sha}/files

  • Expected Behavior:

    • Users can specify a commit SHA or branch to retrieve content from a particular point in the repository's history.
    • This provides version-specific data retrieval for more granular content extraction.

4. Data Transformation for LLM Consumption

Convert GitHub content into a format suitable for LLM training (e.g., structured data like JSON, plain text, etc.). Users should be able to specify the format that best fits their needs.

  • Example Endpoint:
    POST /transform
    This endpoint should allow users to submit GitHub content and specify their desired output format for LLM use.

  • Expected Behavior:

    • The API processes the content and returns it in a format suitable for further processing (such as plain text, JSON, or other structured formats).

5. Language Detection for Multi-lingual Content

Automatically detect the language of the content to facilitate easier processing and fine-tuning of LLM models based on language-specific training data.

  • Expected Behavior:
    • When content is retrieved or transformed, the API will automatically detect the language and return this information as metadata.
    • This will help users tailor their LLM models by providing the language context of the repository content.

6. Version History and Comparisons

Version Control for Transformed Data

Track the version history of transformed data, allowing users to request previous versions of the transformed content.

  • Example Endpoint:
    GET /repositories/{owner}/{repo}/transformations/{version}
    This will allow users to retrieve specific versions of the transformed content.

Compare Different Versions

Enable users to compare transformed data between different versions, commits, or branches.

  • Example Endpoint:
    GET /repositories/{owner}/{repo}/compare/{sha1}/{sha2}
    This endpoint will return the differences between two versions of transformed data, based on commits or branches.

  • Expected Behavior:

    • This will help users see how the transformed content has evolved across different versions, aiding in quality control and model training consistency.

7. LLM Fine-tuning Integration

Provide users the option to export their transformed GitHub content in a format suitable for fine-tuning an LLM (e.g., OpenAI's fine-tuning format).

  • Example Endpoint:
    POST /repositories/{owner}/{repo}/export/finetuning
    This will allow users to export the transformed data for fine-tuning purposes.

  • Expected Behavior:

    • The transformed content will be formatted in a way that is compatible with LLM fine-tuning, enabling users to train or adapt models using their own GitHub data.

Summary

These enhancements will significantly expand the capabilities of the API, providing users with more control over their GitHub data, enabling version control, multi-lingual content processing, and direct integration with LLM fine-tuning workflows.

I look forward to community feedback and contributions to improve the functionality of this API. Please feel free to contribute to this discussion or open issues related to specific features.

Next Steps

  • Discuss and approve feature requests.
  • Begin implementing OAuth authentication and content retrieval endpoints.
  • Ensure compatibility with LLM fine-tuning pipelines and provide user documentation.

Thanks for considering these improvements!

@cyclotruc cyclotruc added the discussion Community talk on a given topic label Jan 13, 2025
@cyclotruc
Copy link
Owner

thank you for taking the time to make this!

here are the features that are already implemented or planned in the roadmap:
1 OAuth Integration (not ready yet)
2. Specific File Retrieval (should be working via URL)
3. Branch or Commit Specific Retrieval (Should be working via URL)
4. Data Transformation for LLM Consumption (Multiple output format are planned, like json, xml or plain text)
6. Version History and Comparisons (Planned)

  1. Language Detection for Multi-lingual Content:
    Maybe if we find a non-AI way to detect language this could be doable, but anything to heavy might be overkill compared to the benefit or this information

  2. LLM Fine-tuning Integration
    I'm not sure I have the right context on this one, could you share an example of the format you're talking about?

thanks a lot

@filipchristiansen filipchristiansen changed the title Features related to API and Suggestion, Features related to API and Suggestion Jan 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Community talk on a given topic
Projects
None yet
Development

No branches or pull requests

2 participants