We should be able to convert between different ontological code vocabularies. #204

Open
mmcdermott opened this issue Oct 11, 2024 · 4 comments · May be fixed by #206

@mmcdermott
Owner

mmcdermott commented Oct 11, 2024

Target: We should be able to take a MEDS dataset (with parent-code entries in the metadata) and run a script that maps codes from one OMOP vocabulary (e.g., ICD9) to another (e.g., ICD10) using standardized vocabulary mapping tables (e.g., the OHDSI vocabulary concept relationship tables).

This will entail two steps (how to split them into actual stages is yet to be determined):

  1. Take the codes.parquet metadata file and use the vocabulary relationships to remap codes from the parent code space into the target output space. Store the original code string and the updated code string in some pre-set format, omitting codes that the vocabulary conversion does not cover.
  2. Use the updated codes.parquet, with its original and new code columns, to perform a one-to-many mapping from the original shards to shards in which the codes have been remapped, again omitting codes that the vocabulary conversion does not cover.
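
The two steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the proposed implementation: the column names `code` and `parent_codes`, the function names `build_code_map` and `remap_events`, and all codes and relationship pairs in the example data are hypothetical placeholders. A real stage would presumably operate on parquet shards rather than lists of dicts.

```python
from collections import defaultdict


def build_code_map(codes_metadata, relationships):
    """Step 1: map each original code to its target-vocabulary codes."""
    # Index the relationship pairs by source code; one source may have many targets.
    targets = defaultdict(list)
    for source, target in relationships:
        targets[source].append(target)

    code_map = {}
    for row in codes_metadata:
        new_codes = []
        # Look up every parent code and collect the distinct targets, in order.
        for parent in row["parent_codes"]:
            for target in targets.get(parent, []):
                if target not in new_codes:
                    new_codes.append(target)
        if new_codes:  # codes with no mapping are omitted
            code_map[row["code"]] = new_codes
    return code_map


def remap_events(events, code_map):
    """Step 2: one-to-many remap of event rows; unmapped codes are dropped."""
    return [
        {**event, "code": new_code}
        for event in events
        for new_code in code_map.get(event["code"], [])
    ]


# Placeholder data: these codes and pairs are invented, not real mappings.
codes_metadata = [
    {"code": "DX//a", "parent_codes": ["SRC/1"]},
    {"code": "DX//b", "parent_codes": ["SRC/2"]},
    {"code": "LAB//c", "parent_codes": []},
]
relationships = [("SRC/1", "TGT/X"), ("SRC/2", "TGT/Y"), ("SRC/2", "TGT/Z")]
events = [
    {"subject_id": 1, "code": "DX//a"},
    {"subject_id": 1, "code": "DX//b"},
    {"subject_id": 2, "code": "LAB//c"},
]

code_map = build_code_map(codes_metadata, relationships)
remapped = remap_events(events, code_map)
```

Here the event with `DX//b` fans out into two rows because its parent code has two targets, and the `LAB//c` event is dropped because it has no mapping.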
@mmcdermott
Owner Author

mmcdermott commented Oct 11, 2024

The most similar existing stage is vocabulary ID creation and assignment (tokenization), which follows a similar two-step structure:

  1. Map each code string to an integer vocab ID (all in codes.parquet): https://github.com/mmcdermott/MEDS_transforms/blob/main/src/MEDS_transforms/fit_vocabulary_indices.py
  2. (alongside other normalization steps) join codes to vocab IDs and convert via the metadata file: https://github.com/mmcdermott/MEDS_transforms/blob/main/src/MEDS_transforms/transforms/normalization.py
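
For comparison, the fit-then-join pattern described above can be sketched as follows. This is not the code in the linked files; it only illustrates the shape of the two steps. The function names, the lexicographic ordering, and the choice to start IDs at 1 are all assumptions made for the example.

```python
def fit_vocab_ids(codes):
    """Step 1: assign an integer ID to each distinct code string."""
    # Assumption: IDs follow lexicographic order and start at 1.
    return {code: idx for idx, code in enumerate(sorted(set(codes)), start=1)}


def apply_vocab_ids(events, vocab):
    """Step 2: join events to the vocabulary and swap code strings for IDs."""
    # Events whose code is missing from the vocabulary are dropped here.
    return [
        {**event, "code": vocab[event["code"]]}
        for event in events
        if event["code"] in vocab
    ]


vocab = fit_vocab_ids(["B", "A", "B"])
tokenized = apply_vocab_ids(
    [{"subject_id": 1, "code": "B"}, {"subject_id": 1, "code": "C"}], vocab
)
```

The code remapping stage would follow the same shape, except that the metadata lookup returns zero or more code strings per input code instead of a single integer.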

@mmcdermott
Owner Author

Open question: How should we download, store, and access the OHDSI vocabulary remapping tables?

prenc self-assigned this Oct 17, 2024
prenc linked a pull request Oct 17, 2024 that will close this issue
@prenc
Collaborator

prenc commented Oct 17, 2024

I wonder whether the pipeline's first step should account for the possibility that codes.parquet already contains code/vocab_index. If we translate many-to-many, we will most likely introduce new codes into the vocabulary, each of which will require a new vocabulary index. Is it already assumed that vocabulary fitting will occur later in the pipeline?

Also, @mmcdermott, could you look at the file structure of what I have added so far and confirm whether it complies with the framework?

@mmcdermott
Owner Author

Thanks for the nudge, @prenc; I will try to take a look later today! Also, yes, we should assume that vocabulary fitting will occur later in the pipeline, so we do not need to worry about it at this stage.
