
Multiple lines in Parameter descriptions #7

Open
SvenMantowsky opened this issue Mar 11, 2024 · 1 comment

Comments

@SvenMantowsky

First of all, hats off to you, jwlodek. This is nearly the only package I found that works almost perfectly.
I'm not sure whether this package is still maintained, but I thought I'd give it a shot:

I converted several files containing numpy-style docstrings and discovered a problem: parameter descriptions that span multiple lines get split into several separate rows in the generated markdown.

Example doc:

```
Parameters
----------
model_scores_cali: np.ndarray[float]
    2D-Array containing model outputs in form of a specific score (e.g. softmax).
    The rows correspond to different data-points and the columns correspond to the
    classes of the classification task. Note that the calibration data should not
    have been used for model training.
cali_label: np.ndarray[str | int]
    Contains integer or string ground-truth labels. The i-th entry corresponds to
    the i-th row of model_scores_cali.
model_scores_val: np.ndarray[float]
    Contains 'validation' data with same structure as model_scores_cali. Prediction
    sets will be formed for this data.
```

Result:

Parameters

| Parameter | Type | Doc |
|-----------|------|-----|
| model_scores_cali | np.ndarray[float] | 2D-Array containing model outputs in form of a specific score (e.g. softmax). |
| Unknown | The rows correspond to different data-points and the columns correspond to the | classes of the classification task. Note that the calibration data should not |
| Unknown | have been used for model training. | cali_label: np.ndarray[str \| int] |
| Unknown | Contains integer or string ground-truth labels. The i-th entry corresponds to | the i-th row of model_scores_cali. |
| model_scores_val | np.ndarray[float] | Contains 'validation' data with same structure as model_scores_cali. Prediction |
| Unknown | sets will be formed for this data. | val_label: None |
| Unknown | If the ground-truth labels of model_scores_val are known, they can be used as input here in order to compute the empirical coverage of correct predictions. | |

If you have an easy fix for this, I would appreciate it very much. But if you no longer maintain this repo, maybe you can point me in the direction of where to look, so I can save some time and create a PR that you could approve.

THX and have a nice week.


jwlodek commented Mar 13, 2024

Hi, glad you are finding this useful! I basically couldn't find a tool for doing this other than ones that generate full Sphinx docs, which is complete overkill for smaller projects; that's why I wrote this script over a weekend.

It's been a while since I worked on this, but I think the issue is that I basically assumed single-line descriptions for parameters.

The code that actually parses each docstring into the data structure that npdoc2md uses internally is here:

npdoc2md/npdoc2md.py

Lines 339 to 380 in 7138063

```python
def add_docstring_to_instance(instance: ItemInstance, doc_string: List[str]) -> None:
    """Function that parses docstring to data structures and adds to instance

    Parameters
    ----------
    instance : ItemInstance
        current instance
    doc_string : list of str
        Current instance's docstring as list of lines
    """
    current_descriptor = None
    i = 0
    while i < len(doc_string):
        left_stripped = doc_string[i].lstrip()
        stripped = doc_string[i].strip()
        if i == 0:
            instance.set_simple_description(stripped.replace('"""', ''))
        elif stripped not in docstring_descriptors.keys() and current_descriptor is None:
            instance.add_to_detailed_description(left_stripped)
        elif stripped in docstring_descriptors.keys():
            current_descriptor = stripped
        elif current_descriptor is not None and not stripped.startswith('---') and len(stripped) > 0:
            descriptor_elem = []
            if len(docstring_descriptors[current_descriptor]) == 3:
                name_type = stripped.split(':')
                if len(name_type) == 1:
                    name_type.insert(0, 'Unknown')
                descriptor_elem = descriptor_elem + name_type
            else:
                descriptor_elem.append(stripped.split('(')[0])
            i = i + 1
            try:
                descriptor_elem.append(doc_string[i].strip())
            except IndexError:
                descriptor_elem.append('Unknown')
            if current_descriptor not in instance.descriptors.keys():
                instance.descriptors[current_descriptor] = [descriptor_elem]
            else:
                instance.descriptors[current_descriptor].append(descriptor_elem)
        i = i + 1
```

It essentially takes as input the docstring as a list of lines, and the reference to the internal ItemInstance object that represents this docstring.

Then, it loops over all the lines in the docstring and, based on what it expects to see, parses them into the format that ItemInstance expects. So, for example, when it sees one of the numpy descriptors (i.e. Parameters, Returns, etc.), it adds a descriptor to the ItemInstance, and then for each subsequent line after the `---*` underline it reads a name/type combo from one line, followed by the description from the next.
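The `Unknown` rows in the table above follow directly from that logic: a continuation line of a multi-line description contains no colon, so `split(':')` returns a single element and the parser mistakes the line for a new parameter with an unknown name. A minimal standalone reproduction (not part of npdoc2md itself):

```python
# A continuation line from the example docstring: it has no "name: type"
# colon, so the parser invents an 'Unknown' parameter name for it.
line = "The rows correspond to different data-points and the columns correspond to the"

name_type = line.split(':')
if len(name_type) == 1:
    name_type.insert(0, 'Unknown')

print(name_type[0])  # -> Unknown
```

Every continuation line therefore becomes its own bogus table row, which is exactly the pattern in the reported output.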

I think what we'd need to do is figure out a different way to determine whether there is a new entry in the docstring (maybe by using the indentation level?) and, until we see a new entry, keep appending to the current description.
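A sketch of that indentation-based idea (a hypothetical helper, not part of the current npdoc2md code): treat any line at the section's base indentation as a new `name: type` header, and append every more-indented line to the current entry's description.

```python
from typing import Dict, List


def split_parameter_entries(lines: List[str]) -> Dict[str, List[str]]:
    """Group the body of a numpydoc Parameters section into one entry per
    parameter, keeping multi-line descriptions together.

    Hypothetical sketch: assumes headers sit at the section's minimum
    indentation and description lines are indented deeper.
    """
    entries: Dict[str, List[str]] = {}
    non_empty = [l for l in lines if l.strip()]
    base_indent = min(len(l) - len(l.lstrip()) for l in non_empty)
    current = None
    for line in non_empty:
        indent = len(line) - len(line.lstrip())
        if indent == base_indent:
            # Base-indented line: a new "name: type" header starts an entry.
            current = line.strip()
            entries[current] = []
        elif current is not None:
            # Deeper indentation: continuation of the current description.
            entries[current].append(line.strip())
    return entries


body = [
    "model_scores_cali: np.ndarray[float]",
    "    2D-Array containing model outputs in form of a specific score (e.g. softmax).",
    "    The rows correspond to different data-points and the columns correspond to the",
    "    classes of the classification task. Note that the calibration data should not",
    "    have been used for model training.",
    "cali_label: np.ndarray[str | int]",
    "    Contains integer or string ground-truth labels. The i-th entry corresponds to",
    "    the i-th row of model_scores_cali.",
]
entries = split_parameter_entries(body)
print(len(entries))  # -> 2
```

Joining each entry's description lines with a space would then yield a single Doc cell per parameter in the generated markdown table.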

I can take a look at this maybe over the weekend, or if you'd like to have a crack at it, I'd be happy with that.

There are probably some overall improvements that could be made here as well. I basically wrote this in one sitting and got it to a point where it works, but I never went back to clean things up and make it more "proper" or readable.
