Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process comments (and extra nodes) separately from the rest of the code #500

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

nbacquey
Copy link
Member

@nbacquey nbacquey commented Jun 7, 2023

This PR is an attempt at processing the comments separately from the rest of the code, so that formatting queries can be both resilient to the presence of random comments, and simple to write.

Identifying comments

This is a challenge in itself:

  • We can't use the same mechanism we use for leaf identification, i.e. a formatting directive in the query file that would look like (block_comment) @comment. Indeed, it would mean either running the query twice, or accounting for the presence of comments in the rest of the query file, both situations we want to avoid.
  • The current implementation selects comment nodes from the CST with the following heuristics:
    node.is_extra() && node.kind().to_string().contains("comment"). I think it should capture all types of comments for all supported languages, but I haven't checked that yet.
  • Another option would be to have a separate query file for comments, one for each language, which would contain selecting and formatting directives for comments. It could also contain instructions on how to re-insert comments back in the code, once it's formatted.

Anchoring comments

Each comment should be anchored to a non-comment node of the code. Ideally, it should be the node it's supposed to comment, but such semantics can't be deduced from the CST alone. This PR uses the following heuristics instead:

  • If the comment node is alone on its line, anchor it to its next sibling in the CST. If it's the last of its siblings, anchor it to its previous sibling instead.
  • If the comment node isn't alone on its line, anchor it to its previous sibling in the CST. If it's the first of its siblings, anchor it to its next sibling instead.
  • I haven't thought at what it would mean for a comment node to be an only child in the CST, nor if that case can actually happen. In that case, it would probably be a safe bet to anchor it to its parent node.

Extracting comments

tree-sitter grammars and queries don't offer lots of tools to edit an existing CST. The only reasonable way I've found is to use input.replace_range() to remove the comment's bytes from the input, then tree.edit() to mark a node as edited in the CST, then finally reparse(old_tree, new_input, grammar) to get the new CST, without comments. The query file would then be applied to this new CST.

Re-inserting comments

This is a part I haven't had time to experiment with, but I think re-insertion should be done after processing all append/prepend directives, but before post-processing. I imagine something like this should work:

  • If the anchor was before the comment, re-insert with (anchor).(space).(comment) (line breaks should already have been taken care of).
  • If the anchor was after the comment, re-insert with (comment).(line break).(anchor).

State of the PR

  • Identifying comments
  • Anchoring comments
  • Extracting comments
  • Re-inserting comments
  • Add more documentation where needed
  • Update README

@314eter
Copy link

314eter commented Jun 8, 2023

In order to fix #430, this should be done for OCaml attributes too. They are not included with the current heuristic.

Attributes can appear almost everywhere in OCaml. To avoid overly complicating the grammar, tree-sitter-ocaml really allows them everywhere, even in some invalid places. If you replace the comments in the example of #489 with attributes, the idempotency rule will still be violated (the input will be parseable by tree-sitter, but the result will not be). But since the code was invalid OCaml to begin with, that's not a huge problem.

@ErinvanderVeen
Copy link
Collaborator

I think this heuristic can be avoided by using the languages.toml file.

@ErinvanderVeen ErinvanderVeen force-pushed the nb/engine/split_comment_stream branch from 48bfcc0 to f5da50b Compare August 22, 2023 13:20
@ErinvanderVeen ErinvanderVeen force-pushed the nb/engine/split_comment_stream branch from f5da50b to 936a925 Compare September 5, 2023 12:29
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch from 936a925 to 43a106e Compare November 7, 2024 17:21
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch from 43a106e to 74199c4 Compare November 20, 2024 15:32
Copy link
Member

@Xophmeister Xophmeister left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly follow this and it feels like it's on the right track. I did a commit-wise review, so some of my suggestions against earlier commits may no longer apply, or apply elsewhere.

I appreciate the modularisation of comments and types and the comments you added to the source to describe what's going on. This is quite a tricky operation, so those comments are super-encouraged to save our future selves' sanity 😅

topiary-core/src/tree_sitter.rs Outdated Show resolved Hide resolved
topiary-core/src/tree_sitter.rs Outdated Show resolved Hide resolved
topiary-core/src/comments.rs Show resolved Hide resolved
topiary-core/src/comments.rs Show resolved Hide resolved
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch 3 times, most recently from d566486 to edc5e08 Compare December 4, 2024 09:15
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch 3 times, most recently from 723f95f to 5fb3642 Compare December 11, 2024 11:17
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch 4 times, most recently from 0b8f39b to 765dda7 Compare December 17, 2024 17:08
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch 2 times, most recently from 2a0f34b to c20032b Compare December 19, 2024 11:01
If there were the following atoms in succession:
`[Space, Hardline, IndenStart]`
then the function would process them as
`[Empty, Hardline, IndentStart]`
instead of
`[Empty, IndentStart, Hardline]`
because the whitespace collation code would skip a step.
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch from c20032b to 0e1d3a1 Compare January 9, 2025 11:38
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch from a2990d4 to e2a01c9 Compare January 9, 2025 12:27
@nbacquey nbacquey changed the title WIP: Process comments (and extra nodes) separately from the rest of the code Process comments (and extra nodes) separately from the rest of the code Jan 9, 2025
@nbacquey nbacquey marked this pull request as ready for review January 9, 2025 21:46
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch 4 times, most recently from 019848b to 8d48d37 Compare January 13, 2025 10:26
Instead of using ad hoc logic to identify comment nodes, this commit
uses a separate query file.
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch from 8d48d37 to 7a31f23 Compare January 13, 2025 14:10
Copy link
Member

@Xophmeister Xophmeister left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is as awesome as it is epic! Amazing work 🙇

  • I've gone through this pretty thoroughly, but I must confess to not fully understanding everything. That said, you've named things very clearly, so I've used those names as a crux wherever I don't follow. I've added comments if/when this "method" breaks down a bit.

  • The PR description is very clear and I think -- if updated to reflect some of the changes since it was written -- it would make for good documentation in the README. (I've added some notes about the PR description below.)

  • With regard to line-embedded comments /* like this */ what's the reason why they can't be anchored to the node (or the sibling node) immediately before the comment?

    (Similarly, for "orphaned" comments: Can they just assume the formatting of the current context, appended with a hardline, or is that not a safe fallback?)

  • Meta: It's a real shame that Tree-sitter doesn't have a way, AFAIK, to introspect grammars. It would be really nice if one could just get a list of all "extra" nodes in any grammar, then you wouldn't need to bother with the language.comment.scm files... Perhaps worth an issue/feature request in the Tree-sitter repo 🤷

  • Meta: @Niols, would you be able to experiment with this branch against your OCaml codebase(s), paying attention to how it affects comment formatting? It would be really interesting to see how it plays with real world code.


Notes on the PR description, so it can be converted into documentation easily:

Identifying comments

  • It would be good to expand on why this isn't ideal, for the sake of the documentation:

    • We can't use the same mechanism we use for leaf identification, i.e. a formatting directive in the query file that would look like (block_comment) @comment. Indeed, it would mean either running the query twice, or accounting for the presence of comments in the rest of the query file, both situations we want to avoid.
  • Update this to match that you've gone with the second option; that is, comment query files:

    • The current implementation selects comment nodes from the CST with the following heuristics:
      node.is_extra() && node.kind().to_string().contains("comment"). I think it should capture all types of comments for all supported languages, but I haven't checked that yet.
    • Another option would be to have a separate query file for comments, one for each language, which would contain selecting and formatting directives for comments. It could also contain instructions on how to re-insert comments back in the code, once it's formatted.

Anchoring comments

  • Your heuristics make perfect sense to me and seem eminently sensible. Some ASCII-art diagrams for the first two cases would make this section ideal for the documentation.

    It would also be worth documenting the:

    • interleaved comment edge-case (modulo my question, above);
    • and the case that the OCaml sample exposes, when comments are put after their node.
  • Presumably, the second sentence in this heuristic isn't implemented:

    • If the comment node is alone on its line, anchor it to its next sibling in the CST. If it's the last of its siblings, anchor it to its previous sibling instead.

    That's why we see final child comments losing their indent level, for example.

  • It looks like you bail out in the only child case:

    • I haven't thought at what it would mean for a comment node to be an only child in the CST, nor if that case can actually happen. In that case, it would probably be a safe bet to anchor it to its parent node.
    $ cargo run -- fmt -lbash <<'SH'
    > (
    >   # foo
    > )
    > SH
    [2025-01-13T16:10:48Z ERROR topiary] Found an anchored comment, but could not attach it back to its anchor
        # foo
        The anchor was CommentedAfter { section: InputSection { start: Position { row: 2, column: 3 }, end: Position { row: 2, column: 3 } }, blank_line_after: false, blank_line_before: false }

    I agree with you that instances of only child comments are likely to be rare. Perhaps an option could be to just assume the formatting imposed by the parent as a fallback. Does that even make sense?

Re-inserting comments

  • I think this reads well; it just needs to be tweaked to change from the subjunctive to the indicative.

}

// Empty block inside of impl function
impl MyTrait for MyStruct {
fn foo() {
// ... logic ...
// ... logic ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not a case of an only child comment? How come this one doesn't bail out?

Comment on lines +1026 to +1027
; just doing _ above doesn't work, because it matches the final named node as
; well as the final non-named node, causing double indentation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the second sentence of this part of the heuristic happen?

If the comment node is alone on its line, anchor it to its next sibling in the CST. If it's the last of its siblings, anchor it to its previous sibling instead.

I haven't looked at the implementation yet, so perhaps that part is not done.

@@ -0,0 +1,2 @@
; Identify nodes for comment processing
(comment) @comment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this PR is designed specifically for comments, but I wonder if it can apply more generally to other extra nodes, like OCaml attributes? @314eter's comment suggests that the heuristic you've defined wouldn't be sufficient, so perhaps not...

Comment on lines 195 to +196
// The user specified a query file
Some(p) => p,
Some(p) => (p, None),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if --query is used, then the comment query is None? That feels a bit surprising and that ought to be clearly documented.

Alternatively, what do you think about: If the query file is p, then the comment query file is:

  • Some(q), where q := replace(p, '${language}.scm', '${language}.comment.scm'), if q exists.
  • None, otherwise.

Similar to what you do in topiary-config/src/language.rs... Too magical?

@@ -13,6 +13,9 @@ pub struct Language {
/// The Query Topiary will use to get the formating captures, must be
/// present. The topiary engine does not include any formatting queries.
pub query: TopiaryQuery,
/// The Query Topiary will use to determine which nodes are comments.
/// When missing, ther ewill be no separate comment processing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// When missing, ther ewill be no separate comment processing.
/// When missing, there will be no separate comment processing.

Comment on lines +43 to +46
/// Maps node IDs to comments before that node. Comments are stored in reading order.
comments_before: HashMap<usize, VecDeque<Comment>>,
/// Maps node IDs to comments after that node. Comments are stored in reading order.
comments_after: HashMap<usize, VecDeque<Comment>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All comments before/after, or just those immediately before/after?

Comment on lines +573 to +574
// REVIEW: the double borrow is sort of weird, is this idiomatic?
if node_section.contains(&(&comment).into()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't speak to whether it's idiomatic (@yannham?); however, it makes sense here:

  • comment is an AnchoredComment and you only implement From<&AnchoredComment> for InputSection; hence the first reference. You need to use comment later, so taking a reference makes sense otherwise you'll get a borrow violation.
  • InputSection::contains's argument expects a reference; hence the second. Here, the converted InputSection is only used once, so maybe one could implement From<&AnchoredComment> for &InputSection, but that would probably require careful lifetime annotations.

My opinion is that the double-& is probably a better option than an additional, possibly complex impl that's only called once.

Comment on lines +575 to +590
match comment.commented {
Commented::CommentedAfter { .. } => self
.comments_before
.entry(id)
.or_default()
.push_back((&comment).into()),
Commented::CommentedBefore(_) => self
.comments_after
.entry(id)
.or_default()
.push_back((&comment).into()),
}
} else {
comments.push(comment);
break;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All comments before/after, or just those immediately before/after?

I believe this code answers my question: It's the just comments immediately before/after...I think.

Comment on lines +934 to +1054
let comment_atom = Atom::Leaf {
content,
id: self.next_id(),
original_column,
single_line_no_indent: false,
multi_line_indent_all: true,
};
comments_queue_with_context.push_front(CommentWithContext {
comment: comment_atom,
blank_line_after,
blank_line_before,
});
}
atoms_with_comments_before.push_front(atom);
} else if Atom::Hardline == atom || Atom::Blankline == atom {
let mut blank_line_before_first_comment = false;
// Prepend the comments, each one followed by a newline
while let Some(CommentWithContext {
comment,
blank_line_after,
blank_line_before,
}) = comments_queue_with_context.pop_back()
{
if blank_line_after {
atoms_with_comments_before.push_front(Atom::Blankline)
} else {
atoms_with_comments_before.push_front(Atom::Hardline);
}
atoms_with_comments_before.push_front(comment);
blank_line_before_first_comment = blank_line_before;
}
if blank_line_before_first_comment && atom == Atom::Hardline {
atoms_with_comments_before.push_front(Atom::Blankline);
} else {
atoms_with_comments_before.push_front(atom);
}
} else {
atoms_with_comments_before.push_front(atom);
}
}
let mut blank_line_before_first_comment = false;
// If we still have comments left, add them at the beginning of the file
while let Some(CommentWithContext {
comment,
blank_line_after,
blank_line_before,
}) = comments_queue_with_context.pop_back()
{
if blank_line_after {
atoms_with_comments_before.push_front(Atom::Blankline)
} else {
atoms_with_comments_before.push_front(Atom::Hardline);
}
atoms_with_comments_before.push_front(comment);
blank_line_before_first_comment = blank_line_before;
}
if blank_line_before_first_comment {
atoms_with_comments_before.push_front(Atom::Blankline);
}

// Second pass: get atoms in reverse order, put comments that come after atoms at the correct position
let mut atoms_with_all_comments: VecDeque<Atom> = VecDeque::new();
let mut comments_queue: VecDeque<Atom> = VecDeque::new();
while let Some(atom) = atoms_with_comments_before.pop_front() {
if let Atom::Leaf { id, .. } = atom {
while let Some(Comment {
content,
original_column,
// We don't care about blank lines here: they only matter for comments that
// come before atoms.
..
}) = self.comments_after.entry(id).or_default().pop_front()
{
let comment_atom = Atom::Leaf {
content,
id: self.next_id(),
original_column,
single_line_no_indent: false,
multi_line_indent_all: true,
};
comments_queue.push_front(comment_atom);
}
}
if Atom::Hardline == atom || Atom::Blankline == atom {
// Append the comments, each one preceded by a space
while let Some(comment) = comments_queue.pop_back() {
atoms_with_all_comments.push_back(Atom::Space);
atoms_with_all_comments.push_back(comment);
}
}
atoms_with_all_comments.push_back(atom);
}
// If we still have comments left, add them at the end of the file
while let Some(comment) = comments_queue.pop_back() {
atoms_with_all_comments.push_back(Atom::Space);
atoms_with_all_comments.push_back(comment);
}

self.atoms = atoms_with_all_comments.into()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of see what's going on here, but I find it very hard to follow. Your comments help, but I think this big function would benefit from more documentation on the algorithm and the data structures that are being used.

self.post_process_inner();
// Comments must be processed before deletes, because comment anchors might be deleted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that happens, then does that result in an orphaned comment error?

@Niols
Copy link
Member

Niols commented Jan 21, 2025

At this point:

let dispatch = function
  | None -> (* FIXME: 404 page *) assert false

let () = ()

transforms into:

let dispatch = function
  | None -> assert false (* FIXME: 404 page *)

let () = ()

and then:

let dispatch = function
  | None -> assert false (* FIXME: 404 page *)
let () = ()

leading to an idempotency error.

@Niols
Copy link
Member

Niols commented Jan 21, 2025

type t =
  (
    string,
    string,
    (string * person Slug.t list),
    (* bool, *)
    string,
    string,
    unit, (* FIXME *)
    string
  ) gen
[@@deriving yojson]

gets formatted by current main as:

type t =
  (string, string, (string * person Slug.t list), string, string,
  (* bool, *)
  unit, (* FIXME *)
  string) gen
[@@deriving yojson]

Both the former and the latter get formatted by this PR as:

type t =
  (* bool, *)
  (string, string, (string * person Slug.t list), string, string, unit, string) gen (* FIXME *)
[@@deriving yojson]

Related to #816 for the one-line-ness of the result.

@Niols
Copy link
Member

Niols commented Jan 21, 2025

if foo (* youpi *)
then bar
else baz

becomes

if foo then bar (* youpi *)
else baz

And same without else baz.

@Niols
Copy link
Member

Niols commented Jan 21, 2025

Last comment of a file gets put on the same line as the last non-comment, eg.:

let foo = ()
(** youpi foo *)

let bar = ()
(** youpi bar *)

becomes:

let foo = ()
(** youpi foo *)

let bar = () (** youpi bar *)

This happens even with an empty line before (** youpi bar *)

@Niols
Copy link
Member

Niols commented Jan 21, 2025

As a single-line comment gets attached to the node after, it might lose wanted indentation, eg.:

[
  foo;
  (* bar; *)
]

becomes:

[
  foo;
(* bar; *)
]

and

module F = struct
  type t = unit
  (** comment for t *)
end

becomes:

module F = struct
  type t = unit
(** comment for t *)
end

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants