-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Process comments (and extra nodes) separately from the rest of the code #500
base: main
Are you sure you want to change the base?
Conversation
In order to fix #430, this should be done for OCaml attributes too. They are not included with the current heuristic. Attributes can appear almost everywhere in OCaml. To avoid overly complicating the grammar, tree-sitter-ocaml really allows them everywhere, even in some invalid places. If you replace the comments in the example of #489 with attributes, the idempotency rule will still be violated (the input will be parseable by tree-sitter, but the result will not be). But since the code was invalid OCaml to begin with, that's not a huge problem. |
I think this heuristic can be avoided by using the languages.toml file. |
48bfcc0
to
f5da50b
Compare
f5da50b
to
936a925
Compare
936a925
to
43a106e
Compare
43a106e
to
74199c4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly follow this and it feels like it's on the right track. I did a commit-wise review, so some of my suggestions against earlier commits may no longer apply, or apply elsewhere.
I appreciate the modularisation of comments and types and the comments you added to the source to describe what's going on. This is quite a tricky operation, so those comments are super-encouraged to save our future selves' sanity 😅
d566486
to
edc5e08
Compare
723f95f
to
5fb3642
Compare
0b8f39b
to
765dda7
Compare
2a0f34b
to
c20032b
Compare
If there were the following atoms in succession: `[Space, Hardline, IndenStart]` then the function would process them as `[Empty, Hardline, IndentStart]` instead of `[Empty, IndentStart, Hardline]` because the whitespace collation code would skip a step.
c20032b
to
0e1d3a1
Compare
a2990d4
to
e2a01c9
Compare
019848b
to
8d48d37
Compare
Instead of using ad hoc logic to identify comment nodes, this commit uses a separate query file.
8d48d37
to
7a31f23
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is as awesome as it is epic! Amazing work 🙇
-
I've gone through this pretty thoroughly, but I must confess to not fully understanding everything. That said, you've named things very clearly, so I've used those names as a crux wherever I don't follow. I've added comments if/when this "method" breaks down a bit.
-
The PR description is very clear and I think -- if updated to reflect some of the changes since it was written -- it would make for good documentation in the
README
. (I've added some notes about the PR description below.) -
With regard to line-embedded comments /* like this */ what's the reason why they can't be anchored to the node (or the sibling node) immediately before the comment?
(Similarly, for "orphaned" comments: Can they just assume the formatting of the current context, appended with a hardline, or is that not a safe fallback?)
-
Meta: It's a real shame that Tree-sitter doesn't have a way, AFAIK, to introspect grammars. It would be really nice if one could just get a list of all "extra" nodes in any grammar, then you wouldn't need to bother with the
language.comment.scm
files... Perhaps worth an issue/feature request in the Tree-sitter repo 🤷 -
Meta: @Niols, would you be able to experiment with this branch against your OCaml codebase(s), paying attention to how it affects comment formatting? It would be really interesting to see how it plays with real world code.
Notes on the PR description, so it can be converted into documentation easily:
Identifying comments
-
It would be good to expand on why this isn't ideal, for the sake of the documentation:
- We can't use the same mechanism we use for leaf identification, i.e. a formatting directive in the query file that would look like
(block_comment) @comment
. Indeed, it would mean either running the query twice, or accounting for the presence of comments in the rest of the query file, both situations we want to avoid.
- We can't use the same mechanism we use for leaf identification, i.e. a formatting directive in the query file that would look like
-
Update this to match that you've gone with the second option; that is, comment query files:
- The current implementation selects comment nodes from the CST with the following heuristics:
node.is_extra() && node.kind().to_string().contains("comment")
. I think it should capture all types of comments for all supported languages, but I haven't checked that yet. - Another option would be to have a separate query file for comments, one for each language, which would contain selecting and formatting directives for comments. It could also contain instructions on how to re-insert comments back in the code, once it's formatted.
- The current implementation selects comment nodes from the CST with the following heuristics:
Anchoring comments
-
Your heuristics make perfect sense to me and seem eminently sensible. Some ASCII-art diagrams for the first two cases would make this section ideal for the documentation.
It would also be worth documenting the:
- interleaved comment edge-case (modulo my question, above);
- and the case that the OCaml sample exposes, when comments are put after their node.
-
Presumably, the second sentence in this heuristic isn't implemented:
- If the comment node is alone on its line, anchor it to its next sibling in the CST. If it's the last of its siblings, anchor it to its previous sibling instead.
That's why we see final child comments losing their indent level, for example.
-
It looks like you bail out in the only child case:
- I haven't thought at what it would mean for a comment node to be an only child in the CST, nor if that case can actually happen. In that case, it would probably be a safe bet to anchor it to its parent node.
$ cargo run -- fmt -lbash <<'SH' > ( > # foo > ) > SH [2025-01-13T16:10:48Z ERROR topiary] Found an anchored comment, but could not attach it back to its anchor # foo The anchor was CommentedAfter { section: InputSection { start: Position { row: 2, column: 3 }, end: Position { row: 2, column: 3 } }, blank_line_after: false, blank_line_before: false }
I agree with you that instances of only child comments are likely to be rare. Perhaps an option could be to just assume the formatting imposed by the parent as a fallback. Does that even make sense?
Re-inserting comments
- I think this reads well; it just needs to be tweaked to change from the subjunctive to the indicative.
} | ||
|
||
// Empty block inside of impl function | ||
impl MyTrait for MyStruct { | ||
fn foo() { | ||
// ... logic ... | ||
// ... logic ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this not a case of an only child comment? How come this one doesn't bail out?
; just doing _ above doesn't work, because it matches the final named node as | ||
; well as the final non-named node, causing double indentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the second sentence of this part of the heuristic happen?
If the comment node is alone on its line, anchor it to its next sibling in the CST. If it's the last of its siblings, anchor it to its previous sibling instead.
I haven't looked at the implementation yet, so perhaps that part is not done.
@@ -0,0 +1,2 @@ | |||
; Identify nodes for comment processing | |||
(comment) @comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// The user specified a query file | ||
Some(p) => p, | ||
Some(p) => (p, None), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if --query
is used, then the comment query is None
? That feels a bit surprising and that ought to be clearly documented.
Alternatively, what do you think about: If the query file is p
, then the comment query file is:
Some(q)
, whereq := replace(p, '${language}.scm', '${language}.comment.scm')
, ifq
exists.None
, otherwise.
Similar to what you do in topiary-config/src/language.rs
... Too magical?
@@ -13,6 +13,9 @@ pub struct Language { | |||
/// The Query Topiary will use to get the formating captures, must be | |||
/// present. The topiary engine does not include any formatting queries. | |||
pub query: TopiaryQuery, | |||
/// The Query Topiary will use to determine which nodes are comments. | |||
/// When missing, ther ewill be no separate comment processing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// When missing, ther ewill be no separate comment processing. | |
/// When missing, there will be no separate comment processing. |
/// Maps node IDs to comments before that node. Comments are stored in reading order. | ||
comments_before: HashMap<usize, VecDeque<Comment>>, | ||
/// Maps node IDs to comments after that node. Comments are stored in reading order. | ||
comments_after: HashMap<usize, VecDeque<Comment>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All comments before/after, or just those immediately before/after?
// REVIEW: the double borrow is sort of weird, is this idiomatic? | ||
if node_section.contains(&(&comment).into()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't speak to whether it's idiomatic (@yannham?); however, it makes sense here:
comment
is anAnchoredComment
and you only implementFrom<&AnchoredComment> for InputSection
; hence the first reference. You need to usecomment
later, so taking a reference makes sense otherwise you'll get a borrow violation.InputSection::contains
's argument expects a reference; hence the second. Here, the convertedInputSection
is only used once, so maybe one could implementFrom<&AnchoredComment> for &InputSection
, but that would probably require careful lifetime annotations.
My opinion is that the double-&
is probably a better option than an additional, possibly complex impl
that's only called once.
match comment.commented { | ||
Commented::CommentedAfter { .. } => self | ||
.comments_before | ||
.entry(id) | ||
.or_default() | ||
.push_back((&comment).into()), | ||
Commented::CommentedBefore(_) => self | ||
.comments_after | ||
.entry(id) | ||
.or_default() | ||
.push_back((&comment).into()), | ||
} | ||
} else { | ||
comments.push(comment); | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All comments before/after, or just those immediately before/after?
I believe this code answers my question: It's the just comments immediately before/after...I think.
let comment_atom = Atom::Leaf { | ||
content, | ||
id: self.next_id(), | ||
original_column, | ||
single_line_no_indent: false, | ||
multi_line_indent_all: true, | ||
}; | ||
comments_queue_with_context.push_front(CommentWithContext { | ||
comment: comment_atom, | ||
blank_line_after, | ||
blank_line_before, | ||
}); | ||
} | ||
atoms_with_comments_before.push_front(atom); | ||
} else if Atom::Hardline == atom || Atom::Blankline == atom { | ||
let mut blank_line_before_first_comment = false; | ||
// Prepend the comments, each one followed by a newline | ||
while let Some(CommentWithContext { | ||
comment, | ||
blank_line_after, | ||
blank_line_before, | ||
}) = comments_queue_with_context.pop_back() | ||
{ | ||
if blank_line_after { | ||
atoms_with_comments_before.push_front(Atom::Blankline) | ||
} else { | ||
atoms_with_comments_before.push_front(Atom::Hardline); | ||
} | ||
atoms_with_comments_before.push_front(comment); | ||
blank_line_before_first_comment = blank_line_before; | ||
} | ||
if blank_line_before_first_comment && atom == Atom::Hardline { | ||
atoms_with_comments_before.push_front(Atom::Blankline); | ||
} else { | ||
atoms_with_comments_before.push_front(atom); | ||
} | ||
} else { | ||
atoms_with_comments_before.push_front(atom); | ||
} | ||
} | ||
let mut blank_line_before_first_comment = false; | ||
// If we still have comments left, add them at the beginning of the file | ||
while let Some(CommentWithContext { | ||
comment, | ||
blank_line_after, | ||
blank_line_before, | ||
}) = comments_queue_with_context.pop_back() | ||
{ | ||
if blank_line_after { | ||
atoms_with_comments_before.push_front(Atom::Blankline) | ||
} else { | ||
atoms_with_comments_before.push_front(Atom::Hardline); | ||
} | ||
atoms_with_comments_before.push_front(comment); | ||
blank_line_before_first_comment = blank_line_before; | ||
} | ||
if blank_line_before_first_comment { | ||
atoms_with_comments_before.push_front(Atom::Blankline); | ||
} | ||
|
||
// Second pass: get atoms in reverse order, put comments that come after atoms at the correct position | ||
let mut atoms_with_all_comments: VecDeque<Atom> = VecDeque::new(); | ||
let mut comments_queue: VecDeque<Atom> = VecDeque::new(); | ||
while let Some(atom) = atoms_with_comments_before.pop_front() { | ||
if let Atom::Leaf { id, .. } = atom { | ||
while let Some(Comment { | ||
content, | ||
original_column, | ||
// We don't care about blank lines here: they only matter for comments that | ||
// come before atoms. | ||
.. | ||
}) = self.comments_after.entry(id).or_default().pop_front() | ||
{ | ||
let comment_atom = Atom::Leaf { | ||
content, | ||
id: self.next_id(), | ||
original_column, | ||
single_line_no_indent: false, | ||
multi_line_indent_all: true, | ||
}; | ||
comments_queue.push_front(comment_atom); | ||
} | ||
} | ||
if Atom::Hardline == atom || Atom::Blankline == atom { | ||
// Append the comments, each one preceded by a space | ||
while let Some(comment) = comments_queue.pop_back() { | ||
atoms_with_all_comments.push_back(Atom::Space); | ||
atoms_with_all_comments.push_back(comment); | ||
} | ||
} | ||
atoms_with_all_comments.push_back(atom); | ||
} | ||
// If we still have comments left, add them at the end of the file | ||
while let Some(comment) = comments_queue.pop_back() { | ||
atoms_with_all_comments.push_back(Atom::Space); | ||
atoms_with_all_comments.push_back(comment); | ||
} | ||
|
||
self.atoms = atoms_with_all_comments.into() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of see what's going on here, but I find it very hard to follow. Your comments help, but I think this big function would benefit from more documentation on the algorithm and the data structures that are being used.
self.post_process_inner(); | ||
// Comments must be processed before deletes, because comment anchors might be deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that happens, then does that result in an orphaned comment error?
At this point: let dispatch = function
| None -> (* FIXME: 404 page *) assert false
let () = () transforms into: let dispatch = function
| None -> assert false (* FIXME: 404 page *)
let () = () and then: let dispatch = function
| None -> assert false (* FIXME: 404 page *)
let () = () leading to an idempotency error. |
type t =
(
string,
string,
(string * person Slug.t list),
(* bool, *)
string,
string,
unit, (* FIXME *)
string
) gen
[@@deriving yojson] gets formatted by current type t =
(string, string, (string * person Slug.t list), string, string,
(* bool, *)
unit, (* FIXME *)
string) gen
[@@deriving yojson] Both the former and the latter get formatted by this PR as: type t =
(* bool, *)
(string, string, (string * person Slug.t list), string, string, unit, string) gen (* FIXME *)
[@@deriving yojson] Related to #816 for the one-line-ness of the result. |
if foo (* youpi *)
then bar
else baz becomes if foo then bar (* youpi *)
else baz And same without |
Last comment of a file gets put on the same line as the last non-comment, eg.: let foo = ()
(** youpi foo *)
let bar = ()
(** youpi bar *) becomes: let foo = ()
(** youpi foo *)
let bar = () (** youpi bar *) This happens even with an empty line before |
As a single-line comment gets attached to the node after, it might lose wanted indentation, eg.: [
foo;
(* bar; *)
] becomes: [
foo;
(* bar; *)
] and module F = struct
type t = unit
(** comment for t *)
end becomes: module F = struct
type t = unit
(** comment for t *)
end |
This PR is an attempt at processing the comments separately from the rest of the code, so that formatting queries can be both resilient to the presence of random comments, and simple to write.
Identifying comments
This is a challenge in itself:
(block_comment) @comment
. Indeed, it would mean either running the query twice, or accounting for the presence of comments in the rest of the query file, both situations we want to avoid.node.is_extra() && node.kind().to_string().contains("comment")
. I think it should capture all types of comments for all supported languages, but I haven't checked that yet.Anchoring comments
Each comment should be anchored to a non-comment node of the code. Ideally, it should be the node it's supposed to comment, but such semantics can't be deduced from the CST alone. This PR uses the following heuristics instead:
Extracting comments
tree-sitter
grammars and queries don't offer lots of tools to edit an existing CST. The only reasonable way I've found is to useinput.replace_range()
to remove the comment's bytes from the input, thentree.edit()
to mark a node as edited in the CST, then finallyreparse(old_tree, new_input, grammar)
to get the new CST, without comments. The query file would then be applied to this new CST.Re-inserting comments
This is a part I haven't had time to experiment with, but I think re-insertion should be done after processing all append/prepend directives, but before post-processing. I imagine something like this should work:
(anchor).(space).(comment)
(line breaks should already have been taken care of).(comment).(line break).(anchor)
.State of the PR