Why Sponsor Oils? | source | all docs for version 0.25.0 | all versions | oils.pub
(March 2024)
Oils parses a lot of text, and it's becoming apparent than we need a print a lot too! The text should be nicely formatted because a shell is a user interface.
This doc describes 4 possible pretty printers in Oils. Traditional shells don't have appear to have any pretty printing.
Let's be concrete first, because there's a lot of brainstorming below.
YSH prints its JSON-like data structures with the =
keyword, which
takes a Python-like expression on the right:
Right now, it looks bad on big data structures. It should look something like
nodejs
or jq
:
We may want to omit the quotes, like nodejs
. (This syntax isn't meant to be
parsed. JSON8 may have unquoted dict keys, although it's not
essential).
go fmt
style versus PPL styleTo back up a bit, I'm writing this doc organize my thoughts, and to explain problems and requirements to contributors.
There are at least two schools of thought on pretty printers, which this lobste.rs thread has a good discussion of:
More comments on a blog post by Justin Pombrio (which I circulated):
Let's call the two styles the "go fmt
style" and the "PPL style" (functional
pretty printing language).
I was probably "biased" toward go fmt
, because the two formatters we actually
use in Oils are influenced by it:
clang-format
for our C++ code.
yapf
for our Python code.
clang-format
. (It's slow, mostly due
to being written in Python, and creating lots of tiny objects!)(Details: they have line wrapping algorithms, while go fmt
doesn't. I'm not
calling them "graph search" printers, because I think of line wrapping as a
subproblem of pretty printing.)
However, Justin's post helped me understand Wadler's printer, a popular example of the PPL style. This style might have some advantages for Oils:
go fmt
does,
etc.)node.js
-like columnar layout.(Does that last idea work?)
Sometimes I pile on too many requirements, which I mentioned in the latest release announcement:
We should do the simplest things first, and I think the PPL approach will allow that.
BTW there are many threads on #data-languages
on Zulip where I'm
brainstorming and learning about pretty printing.
Here's a sketch of what I think we need. It goes from concrete → experimental and research-y.
What is it? This is the =
keyword shown in the screenshots above. (BTW,
Lua uses the same syntax =
to evaluate expressions.)
Motivation: We should look as good as node.js
or jq
.
Currently we use our JSON printer with the options SHOW_NON_DATA | SHOW_CYCLES
.
SHOW_NOW_DATA
prints non-data objects, like <Func 0x123>
. This syntax
can't be parsed.SHOW_CYCLES
prints cycles with -->
, instead of raising an error, like
JSON does.Example:
ysh$ var d = {}
ysh$ setvar d.eggex = /dot+/ # Eggex object
ysh$ = d
(Dict 0x7feb87cb4050) {"eggex":<Eggex 0x7feb87dbfd00>}
ysh$ setvar d.cycle = d
ysh$ = d
(Dict 0x7feb87cb4050) {"eggex":<Eggex 0x7feb87dbfd00>,"cycle":{ --> 0x7feb87cb4050 }}
We should replace this with a new pretty printer that wraps lines, and has COLOR.
What is it? Zephyr ASDL is the statically typed schema language we use to implement Oils. It's "one level down" from the shell.
We used it to define the syntax of shell with algebraic data types. We create a lossless syntax tree, which is also an IR for shell.
Motivation: We already wrote an ad hoc pretty printer, and it should be replaced! It tries to fit records on a single line, and if that fails, it uses multiple lines. I think it's slow.
If we already have a YSH data printer, this printer should "obviously" be unified with it. We should have a nice separation of policy and mechanism.
Use osh -n myscript.sh
to see what it does:
Notes:
PrettyTree()
method, which converts
the typed self
or this
to the homogeneous hnode
tree.AbbreviatedTree()
is a bit like the modular printers discussed in the
lobste.rs
thread. It makes certain common data structures more
readable, with type-specific logic. It's in Python only — can that
logic also be in C++?pp asdl (obj)
, another debugging feature.osh -n benchmarks/testdata/configure-coreutils
tests a huge shell fileWhat is it? A formatter for shell code. I think it's more essential to have a YSH formatter, but an OSH formatter is also possible. They both use the same lossless syntax tree.
Motivation - Formatters make a new language easier to use, and there are many users who don't know shell well.
For example, I don't know TypeScript well, and I had a good experience with
deno fmt
. It reduced the mental load of adopting a new tool.
Justin had a nice idea on on lobste.rs
- we should create coarse tree
with these elements:
{ }
affect indentation in YSH
then elif else fi
, do done
,
etc.( )
in YSH changes the lexer from command mode to expression
modetext()
/ chunkswhile for if
begin blocks of codeWhy? We don't don't want to take responsibility for every formatting decision!
I actually think the command mode / statement formatter should be non-wrapping, while expressions can wrap. Commands will likely be more common than expressions in most YSH code.
The formatter will be invoked with ysh --tool format myfile.ysh
.
It can also be used with the output of osh --tool ysh-ify
, which roughly
translates OSH to YSH (doesn't preserve semantics). This may
help generate test data, since there's plenty of shell code in the wild,
but not much YSH code.
What is it? A more human AND machine-readable format for the syntax tree, which is actually a graph.
Motivation: The pretty-printed AST could be a parseable format. Allow users to reuse all the hard work we did parsing shell!
Printing raw ASDL data is useful, but it could be more readable with custom logic to print the natural layers of the graph. There are 4 logical layers in frontend/syntax.asdl:
source_t
describes whether shell code comes from foo.sh
or ysh -c 'echo mycode'
SourceLine
represents physical lines of codeToken
refers to portions of linescommand_t word_t word_part_t expr_t
. The leaves are
tokens.And instead of a pretty format meant for humans, we may want to print an s-expression-like format I'm calling "NIL8".
NIL8 can be parsed. You may want to write tree-shaking code to deploy YSH into containers.
More on NIL8 below. Again, it's experimental.
Risks:
Rc<T>
)'''
is significantI added some stubs in the code:
To generate Python code from the ASDL schema, run build/py.sh all
.
Otherwise, Oils is a plain Python 2 program, with a few C extensions.
C++ translation is a separate step, and it's now pretty polished.
For new contributors:
There is also a stub for the formatter:
This section has some less concrete thoughts.
I think the PPL IRs are also useful if you're not line wrapping? You can just fix indentation.
nodejs
does this in its console.log()
.
Future work? We should get the basic pretty printer working first.
Making the coarse tree has some similarity to syntax highlighting. I wrote a
simple syntax highlighter for 5+ languages called micro-syntax
, and it should
support YSH too.
Sketch:
Comment | Code | StringLiteral
{} ()
EndLineComment
| BeginLineComment
Then do a trivial linear pass to fix up indentation. The { }
or do done
tokens determine indentation.
Though honestly it's probably better to just reuse our elaborate parser at first. I like to "compress" different algorithms together, but it may not be worth it here.
What is "NIL8"? We don't know if it's a good idea yet, but it may be part of J8 Notation.
Think:
If NIL8 can represent both the lossless syntax tree and a new IR for a mycpp rewrite ("yaks"), that's a good test of the design.
Note that the AST is a statically typed data structure, which means we may also want to export the ASDL schema as NIL8!
Links:
At a high level, we're trying to nudge users toward a small set of syntaxes for shell-like programming, rather than inventing ad hoc syntax every time. String literals are a pain point: people often implement them badly, or not at all.
I think we should have shared infrastructure for 3 printers:
And then there's this idea of "replacing" or rationalizing the ASDL syntax tree with "NIL8":
Pretty printing is adjacent to other fun problems in Oils, like GC performance, "boxless" optimization, etc.
Things to think about:
(1) Unified Code Representation for Oils
We want to pack all these tools into Oils:
ysh-ify
- a VERY rough translation of OSH to YSH
Related: Retrospective on the Go ast
package.
(2) Yaks - mycpp from the bottom up
NIL8 leads into "Yaks", which is an IR for garbage collected C++.
(3) Pretty printing will cause many small allocations.
I think that naive implementations should be fast enough. If not, any slowness may be due to allocation, not necessarily the pretty printing algorithm itself.
Some solutions:
This is about ref counting for printing graph-shaped data.
I no longer think this is as important. I think we should manually construct 4 layers of the graph, as described in the section on formatter (4).
value_t
)Similar to JSON / JSON8 printing, except we
...
instead of repeatingThis is a global pass that computes a Dict[int, int]
object ID -> number of times referenced in the graph
The graph is specified by single root node, e.g. the argument to
pp value (obj)
Pass this dict into the second step.
value.List -> hnode.Compound with []
value.Dict -> hnode.Compound with {}
null, true, false -> Atom
Cycle detected -> Atom, with { --> 4beef2 }
[ --> 4beef2 ]
Repetition:
{ key: { ... 4beef2 }, key2: { ... 4beef2 }
Or maybe omit the type, since strings don't have that:
{ key: ... 4beef2, key2: ... 4beef2 }
The schema looks like this now?
hnode =
Atom(str s, color color) - # External objects can use this?
| Compound(hnode* items)
The length of 'str s' is the input to line wrapping.
TODO: what's the heuristic here? Is it global?
Dynamic programming?
do we insert hnode.Newline() or something?
Reduce it to the case above.
We do this all at once?
Because to convert to value.Record, you have to do cycle detection anyway.
And that's similar to ref counting.
value =
...
| Record(str type_name, Dict[str, value_t] fields)
The special "-" key can be used for JSON:
{"-": "command.Simple, "name": "hi"}
Though this loses some information, and it doesn't solve the problem with shared references. We would need Packle for that.
Is this separate? Or part of step 2.
We need something between value.Record and hnode.Compound to do ABBREVIATION:
Also need nodes for
...
means already printed---
means CANNOT print, because it's a cycle@1f23
- ID if already printed, or in a cycleIdentical to the dynamically typed case above.