This doc has rough notes on the architecture of the parser.
How to Parse Shell Like a Programming Language (2019 blog post) covers some of the same material. (As of 2024, it's still pretty accurate, although there have been minor changes.)
The test suite test/lossless.sh invokes osh --tool lossless-cat $file. The lossless-cat tool does this:
Now, do the tokens "add up" to the original file? That's what we call the lossless invariant.
It will be the foundation for tools that statically understand shell:
--tool ysh-ify - change style of do done → { }, etc.
--tool fmt - fix indentation, maybe some line wrapping
The sections on re-parsing explain some obstacles which we had to overcome.
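The lossless invariant can be illustrated with a toy tokenizer. This is only a sketch: TOKEN_RE, tokenize, and is_lossless are hypothetical names, and the token pattern is far simpler than the real OSH lexer. The point is that every character, including whitespace and comments, ends up in some token, so concatenating the tokens reproduces the input.

```python
import re

# Hypothetical toy pattern (not the OSH lexer): whitespace runs, comments,
# word-like runs, and finally any single character as a catch-all.
TOKEN_RE = re.compile(r"\s+|#[^\n]*|[A-Za-z0-9_./-]+|.", re.DOTALL)


def tokenize(source):
    """Split source into tokens that cover every character exactly once."""
    return [m.group(0) for m in TOKEN_RE.finditer(source)]


def is_lossless(source):
    """Check that the tokens 'add up' to the original text."""
    return "".join(tokenize(source)) == source


src = "for x in a b; do\n  echo $x  # comment\ndone\n"
print(is_lossless(src))
```

A tool that rewrites code can then change individual tokens and print the rest unchanged, which is why whitespace and comments have to survive tokenization.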
Oils uses regex-based lexers, which are turned into efficient C code with re2c. We intentionally avoid hand-written code that manipulates strings char-by-char, since that strategy is error prone; it's inevitable that rare cases will be mishandled.
The list of lexers can be found by looking at native/fastlex.c.
echo -e
PS1 backslash escapes.
!$.
${x/foo*/replace} via conversion to ERE. We need position information, and the fnmatch() API doesn't provide it, but regexec() does.
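The last item, converting a glob to a regex in order to get match positions, can be sketched as follows. This is an illustration under assumptions, not the Oils implementation: glob_to_regex and replace_first are hypothetical helpers, only * and ? are handled, and Python's re module stands in for POSIX regexec().

```python
import re


def glob_to_regex(pattern):
    """Translate a simple glob (only * and ?) into a regex string."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # * matches any run of characters
        elif ch == "?":
            parts.append(".")    # ? matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "".join(parts)


def replace_first(s, glob_pat, replacement):
    """Roughly mimic ${s/pat/replacement} using the match span."""
    m = re.search(glob_to_regex(glob_pat), s, re.DOTALL)
    if m is None:
        return s
    start, end = m.span()  # position info that a yes/no glob match lacks
    return s[:start] + replacement + s[end:]


print(replace_first("xxfoobar", "foo*", "replace"))
```

A pure yes/no matcher can say that the pattern matches, but not where, and the substitution needs the start and end offsets.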
These constructs aren't recognized by the Oils front end. Instead, they're punted to libc:
*.py (in most cases)
@(*.py|*.sh)
strftime format strings, e.g. printf '%(%Y-%m-%d)T' $timestamp
osh/word_parse.py calls lexer.MaybeUnreadOne() to handle right parens in this case:
(case x in x) ;; esac )
This is sort of like the ungetc() I've seen in other shell lexers.
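For readers unfamiliar with the ungetc() idea, here is a generic sketch of one-token pushback. It does not show how lexer.MaybeUnreadOne() is implemented; PushbackLexer and its methods are hypothetical names used only to illustrate the general technique.

```python
class PushbackLexer:
    """Hypothetical token reader that can 'unread' a single token."""

    def __init__(self, tokens):
        self._tokens = iter(tokens)
        self._pushed = None  # at most one token of pushback

    def read(self):
        """Return the next token, or None at end of input."""
        if self._pushed is not None:
            tok, self._pushed = self._pushed, None
            return tok
        return next(self._tokens, None)

    def unread_one(self, tok):
        """Put one token back so the next read() returns it again."""
        if self._pushed is not None:
            raise RuntimeError("only one token of pushback is supported")
        self._pushed = tok


lexer = PushbackLexer(["esac", ")"])
tok = lexer.read()
lexer.unread_one(tok)      # a parser decided it read one token too far
print(lexer.read() == tok)
```

Limiting pushback to one token keeps the reader simple; anything that needs more lookahead is better handled in the parser.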
This section is about extra passes / "irregularities" at parse time. In the "Runtime Issues" section below, we discuss cases that involve parsing after variable expansion, etc.
We try to avoid re-parsing, but it happens in 4 places.
It complicates error messages with source location info. It also has implications for --tool ysh-ify and --tool fmt, because it affects the "lossless invariant".
This command is perhaps a quicker explanation than the text below:
$ grep do_lossless */*.py
...
osh/cmd.py: ...
osh/word_parse.py: ...
Where we re-parse:
1. Here documents: We first read lines, and then parse them.
    VirtualLineReader in osh/cmd_parse.py
2. Array L-values like a[x+1]=foo. bash allows splitting arithmetic expressions across word boundaries: a[x + 1]=foo. But I don't see this used, and it would significantly complicate the OSH parser.
    _MakeAssignPair in osh/cmd_parse.py has do_lossless condition
3. Backticks, the legacy form of $(command sub). There's an extra level of backslash quoting that may happen compared with $(command sub).
    _ReadCommandSubPart in osh/word_parse.py has do_lossless condition
    ysh-ify or fmt tools
4. alias expansion
    SnipCodeString in osh/cmd_parse.py
    alias ls=foo. So it doesn't affect the lossless invariant that --tool ysh-ify and --tool fmt use.
These language constructs are handled statically, but not in a single pass of parsing:
FOO=bar declare a[x]=1. We make another pass with _SplitSimpleCommandPrefix().
    s=1 doesn't cause reparsing, but a[x+1]=y does.
echo {a,b}
echo ~bob, home=~bob
This is less problematic, since it doesn't affect error messages (ctx_SourceCode) or the lossless invariant.
myfunc() { echo hi; } vs. myfunc=() # an array
shopt -s parse_equals: For x = 1 + 2*3
alias foo='ls | wc -l'. Aliases are like "lexical macros".
$PS1 and family first undergo \ substitution, and then the resulting strings are parsed as words, with $ escaped to \$.
eval
trap builtin
source: the filename is formed dynamically, but the code is generally static.
All of the cases above, plus:
(1) Recursive Arithmetic Evaluation:
$ a='1+2'
$ b='a+3'
$ echo $(( b ))
6
This also happens for the operands to [[ x -eq x ]].
Note that a='$(echo 3)' results in a syntax error. I believe this was due to the ShellShock mitigation.
(2) The unset builtin takes an LValue. (not yet implemented in OSH)
$ a=(1 2 3 4)
$ expr='a[1+1]'
$ unset "$expr"
$ argv "${a[@]}"
['1', '2', '4']
(3) printf -v takes an "LValue".
(4) Var refs with ${!x} take a "cell". (Not yet implemented in OSH. Relied on by bash-completion, as discovered by Greg Price.)
$ a=(1 2 3 4)
$ expr='a[$(echo 2 | tee BAD)]'
$ echo ${!expr}
3
$ cat BAD
2
(5) test -v takes a "cell".
(6) ShellShock (removed from bash): export -f, all variables were checked for a certain pattern.
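Case (1) above, recursive arithmetic evaluation, can be sketched in Python. This is a model of the behavior, not OSH or bash code: eval_arith is a hypothetical helper, Python's own ast module stands in for the shell's arithmetic parser, and only integers, names, and the + - * operators are supported. The key step is that a variable's string value is itself parsed and evaluated as an expression at runtime.

```python
import ast
import operator

# Assumed subset of operators, chosen for the sketch.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def eval_arith(expr, variables, depth=0):
    """Evaluate expr, re-parsing variable values as expressions."""
    if depth > 100:
        raise RecursionError("variable references nested too deeply")
    tree = ast.parse(expr.strip(), mode="eval")  # parsing happens at runtime
    return _eval_node(tree.body, variables, depth)


def _eval_node(node, variables, depth):
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return node.value
    if isinstance(node, ast.Name):
        # The variable holds a string; parse and evaluate it recursively.
        # Unset variables are treated as 0 in this sketch.
        value = variables.get(node.id, "0")
        return eval_arith(value, variables, depth + 1)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _eval_node(node.left, variables, depth)
        right = _eval_node(node.right, variables, depth)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported syntax")


print(eval_arith("b", {"a": "1+2", "b": "a+3"}))  # prints 6, as in the example
```

Because parsing happens during evaluation, a bad value only fails when it is used, which is one reason these cases need source location info at runtime.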
test / [, e.g. [ -a -a -a ]