Parsing JSON in Forty Lines of Awk
Mohamed Akram · 2025-03-09 · via seize the dev

JSON is not a friendly format for the Unix shell: it is hierarchical, and it cannot be reasonably split on any character (other than the newline, which is not very useful), because that character might appear inside a string. There are well-known tools such as jq that let you parse JSON documents correctly in the shell, but they all require an additional dependency. Another option is Python, which is ubiquitous enough that it can be expected on virtually every machine and would be the recommended choice for new projects.
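
To see why naive splitting fails, here is a small sketch (the sample document is made up for illustration) that splits a JSON object on commas with awk. The comma inside the title string is treated as a separator, so that value comes out in two pieces:

```shell
#!/bin/sh
# Hypothetical sample document with a comma inside a string value
json='{"title":"Hello, world","tags":["a","b"]}'

# Split on commas the way a line-oriented tool would, printing each piece
pieces=$(printf '%s\n' "$json" | awk -F, '{ for (i = 1; i <= NF; i++) print i ": " $i }')
printf '%s\n' "$pieces"
```

The script prints four pieces rather than two members, and the second piece is the string fragment ` world"`.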

However, I already had a working POSIX shell script that now had a requirement to read and parse JSON. It had previously extracted values from HTML which, while also being hierarchical, can be reliably split on certain characters (the angle brackets) for basic extraction of values. awk is the closest thing to a real programming language that’s available in the POSIX shell, so I thought I’d try to write a basic JSON parser in it. I had already written a full-blown one before, so I knew it was doable, but I needed something more concise.

First, some caveats. JSON is notoriously tricky to get completely right, despite its simple grammar. The following code assumes that it will be fed valid JSON. It performs some basic validation as a by-product of parsing and will most likely throw an error if it encounters something strange, but there are no guarantees beyond that. In my case, I’m reading JSON from a single, trusted source, so this is an acceptable constraint.

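The interface is simple: a single function that accepts a JSON document and a dotted path to a key or array index and returns the corresponding value. An optional third argument is an array that is filled in when the path points to an object or array. It can be used like so: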

# Get one value
name = decode_json_string(get_json_value(json, "author.name"))

# Loop over an object
get_json_value(json, "dependencies", deps)
for (name in deps)
	version = decode_json_string(deps[name])

# Loop over an array
get_json_value(json, "payload.items", items)
for (i = 0; items[i]; i++) {
	get_json_value(items[i], null, item)
	type = decode_json_string(item["type"])
	name = decode_json_string(item["name"])
}
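
Note that `null` in the array loop is not a keyword. awk has no null literal, so it is simply a variable that is never assigned, and an unassigned variable compares equal to both the empty string and zero. The following sketch shows how a function can use that to tell whether an argument was left out; `is_unset` is a hypothetical helper name, not part of the parser:

```shell
#!/bin/sh
flags=$(awk '
# Returns 1 only when v was never assigned: an unassigned variable
# compares equal to both "" and 0, which no assigned value does
function is_unset(v) { return v == "" && v == 0 }

BEGIN {
	# null is an ordinary variable name that is never assigned
	print is_unset(null), is_unset(""), is_unset(0), is_unset("0")
}')
printf '%s\n' "$flags"
```

Only the first call reports an unset value. An empty string and the number zero are both told apart from a missing argument, which is what lets the parser treat an omitted key as a request for the whole value.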

To keep things simple, the same function handles both arrays and objects. In JavaScript, arrays are roughly equivalent to objects with integer keys, and we use the same approach here. This is the implementation, expanded and annotated:

# The function takes three parameters: the JSON object/array, the desired key,
# and an optional array to be filled if the key points to an object or array.
# The rest are local variables (awk only allows local variables in the form
# of function parameters)
function get_json_value( \
	s, key, a,
	skip, type, all, rest, isval, i, c, k, null \
) {
	# Trim leading whitespace, if any
	if (match(s, /^[[:space:]]+/)) s = substr(s, RLENGTH+1)

	# Get the type of value by its first character
	type = substr(s, 1, 1)

	# This variable is needed for when we recursively call the function
	# It will be true if the key argument is undefined, since such
	# variables can behave as either a string or a number in awk
	all = key == "" && key == 0

	# If this is a primitive
	if (type != "{" && type != "[") {
		# Ensure a key is not passed
		if (!all) error("invalid json array/object " s)

		# Parse the value
		if (!match(s, /^(null|true|false|"(\\.|[^\\"])*"|[.0-9Ee+-]+)/))
			error("invalid json value " s)

		# And return it
		return substr(s, 1, RLENGTH)
	}

	# Get the first part of the key (which we will be looking for)
	# if the path is dotted and save the rest for now
	if (!all && (i = index(key, "."))) {
		rest = substr(key, i+1)
		key = substr(key, 1, i-1)
	}

	# isval keeps track of whether we are looking at a JSON key or value
	# In an array, all items are values
	# k is the current key
	# If this is an array, it is the index, which starts at 0
	if ((isval = type == "[")) k = 0

	# Loop over the characters in the provided JSON
	# Skip the opening brace or bracket (to avoid infinite recursion) and
	# increment the index by the length of the token
	for (i = 2; i <= length(s); i += length(c)) {
		# Skip over whitespace
		if (match(substr(s, i), /^[[:space:]]+/)) {
			c = substr(s, i, RLENGTH)
			continue
		}

		# Temporarily assign the first character to our token variable
		c = substr(s, i, 1)

		# If it's a closing brace or bracket, we've reached the end of
		# the object or array, so exit the loop
		if (c == "}" || c == "]") break

		# If we find a comma in an object, the next item will be a key,
		# so reset isval. If it's an array, increment the index
		else if (c == ",") { if ((isval = type == "[")) ++k }

		# If we see a colon, the next token will be a value
		else if (c == ":") isval = 1

		# Otherwise, we expect a JSON value
		else {
			# If the key matches, this is our desired value,
			# so pass the rest of the key and return the result
			if (!all && k == key && isval)
				return get_json_value(substr(s, i), rest, a)

			# Otherwise, get the full value
			c = get_json_value(substr(s, i), null, null, 1)

			# And add it to the associative array
			if (all && !skip && isval) a[k] = c

			# If this is a string and we're not expecting a value,
			# then it's a key, so trim the quotes and save it
			if (c ~ /^"/ && !isval) k = substr(c, 2, length(c)-2)
		}
	}

	# Do a basic check that the object or array was properly closed
	if ((type == "{" && c != "}") || (type == "[" && c != "]"))
		error("unterminated json array/object " s)

	# If we're here, it means we didn't find the value we're looking for
	# so only return something if the whole array or object was requested
	if (all) return substr(s, 1, i)
}
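
The pattern used for primitives is intentionally loose, which is part of why the parser expects valid input. The sketch below wraps that same pattern in a hypothetical helper, `primitive_len`, and prints how many characters it accepts at the start of a few made-up inputs:

```shell
#!/bin/sh
lengths=$(awk '
# Length of the primitive token at the start of s, or 0 if there is none
function primitive_len(s) {
	return match(s, /^(null|true|false|"(\\.|[^\\"])*"|[.0-9Ee+-]+)/) ? RLENGTH : 0
}

BEGIN {
	print primitive_len("\"a\\\"b\" rest"), \
	      primitive_len("-12e3,"), \
	      primitive_len("1.2.3"), \
	      primitive_len("tru")
}')
printf '%s\n' "$lengths"
```

This prints `6 5 5 0`: the escaped quote stays inside the string, the number stops at the comma, `1.2.3` is accepted whole although it is not a valid JSON number, and the truncated `tru` is rejected.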

To make the parser more useful, you’ll also need a function that decodes JSON strings. This simple one handles every escape except Unicode escape sequences (\uXXXX), and throws an error if it encounters one:

function decode_json_string(s, out, esc) {
	if (s !~ /^"./ || substr(s, length(s), 1) != "\"")
		error("invalid json string " s)

	s = substr(s, 2, length(s)-2)

	esc["b"] = "\b"; esc["f"] = "\f"; esc["n"] = "\n"; esc["\""] = "\""
	esc["r"] = "\r"; esc["t"] = "\t"; esc["/"] = "/" ; esc["\\"] = "\\"

	while (match(s, /\\/)) {
		if (!(substr(s, RSTART+1, 1) in esc))
			error("unknown json escape " substr(s, RSTART, 2))
		out = out substr(s, 1, RSTART-1) esc[substr(s, RSTART+1, 1)]
		s = substr(s, RSTART+2)
	}

	return out s
}

And finally, since there is no built-in error function in awk, you can use something like this:

function error(msg) {
	printf "%s: %s\n", ARGV[0], msg > "/dev/stderr"
	exit 1
}