Wait, how do I scan text again?
This article is part of a series.
- Part 1: What if I just copy-paste from the web?
- Part 2: How do you get messages to Swift directly?
- Part 3: Okay, but how about all the way up to the View?
- Part 4: How to do some basic file handling?
- Part 5: How do custom Encoder's work?
- Part 6: And what can I make a custom Encoder do?
- Part 7: This Article
- Part 8: Date Parsing. Nose wrinkle.
- Part 9: What would be a very simple working Decoder?
The encoder spun our data out into a thread using serialization. A decoder crochets that thread back into an object.
If the data coming into the decoder is highly structured and narrow in scope it can actually be kind of relaxing to parse. In the Swift Talk decoder example (video (paywall) | github) the parsing ends up being pretty straightforward because their encoder only takes in a Route, and the decoder only produces a Route.
The general purpose mayhem of taking in any Codable that SimpleCoder performs is the whole point of an Encoder/Decoder pair (they are about the format, not the type) but it’s a lot to tackle in one bite. I’m not even using brackets to help calculate object relationships. Oof.
As a result, making the decoder will be a much slower job. The plan:
- remember how to inspect strings in general
- decide about Dates specifically
- write a really really simple decoder to walk through the process
- inspect other people's decoders
- write the string -> tree step for SimpleCoder
- finish the full decoder (with round trip tests)
In this post
First on the list, I’m going to catalog all the ways I know how to scan strings for information.
- String.builtinThing (0.000195)
- Moving the index (0.000113)
- Regex (literal: 0.000399, builder: 0.000176)
- Scanner (v1 0.0000439, v2 0.0000512)
- Scanner + Dynamic Sequence (0.0000261)
- Custom char scanner (0.0000246)
The number in parentheses is the average time to turn one example string into a Dictionary according to my XCTest setup. I’m focused on their relative order of magnitude.
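func testStaticParserJ() throws {
    measure {
        do {
            // The result is discarded; only the timing matters here.
            _ = try HousePlantTest.parse_charScanner4(serializedHousePlant)
        } catch {
            // Fail this test instead of crashing the whole test run.
            XCTFail("parse threw: \(error)")
        }
    }
}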
I’m going to delay talking about making a Parser Combinator or other more substantial parser class because that would be complete overkill for a “MyFirstDecoder” Decoder, even for me.
Using built in String functions
Let’s imagine there is a HousePlant struct and an encoded example. As long as we can get that string into a Dictionary<String, String>, we can make a HousePlant out of it. (I’m going to punt on Date parsing for now.)
//Playground style
import Foundation // needed for trimmingCharacters(in:)
struct HousePlant:Codable {
let commonName:String
let whereAcquired:String?
let dateAcquired:String
let dateOfDeath:String?
static func parse(_ inputString:String) throws -> Dictionary<String,String> {
let fullKeyValueList = inputString.split(separator: "/")
var result:Dictionary<String,String> = [:]
return fullKeyValueList.reduce(into: result){ result, item in
let kvSplit = item.split(separator: ":")
let key = String(kvSplit[0]).trimmingCharacters(in: .whitespacesAndNewlines)
let value = String(kvSplit[1]).trimmingCharacters(in: .whitespacesAndNewlines)
result[key] = value
}
}
}
extension HousePlant {
init?(_ inputString:String) {
guard let dictionary = try? HousePlant.parse(inputString) else {
return nil
}
print("made dictionary", dictionary)
if let cN = dictionary["commonName"] {
print("commonName")
self.commonName = cN
} else {
return nil
}
if let dA = dictionary["dateAcquired"] {
print("dateAcquired")
self.dateAcquired = dA
} else {
return nil
}
self.whereAcquired = dictionary["whereAcquired"]
self.dateOfDeath = dictionary["dateOfDeath"]
}
}
let serializedHousePlant = "commonName:spider plant/whereAcquired:Trader Joe's/dateAcquired: 2024-03-12"
let housePlant = HousePlant(serializedHousePlant)
print(housePlant ?? "no plant")
String and StringProtocol have many built-in functions to do basic scanning. In this case a split and a trim were good enough.
Move the index
The current parser works, but it could get into trouble. Imagine our starting text was instead
"commonName: spider plant /whereAcquired :Trader Joe's/dateAcquired: 2024-03-12 16:12:07"
The value contains a delimiter.
One way to accommodate strings with the delimiter character in them is to just stop at the first occurrence of the delimiter. The quick and dirty way to do that would be to update the current code to .split(separator: ":", maxSplits: 1).
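A minimal sketch of that quick fix, assuming the same "/" and ":" delimiters as above. The function name parseWithMaxSplits is made up for this illustration, and unlike the original it silently skips items that have no colon rather than crashing.

```swift
import Foundation

// Hypothetical helper: the post's split-based parser with maxSplits added.
func parseWithMaxSplits(_ input: String) -> [String: String] {
    var result: [String: String] = [:]
    for item in input.split(separator: "/") {
        // maxSplits: 1 yields at most two pieces, so any later ":" stays in the value.
        let kv = item.split(separator: ":", maxSplits: 1)
        // Assumption for this sketch: items without a ":" are skipped.
        guard kv.count == 2 else { continue }
        let key = kv[0].trimmingCharacters(in: .whitespacesAndNewlines)
        let value = kv[1].trimmingCharacters(in: .whitespacesAndNewlines)
        result[key] = value
    }
    return result
}

let parsed = parseWithMaxSplits(
    "commonName: spider plant /whereAcquired :Trader Joe's/dateAcquired: 2024-03-12 16:12:07"
)
print(parsed["dateAcquired"] ?? "missing") // 2024-03-12 16:12:07
```

The time portion survives because only the first colon of each item is treated as the key/value delimiter.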
Alternatively we can switch to moving through the string with a String.Index. An advantage is that there’s no longer any risk of an index out of bounds crash (like the one kvSplit[1] invites when an item has no colon).
static func parse_stringIndex(_ inputString:String) throws -> Dictionary<String,String> {
let fullKeyValueList = inputString.split(separator: "/")
var result:Dictionary<String,String> = [:]
return try fullKeyValueList.reduce(into: result){ result, item in
if let firstColonIndex = item.firstIndex(of: ":") {
let key = item
.prefix(upTo: firstColonIndex)
.trimmingCharacters(in: .whitespacesAndNewlines)
let value = item
.suffix(from: item.index(firstColonIndex, offsetBy: 1))
.trimmingCharacters(in: .whitespacesAndNewlines)
result[String(key)] = String(value)
} else {
throw HousePlantError.notAKeyValuePair
}
}
}
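The function above throws HousePlantError.notAKeyValuePair, which is not defined anywhere in this post. Below is a minimal sketch of what that error type might look like, plus a single-item helper (splitKeyValue, a hypothetical name) that uses the same firstIndex(of:) idea so both the success and the failure path can be seen.

```swift
import Foundation

// Assumed definition: the post uses this error but does not show it.
enum HousePlantError: Error {
    case notAKeyValuePair
}

// Hypothetical helper: splits one "key:value" item at its first colon.
func splitKeyValue(_ item: Substring) throws -> (key: String, value: String) {
    guard let colon = item.firstIndex(of: ":") else {
        throw HousePlantError.notAKeyValuePair
    }
    // Everything before the first colon is the key.
    let key = item[..<colon].trimmingCharacters(in: .whitespacesAndNewlines)
    // Everything after it is the value, even if it holds more colons.
    let value = item[item.index(after: colon)...]
        .trimmingCharacters(in: .whitespacesAndNewlines)
    return (key, value)
}

let good = try? splitKeyValue("dateAcquired: 2024-03-12 16:12:07")
let bad = try? splitKeyValue("no delimiter here")
print(good?.value ?? "nil") // 2024-03-12 16:12:07
print(bad == nil)           // true
```

A missing colon becomes a thrown error the caller can handle, rather than a crash on a bad array subscript.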
RegEx example
I love regular expressions, although some do not. Understanding RegEx as a DSL for writing lexical analysis state machines improved how I use them. I’ve documented using them in Swift before to detect the header of a USD file (RegExBuilder version, back tracking to a literal)
- SIDEBAR: To see the pros detect a header look at the Package Manager ToolsVersionParser. It’s a 685 line file.
The non-regex version works fine for this, so I don’t think I’d actually bother in a real project.
I typically start making a regex by opening https://regex101.com. The scanner will need to capture a repeated group. The examples below were created with the /gm flags and the original test string.
Here’s the evolution process:
- ([^\/:]+) : works. simple. Tada. (one or more “not / or :” greedy.)(every item, could interleave.)
- (?:(?:\A|\/)([^\/]+)(?:$|\/)) : Explicit “between /”, consumes the /. Doesn’t grab middle groups.
- (?<=\/|^)(.*?)(?=\/|$) : look ahead and behind version, switch to .*? (not greedy anything until delimiter)
- (?<=\/|^)(?:(.*?):(.*?))(?=\/|$) : splits
- (?:^|\A|\G)(.*?)(?:\/|$) : remove lookbehind, no split
- (?:^|\A|\G)(?:(.+?):(.+?))(?:\/|$) : return the split
- (?:^|\A|\G)(?:\s*(.+?)\s*:\s*(.+?)\s*)(?:\/|$) : trim white space
- (?:^|\A|\G)(?:\s*(?<key>.+?)\s*:\s*(?<value>.+?)\s*)(?:\/|$) : add names
The final regex actually works for the new test string with the extra colon in it since the non-greedy indicator (the ? in (.+?)) will stop at the first : match.
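To see the effect of that ? in isolation, here is a small sketch comparing a reluctant and a greedy key capture on a single item. The two patterns are simplified stand-ins, not the full regex above. It assumes Swift 5.7 or later and uses the #/.../# extended-delimiter literal, which should not need the bare-slash compiler flag.

```swift
let item = "dateAcquired: 2024-03-12 16:12:07"

// Reluctant: the first group grows only until the first ":" lets the rest match.
let reluctantKey = #/(.+?):(.+)/#
// Greedy: the first group grabs everything, then backtracks to the last ":".
let greedyKey = #/(.+):(.+)/#

let reluctantResult = item.firstMatch(of: reluctantKey).map { String($0.output.1) }
let greedyResult = item.firstMatch(of: greedyKey).map { String($0.output.1) }

print(reluctantResult ?? "no match") // dateAcquired
print(greedyResult ?? "no match")    // dateAcquired: 2024-03-12 16:12
```

With the greedy version the colons in the time would end up inside the key, which is the failure the non-greedy quantifier avoids.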
static func parse_regex(_ inputString:String) throws -> Dictionary<String,String>{
let pattern = /(?:^|\A|\G)(?:\s*(?<key>.+?)\s*:\s*(?<value>.+?)\s*)(?:\/|$)/
var dictionary:Dictionary<String, String> = [:]
let matches = inputString.matches(of: pattern)
for match in matches {
dictionary[String(match.output.key)] = String(match.output.value)
}
return dictionary
}
- NOTE: to get this to run in a Swift Package (swift-tools-version: 5.10) with the /regex/ notation add the following to the Package.swift file.
let swiftSettings: [SwiftSetting] = [
.enableUpcomingFeature("BareSlashRegexLiterals"),
]
for target in package.targets {
target.swiftSettings = target.swiftSettings ?? []
target.swiftSettings?.append(contentsOf: swiftSettings)
}
The same regex done with the RegEx Builder would look something like (generator to get started):
static func parse_regexLong(_ inputString:String) -> Dictionary<String,String>{
let key = Reference(Substring.self)
let value = Reference(Substring.self)
let pattern = Regex {
ChoiceOf {
Anchor.startOfLine // ^
Anchor.startOfSubject // \A
Anchor.firstMatchingPositionInSubject // \G
}
Regex {
ZeroOrMore { CharacterClass.whitespace }
Capture(as: key) {
OneOrMore(.reluctant) { CharacterClass.any }
}
ZeroOrMore { CharacterClass.whitespace }
":"
ZeroOrMore { CharacterClass.whitespace }
Capture(as: value) {
OneOrMore(.reluctant) { CharacterClass.any }
}
ZeroOrMore { CharacterClass.whitespace }
}
ChoiceOf {
"/"
Anchor.endOfLine
}
}
var dictionary:Dictionary<String, String> = [:]
let matches = inputString.matches(of: pattern)
for match in matches {
dictionary[String(match[key])] = String(match[value])
}
return dictionary
}
One can use bare slash syntax in the builder!
let pattern = Regex {
/(?:^|\A|\G)/
/(?:\s*(.+?)\s*:\s*(.+?)\s*)/
/(?:\/|$)/
}
Do watch the WWDC22 Meet RegEx talk. It’s hilarious and informative.
- https://developer.apple.com/documentation/swift/regex
- https://www.hackingwithswift.com/swift/5.7/regexes
- https://developer.apple.com/wwdc22/110358 (the beyond the basics one)
- https://github.com/val-verde/swift-experimental-string-processing/blob/49418d2db5d58822db0149c03e78013183c45eeb/Sources/_MatchingEngine/Regex/AST/Atom.swift#L220
Scanner example
A number of examples use the built-in Scanner class. Scanner works like an Encoder in that you give the Scanner the string to analyze and it holds it in its own memory.
- https://developer.apple.com/documentation/foundation/scanner
- https://www.swiftbysundell.com/articles/string-parsing-in-swift/
- https://nshipster.com/nsscanner/
- https://talk.objc.io/episodes/S01E13-parsing-techniques
static func parse_scanner(_ inputString:String) throws -> Dictionary<String,String> {
let scanner = Scanner(string: inputString)
var dictionary:Dictionary<String,String> = [:]
while !scanner.isAtEnd {
var key = scanner.scanUpToString(":")
key = key?.trimmingCharacters(in: .whitespacesAndNewlines)
let _ = scanner.scanCharacter()
var value = scanner.scanUpToString("/")
value = value?.trimmingCharacters(in: .whitespacesAndNewlines)
let _ = scanner.scanCharacter()
if let key, let value {
dictionary[key] = value
} else {
throw HousePlantError.notAKeyValuePair
}
}
return dictionary
}
Notice I’m still trimming the whitespace. The default charactersToBeSkipped settings of the scanner will make the Scanner ignore any whitespace outside of a scan. For this string that means it will skip any “leading” white space for our keys and values (at the start of the string, immediately after the “:” or immediately after “/”). Once the scanner considers itself mid-scan it will no longer ignore the characters (discussion). That leaves all the trailing whitespace to lop off (and no .trimSuffix to do it).
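A small sketch to make the skipping behaviour visible, using a padded input string made up for this illustration. It prints the raw and trimmed scans inside brackets so any surviving whitespace shows up. Based on the behaviour described above, I would expect the leading spaces to be gone and the trailing ones to remain in the raw values.

```swift
import Foundation

// Padded sample input, invented for this sketch.
let scanner = Scanner(string: "  commonName  : spider plant ")

// Skipped characters are only ignored before a scan starts...
let rawKey = scanner.scanUpToString(":")
// ...so whitespace met mid-scan stays in the result and still needs a trim.
let key = rawKey?.trimmingCharacters(in: .whitespacesAndNewlines)

_ = scanner.scanCharacter() // consume the ":"

// There is no "/" left, so this scan runs to the end of the string.
let rawValue = scanner.scanUpToString("/")
let value = rawValue?.trimmingCharacters(in: .whitespacesAndNewlines)

// Brackets make leftover whitespace easy to spot.
print("raw key: [\(rawKey ?? "nil")] trimmed: [\(key ?? "nil")]")
print("raw value: [\(rawValue ?? "nil")] trimmed: [\(value ?? "nil")]")
```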
Another Scanner example, this time adding the delimiters to the skipped characters:
static func parse_scanner2(_ inputString:String) throws -> Dictionary<String,String> {
var dictionary:Dictionary<String,String> = [:]
let itemDelimiter = CharacterSet(charactersIn: "/")
let keyValueDelimiter = CharacterSet(charactersIn: ":")
let kvDelimAndWhite = CharacterSet()
.union(keyValueDelimiter)
.union(.whitespacesAndNewlines)
let allDelimAndWhite = CharacterSet()
.union(itemDelimiter)
.union(keyValueDelimiter)
.union(.whitespacesAndNewlines)
let scanner = Scanner(string: inputString)
scanner.charactersToBeSkipped = allDelimAndWhite
while !scanner.isAtEnd {
var key = scanner.scanCharacters(from: kvDelimAndWhite.inverted)
var value = scanner.scanCharacters(from: itemDelimiter.inverted)
value = value?.trimmingCharacters(in: .whitespacesAndNewlines)
if let key, let value {
dictionary[key] = value
} else {
throw HousePlantError.notAKeyValuePair
}
}
return dictionary
}
In this example a key should never have whitespace so the Scanner can run through whitespace characters as part of the delimiter search. No more trimming for the key.
Scanner + Dynamic Sequence
I once saw an interesting demo of combining a Scanner with a Dynamic Sequence (SE-0094 Review Post), which works especially well if one has multiple items to dig through. We aren’t parsing multiple HousePlants yet, so I’ll put a different example in for reference.
func numbersInString(_ input:String) {
print("numbersInString")
let result = sequence(state: Scanner(string: input)) { scanner in
let string = scanner.scanCharacters(from: .decimalDigits.inverted )
print(string ?? "none found")
return scanner.scanInt()
}.map({ $0 })
print(result)
}
func numbersInSubString(_ input:String) {
print("numbersInSubString")
let result = sequence(state: Scanner(string: input)) { scanner in
scanner.scanUpToString("\n").map { subString in
print(subString)
return sequence(state: Scanner(string: subString)) { subScanner in
let string = subScanner.scanCharacters(from: .decimalDigits.inverted )
print(string ?? "none found")
return subScanner.scanInt()
}.map({ $0 })
}
}.map({ $0 })
print(result)
}
func testParse() {
numbersInString("hda23hfw78hdjila2889\nhda7991hfw12hdjila9\nhufie281sufvns0938dhqqj8837")
//[23, 78, 2889, 7991, 12, 9, 281, 938, 8837]
numbersInSubString("hda23hfw78hdjila2889\nhda7991hfw12hdjila9\nhufie281sufvns0938dhqqj8837")
//[[23, 78, 2889], [7991, 12, 9], [281, 938, 8837]]
}
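For reference, here is sequence(state:next:) on its own, without a Scanner, as a minimal sketch of the mechanism the functions above rely on. The closure receives the state as an inout value, returns the next element, and ends the sequence by returning nil (which is what scanInt() does above when it runs out of numbers).

```swift
// The state (an Int counter here) is carried from one call of the closure to the next.
let countdown = sequence(state: 3) { (remaining: inout Int) -> Int? in
    // Returning nil ends the sequence.
    guard remaining > 0 else { return nil }
    // Emit the current value, then decrement the state for the next call.
    defer { remaining -= 1 }
    return remaining
}

let values = Array(countdown)
print(values) // [3, 2, 1]
```

In the Scanner versions the state is the Scanner itself, so each call just picks up wherever the previous scan stopped.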
This style gives the Scanner a chance to jump straight to the end of the substring when getting a value, as in the String.split("/") examples. Apparently it still saves some time even though the key chars probably get scanned twice.
static func parse_subScanner(_ inputString:String) throws -> Dictionary<String,String> {
var dictionary:Dictionary<String,String> = [:]
let itemDelimiter = CharacterSet(charactersIn: "/")
let keyValueDelimiter = CharacterSet(charactersIn: ":")
let kvDelimAndWhite = CharacterSet()
.union(keyValueDelimiter)
.union(.whitespacesAndNewlines)
let allDelimAndWhite = CharacterSet()
.union(itemDelimiter)
.union(keyValueDelimiter)
.union(.whitespacesAndNewlines)
let topScanner = Scanner(string: inputString)
topScanner.charactersToBeSkipped = allDelimAndWhite
sequence(state: topScanner) { topScanner in
topScanner.scanCharacters(from: itemDelimiter.inverted).map { subString in
let subScanner = Scanner(string: subString)
subScanner.charactersToBeSkipped = allDelimAndWhite
var key = subScanner.scanCharacters(from: kvDelimAndWhite.inverted)
subScanner.charactersToBeSkipped = nil //turn off skipping!!
let _ = subScanner.scanCharacters(from: kvDelimAndWhite)
let startValue:String.Index = subScanner.currentIndex
let value:String? = String(subScanner.string[startValue...]) // substring(from:) is deprecated
if let key, let value {
dictionary[key] = value
}
}
}.map({ $0 })
return dictionary
}
Hand coded scanner
Last example, walking through a String, char by char, appending to the dictionary as we go.
- https://www.objc.io/blog/2019/02/05/a-scanner-alternative/
- https://talk.objc.io/episodes/S01E79-string-parsing-performance
- https://gist.github.com/milseman/f9b5528345db3a36bbdd138af52c5cda
static func parse_charScanner(_ inputString:String) -> Dictionary<String,String>{
var result: Dictionary<String,String> = [:]
var isKey = true
var currentKey = "".unicodeScalars
var currentValue = "".unicodeScalars
@inline(__always) func add() {
let key = String(currentKey).trimmingCharacters(in: .whitespacesAndNewlines)
let value = String(currentValue).trimmingCharacters(in: .whitespacesAndNewlines)
result[key] = value
}
@inline(__always) func flush() {
currentKey.removeAll()
currentValue.removeAll()
isKey = true
}
for c in inputString.unicodeScalars {
if isKey {
switch c {
case ":":
isKey = false
default:
currentKey.append(c)
}
} else {
switch c {
case "/":
add()
flush()
default:
currentValue.append(c)
}
}
}
add()
return result
}
I did try a few different variations, including using UnsafeBytes, and everything seemed slower or on par with this one.
Summary
I’ve got a handful of ways to get a Dictionary out of my “/:” formatted string. They do handle malformed Strings differently, but overall:
- The Scanner and the custom scanner are fastest.
- The raw string regex is the most succinct, but the slowest.
- The split-split with a maxSplits of one gets it done with the least rigamarole.
None of these let me handle items that won’t be a String yet. Next post… turning a String into a Date.