Wait, how do I scan text again?

March 27, 2024

This article is part of a series.

The encoder spun our data out into a thread using serialization. A decoder crochets that thread back into an object.

If the data coming into the decoder is highly structured and narrow in scope, it can actually be kind of relaxing to parse. In the Swift Talk decoder example (video (paywall) | github) the parsing ends up being pretty straightforward because their encoder only takes in a Route, and the decoder only produces a Route.

The general purpose mayhem of taking in any Codable that SimpleCoder performs is the whole point of an Encoder/Decoder pair (they are about the format, not the type), but it’s a lot to tackle in one bite. I’m not even using brackets to help calculate object relationships. Oof.

As a result, making the decoder will be a much slower job.

remember how to inspect strings in general
decide about Dates specifically
write a really really simple decoder to walk through the process
inspect other people’s decoders
write the string -> tree step for SimpleCoder
finish the full decoder (with round trip tests)

In this post

First on the list, I’m going to catalog all the ways I know how to scan strings for information.

String.builtinThing (0.000195)
Moving the index (0.000113)
Regex (\: 0.000399, builder: 0.000176)
Scanner (v1 0.0000439, v2 0.0000512)
Scanner + Dynamic Sequence (0.0000261)
Custom char scanner (0.0000246)

The number in parentheses is the average time to turn one example string into a Dictionary according to my XCTest setup. I’m focused on their relative order of magnitude.

    func testStaticParserJ() throws {
        measure {
            do {
                let dictJ = try HousePlantTest.parse_charScanner4(serializedHousePlant)
            } catch {
                fatalError()
            }
        }
    }

I’m going to delay talking about making a Parser Combinator or other more substantial parser class because that would be complete overkill for a “MyFirstDecoder” Decoder, even for me.

Using built-in String functions

Let’s imagine there is a HousePlant struct and an encoded example. As long as we can get that string into a Dictionary<String, String>, we can make a HousePlant out of it. (I’m going to punt on Date parsing for now.)

//Playground style
struct HousePlant:Codable {
    let commonName:String
    let whereAcquired:String?
    let dateAcquired:String
    let dateOfDeath:String?

    static func parse(_ inputString:String) throws -> Dictionary<String,String> {
        let fullKeyValueList = inputString.split(separator: "/")
        let result:Dictionary<String,String> = [:]
        return fullKeyValueList.reduce(into: result){ result, item in
            let kvSplit = item.split(separator: ":")
            let key = String(kvSplit[0]).trimmingCharacters(in: .whitespacesAndNewlines)
            let value = String(kvSplit[1]).trimmingCharacters(in: .whitespacesAndNewlines)
            result[key] = value
        }
    }
}

extension HousePlant {
    init?(_ inputString:String) {
        guard let dictionary = try? HousePlant.parse(inputString) else {
            return nil
        }
        print("made dictionary", dictionary)
        if let cN = dictionary["commonName"] {
            print("commonName")
            self.commonName = cN
        } else {
            return nil
        }
        if let dA = dictionary["dateAcquired"] {
            print("dateAcquired")
            self.dateAcquired = dA
        } else {
            return nil
        }
        self.whereAcquired = dictionary["whereAcquired"]
        self.dateOfDeath = dictionary["dateOfDeath"]
    }
}
let serializedHousePlant = "commonName:spider plant/whereAcquired:Trader Joe's/dateAcquired: 2024-03-12"
let housePlant = HousePlant(serializedHousePlant)
print(housePlant ?? "no plant")

String and StringProtocol have many built-in functions to do basic scanning. In this case a split and a trim were good enough.
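For reference, here is a minimal sketch of a few other built-in calls that can do this kind of basic scanning. The sample string and variable names are made up for illustration; the calls themselves come from the standard library and Foundation.

```swift
import Foundation

let item = "whereAcquired: Trader Joe's "

// Prefix check from the standard library.
let isWhereAcquired = item.hasPrefix("whereAcquired")

// Find the delimiter, then slice on either side of it.
var key = ""
var value = ""
if let colon = item.firstIndex(of: ":") {
    key = String(item[..<colon])
    value = String(item[item.index(after: colon)...])
        .trimmingCharacters(in: .whitespaces) // Foundation
}

// components(separatedBy:) keeps empty pieces;
// split(separator:) drops them by default.
let withEmpties = "a//b".components(separatedBy: "/")
let withoutEmpties = "a//b".split(separator: "/")

print(isWhereAcquired, key, value, withEmpties.count, withoutEmpties.count)
```

The difference in how empty pieces are handled matters once a format allows empty values.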

Move the index

The current parser works, but it could get into trouble. Imagine our starting text was instead

"commonName: spider plant /whereAcquired :Trader Joe's/dateAcquired: 2024-03-12 16:12:07"

The value contains a delimiter.

One way to accommodate strings with the delimiter character in them is to just stop at the first occurrence of the delimiter. The quick and dirty way to do that would be to update the current code to .split(separator: ":", maxSplits: 1).
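A small sketch of the difference, using only the date item from the new test string (the variable names are illustrative):

```swift
let item = "dateAcquired: 2024-03-12 16:12:07"

// Without a limit every colon splits, so index 1 holds only part of the value.
let unlimited = item.split(separator: ":")
// With maxSplits: 1, everything after the first colon stays together.
let limited = item.split(separator: ":", maxSplits: 1)

print(unlimited.count) // 4
print(limited.count)   // 2
print(limited[1])      // " 2024-03-12 16:12:07" (still needs trimming)
```

Reading index 1 still assumes the item contains at least one colon, so an item with no colon would crash on that subscript.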

Alternatively we can switch to moving through the string with a String.Index. One advantage is that there’s no longer any risk of an index-out-of-range crash.

    static func parse_stringIndex(_ inputString:String) throws -> Dictionary<String,String> {
        let fullKeyValueList = inputString.split(separator: "/")
        let result:Dictionary<String,String> = [:]
        return try fullKeyValueList.reduce(into: result){ result, item in
            if let firstColonIndex = item.firstIndex(of: ":") {
                let key = item
                    .prefix(upTo: firstColonIndex)
                    .trimmingCharacters(in: .whitespacesAndNewlines)
                let value = item
                    .suffix(from: item.index(firstColonIndex, offsetBy: 1))
                    .trimmingCharacters(in: .whitespacesAndNewlines)
                result[String(key)] = String(value)
            } else {
                throw HousePlantError.notAKeyValuePair
            }
        }
    }
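The function above throws HousePlantError.notAKeyValuePair, whose definition is not shown in this post. A minimal sketch of what it might look like follows, together with a standalone helper that isolates the "find the first colon, slice on either side" step. Both the enum body and the helper name splitAtFirstColon are assumptions for illustration.

```swift
// Hypothetical definition; the post only shows the case being thrown.
enum HousePlantError: Error {
    case notAKeyValuePair
}

// Splits one "key:value" item at its first colon, throwing if there is none.
func splitAtFirstColon(_ item: Substring) throws -> (key: Substring, value: Substring) {
    guard let colon = item.firstIndex(of: ":") else {
        throw HousePlantError.notAKeyValuePair
    }
    // index(after:) may equal endIndex; slicing from endIndex yields an empty value.
    return (item[..<colon], item[item.index(after: colon)...])
}

let good = try? splitAtFirstColon("dateAcquired:2024-03-12 16:12:07")
let bad = try? splitAtFirstColon("no delimiter here")
let empty = try? splitAtFirstColon("trailing:")

print(good?.value ?? "nil", bad == nil, empty?.value.isEmpty ?? false)
```

A missing colon becomes a thrown error rather than a crash, and a colon in the last position produces an empty value.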

RegEx example

I love regular expressions, although some do not. Understanding RegEx as a DSL for writing lexical analysis state machines improved how I use them. I’ve documented using them in Swift before to detect the header of a USD file (RegExBuilder version, back tracking to a literal)

SIDEBAR: To see the pros detect a header look at the Package Manager ToolsVersionParser. It’s a 685 line file.

The non-regex version works fine for this, so I don’t think I’d actually bother in a real project.

I typically start making a regex by opening https://regex101.com. The scanner will need to capture a repeated group. The examples below were created with the /gm flags and the original test string.

Here’s the evolution process:

([^\/:]+) : works. simple. Tada. (one or more “not / or :” greedy.)(every item, could interleave.)
(?:(?:\A|\/)([^\/]+)(?:$|\/)) : Explicit “between /”, consumes the /. Doesn’t grab middle groups.
(?<=\/|^)(.*?)(?=\/|$) : lookahead and lookbehind version; switch to .*? (non-greedy anything until the delimiter)
(?<=\/|^)(?:(.*?):(.*?))(?=\/|$) : splits
(?:^|\A|\G)(.*?)(?:\/|$) : remove lookbehind, no split
(?:^|\A|\G)(?:(.+?):(.+?))(?:\/|$) : return the split
(?:^|\A|\G)(?:\s*(.+?)\s*:\s*(.+?)\s*)(?:\/|$) : trim white space
(?:^|\A|\G)(?:\s*(?<key>.+?)\s*:\s*(?<value>.+?)\s*)(?:\/|$) : add names

The final regex actually works for the new test string with the extra colon in it, since the non-greedy indicator (the ? in (.+?)) will stop at the first : match.

static func parse_regex(_ inputString:String) throws -> Dictionary<String,String>{
    let pattern = /(?:^|\A|\G)(?:\s*(?<key>.+?)\s*:\s*(?<value>.+?)\s*)(?:\/|$)/
    var dictionary:Dictionary<String, String> = [:]
    let matches = inputString.matches(of: pattern)
    for match in matches {
        dictionary[String(match.output.key)] = String(match.output.value)
    }
    return dictionary
}

NOTE: to get this to run in a Swift Package (swift-tools-version: 5.10) with the /regex/ notation add the following to the Package.swift file.

let swiftSettings: [SwiftSetting] = [
    .enableUpcomingFeature("BareSlashRegexLiterals"),
]

for target in package.targets {
    target.swiftSettings = target.swiftSettings ?? []
    target.swiftSettings?.append(contentsOf: swiftSettings)
}
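If changing Package.swift is not an option, the extended delimiter form #/.../# compiles without the upcoming-feature flag. Below is a hedged sketch using a simplified pattern, not the one developed above: it leaves out the anchors and restricts keys to "not / or :" and values to "not /". The variable names are illustrative.

```swift
// Extended delimiters (#/.../#) do not need BareSlashRegexLiterals,
// and "/" does not need escaping inside them.
// Simplified pattern: keys exclude "/" and ":", values exclude "/".
let pattern = #/\s*(?<key>[^/:]+?)\s*:\s*(?<value>[^/]+?)\s*(?:/|$)/#

let input = "commonName: spider plant /whereAcquired :Trader Joe's/dateAcquired: 2024-03-12 16:12:07"

var dictionary: [String: String] = [:]
for match in input.matches(of: pattern) {
    dictionary[String(match.output.key)] = String(match.output.value)
}
print(dictionary)
```

Because the value class only excludes "/", the extra colons in the timestamp stay inside the value.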

The same regex done with the RegEx Builder would look something like (generator to get started):

static func parse_regexLong(_ inputString:String) -> Dictionary<String,String>{
    let key = Reference(Substring.self)
    let value = Reference(Substring.self)
    let pattern = Regex {
        ChoiceOf {
            Anchor.startOfLine // ^
            Anchor.startOfSubject // \A
            Anchor.firstMatchingPositionInSubject // \G
        }
        Regex {
            ZeroOrMore { CharacterClass.whitespace }
            Capture(as: key) {
                OneOrMore(.reluctant) { CharacterClass.any }
            }
            ZeroOrMore { CharacterClass.whitespace }
            ":"
            ZeroOrMore { CharacterClass.whitespace }
            Capture(as: value) {
                OneOrMore(.reluctant) { CharacterClass.any }
            }
            ZeroOrMore { CharacterClass.whitespace }
        }
        ChoiceOf {
            "/"
            Anchor.endOfLine
        }
    }
    var dictionary:Dictionary<String, String> = [:]
    let matches = inputString.matches(of: pattern)
    for match in matches {
        dictionary[String(match[key])] = String(match[value])
    }
    return dictionary
}

One can use bare slash syntax in the builder!


let pattern = Regex {
    /(?:^|\A|\G)/
    /(?:\s*(.+?)\s*:\s*(.+?)\s*)/
    /(?:\/|$)/
}

Do watch the WWDC22 Meet RegEx talk. It’s hilarious and informative.

Scanner example

A number of examples use the built-in Scanner class. Scanner works like an Encoder in that you give the Scanner the string to analyze and it holds it in its own memory.

static func parse_scanner(_ inputString:String) throws -> Dictionary<String,String> {
    let scanner = Scanner(string: inputString)
    var dictionary:Dictionary<String,String> = [:]
    while !scanner.isAtEnd {
        var key = scanner.scanUpToString(":")
        key = key?.trimmingCharacters(in: .whitespacesAndNewlines)
        let _  = scanner.scanCharacter()
        var value = scanner.scanUpToString("/")
        value = value?.trimmingCharacters(in: .whitespacesAndNewlines)
        let _  = scanner.scanCharacter()
        if let key, let value {
            dictionary[key] = value
        } else {
            throw HousePlantError.notAKeyValuePair
        }
    }
    return dictionary
}

Notice I’m still trimming the whitespace. The default charactersToBeSkipped settings of the scanner will make the Scanner ignore any whitespace outside of a scan. For this string that means it will skip any “leading” whitespace for our keys and values (at the start of the string, immediately after the “:” or immediately after “/”). Once the scanner considers itself mid-scan it will no longer ignore the characters (discussion). That leaves all the trailing whitespace to lop off (and no .trimSuffix to do it).
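A minimal sketch of that skipping behavior, using a made-up string padded with extra spaces. The square brackets in the printed output are only there to make any remaining whitespace visible.

```swift
import Foundation

let scanner = Scanner(string: "  commonName  :  spider plant  ")
// charactersToBeSkipped defaults to whitespace and newlines.

let key = scanner.scanUpToString(":")   // leading spaces skipped, trailing spaces kept
_ = scanner.scanCharacter()             // consume the ":"
let value = scanner.scanUpToString("/") // no "/" remains, so this runs to the end

print("[\(key ?? "nil")]")
print("[\(value ?? "nil")]")
```

Both results start at the first non-whitespace character but keep their trailing spaces, which is why the trimming calls stay in the parser.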

Another Scanner example, this time adding the delimiters to the skipped characters:

static func parse_scanner2(_ inputString:String) throws -> Dictionary<String,String> {
    var dictionary:Dictionary<String,String> = [:]
    
    let itemDelimiter = CharacterSet(charactersIn: "/")
    let keyValueDelimiter = CharacterSet(charactersIn: ":")
    let kvDelimAndWhite = CharacterSet()
        .union(keyValueDelimiter)
        .union(.whitespacesAndNewlines)
    let allDelimAndWhite = CharacterSet()
        .union(itemDelimiter)
        .union(keyValueDelimiter)
        .union(.whitespacesAndNewlines)
    
    let scanner = Scanner(string: inputString)
    scanner.charactersToBeSkipped = allDelimAndWhite

    while !scanner.isAtEnd {
        let key = scanner.scanCharacters(from: kvDelimAndWhite.inverted)
        var value = scanner.scanCharacters(from: itemDelimiter.inverted)
        value = value?.trimmingCharacters(in: .whitespacesAndNewlines)
        
        if let key, let value {
            dictionary[key] = value
        } else {
            throw HousePlantError.notAKeyValuePair
        }
    }
    return dictionary
}

In this example a key should never have whitespace so the Scanner can run through whitespace characters as part of the delimiter search. No more trimming for the key.

Scanner + Dynamic Sequence

I once saw an interesting demo of combining a Scanner with a Dynamic Sequence (SE-0094 Review Post), which works especially well if one had multiple items to dig through. We aren’t parsing multiple HousePlants yet, so I’ll put a different example in for reference.

    func numbersInString(_ input:String) {
        print("numbersInString")
        let result = sequence(state: Scanner(string: input)) { scanner in
            let string = scanner.scanCharacters(from: .decimalDigits.inverted )
            print(string ?? "none found")
            return scanner.scanInt()
        }.map({ $0 })
        print(result)
    }
    
    func numbersInSubString(_ input:String) {
        print("numbersInSubString")
        let result = sequence(state: Scanner(string: input)) { scanner in
            scanner.scanUpToString("\n").map { subString in
                print(subString)
                return sequence(state: Scanner(string: subString)) { subScanner in
                    let string = subScanner.scanCharacters(from: .decimalDigits.inverted )
                    print(string ?? "none found")
                    return subScanner.scanInt()
                }.map({ $0 })
            }
        }.map({ $0 })
        print(result)
    }

    func testParse() {
        numbersInString("hda23hfw78hdjila2889\nhda7991hfw12hdjila9\nhufie281sufvns0938dhqqj8837")
        //[23, 78, 2889, 7991, 12, 9, 281, 938, 8837]

        numbersInSubString("hda23hfw78hdjila2889\nhda7991hfw12hdjila9\nhufie281sufvns0938dhqqj8837")
        //[[23, 78, 2889], [7991, 12, 9], [281, 938, 8837]]
    }

This style gives the Scanner a chance to jump straight to the end of the substring when getting a value, as in the String.split("/") examples. Apparently it still saves some time even though the key chars probably get scanned twice.

static func parse_subScanner(_ inputString:String) throws -> Dictionary<String,String> {
    var dictionary:Dictionary<String,String> = [:]
    
    let itemDelimiter = CharacterSet(charactersIn: "/")
    let keyValueDelimiter = CharacterSet(charactersIn: ":")
    let kvDelimAndWhite = CharacterSet()
        .union(keyValueDelimiter)
        .union(.whitespacesAndNewlines)
    let allDelimAndWhite = CharacterSet()
        .union(itemDelimiter)
        .union(keyValueDelimiter)
        .union(.whitespacesAndNewlines)
    
    let topScanner = Scanner(string: inputString)
    topScanner.charactersToBeSkipped = allDelimAndWhite
    
    // The sequence is lazy, so map forces the scan; the mapped result itself is unused.
    _ = sequence(state: topScanner) { topScanner in
        topScanner.scanCharacters(from: itemDelimiter.inverted).map { subString in
            let subScanner = Scanner(string: subString)
            subScanner.charactersToBeSkipped = allDelimAndWhite
            let key = subScanner.scanCharacters(from: kvDelimAndWhite.inverted)
            subScanner.charactersToBeSkipped = nil //turn off skipping!! 
            let _ = subScanner.scanCharacters(from: kvDelimAndWhite)
            let startValue:String.Index = subScanner.currentIndex
            // Slice from the current index; substring(from:) is deprecated.
            let value:String? = String(subScanner.string[startValue...])
            if let key, let value {
                dictionary[key] = value
            }
        }
    }.map({ $0 })
    return dictionary
}

Hand coded scanner

Last example, walking through a String, char by char, appending to the dictionary as we go.

static func parse_charScanner(_ inputString:String) -> Dictionary<String,String>{
        var result: Dictionary<String,String> = [:]
        var isKey = true
        
        var currentKey = "".unicodeScalars
        var currentValue = "".unicodeScalars
        
        @inline(__always) func add() {
            let key = String(currentKey).trimmingCharacters(in: .whitespacesAndNewlines)
            let value = String(currentValue).trimmingCharacters(in: .whitespacesAndNewlines)
            result[key] = value
        }
        
        @inline(__always) func flush() {
            currentKey.removeAll()
            currentValue.removeAll()
            isKey = true
        }
        
        for c in inputString.unicodeScalars {
            if isKey {
                switch c {
                case ":":
                    isKey = false
                default:
                    currentKey.append(c)
                }
            } else {
                switch c {
                case "/":
                    add()
                    flush()
                default:
                    currentValue.append(c)
                }
            }
        }
        add()
        return result
    }

I did try a few different variations, including using UnsafeBytes, and everything seemed slower or on par with this one.
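For illustration only, here is a sketch of what a byte-oriented variation could look like. It is not one of the variations measured for this post, and no claim is made about its speed. The function name parseUTF8 is hypothetical. It walks the string's UTF-8 view, which is safe for this format because ":" and "/" are ASCII and never appear inside a multi-byte sequence.

```swift
import Foundation

// Hypothetical variation: scan UTF-8 code units instead of unicode scalars.
func parseUTF8(_ input: String) -> [String: String] {
    var result: [String: String] = [:]
    var key: [UInt8] = []
    var value: [UInt8] = []
    var isKey = true

    // Stores the pending pair (trimmed) and resets the buffers.
    func flush() {
        let k = String(decoding: key, as: UTF8.self)
            .trimmingCharacters(in: .whitespacesAndNewlines)
        let v = String(decoding: value, as: UTF8.self)
            .trimmingCharacters(in: .whitespacesAndNewlines)
        if !k.isEmpty { result[k] = v } // skip empty keys, e.g. after a trailing "/"
        key.removeAll(keepingCapacity: true)
        value.removeAll(keepingCapacity: true)
        isKey = true
    }

    for byte in input.utf8 {
        if isKey && byte == UInt8(ascii: ":") {
            isKey = false        // the first colon ends the key
        } else if !isKey && byte == UInt8(ascii: "/") {
            flush()              // a slash ends the value
        } else if isKey {
            key.append(byte)
        } else {
            value.append(byte)   // later colons stay in the value
        }
    }
    flush()
    return result
}

let plant = parseUTF8("commonName: spider plant /dateAcquired: 2024-03-12 16:12:07")
print(plant)
```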

Summary

I’ve got a handful of ways to get a Dictionary out of my “/:” formatted string. They do handle malformed Strings differently, but overall:

The Scanner and the custom scanner are fastest.
The raw string regex is the most succinct, but the slowest.
The split-split with a maxSplits of 1 gets it done with the least rigamarole.

None of these let me handle items that won’t be a String yet. Next post… turning a String into a Date.