Personal Coding Challenge: Data Validation, Correction, and Default Handling

Say you have some JSON data like this (I’ll be using Go on purpose here):

jsonString := `{ "number": "1600", "street_name": "Pennsylvania", "street_type": "Avenue" }`

The challenge here is to build a machine that does four things:

  1. Validate that the JSON is valid syntax.
  2. Validate that the JSON is valid against a schema.
  3. If there is a problem with one part of the data, first attempt to fix it. For example, the value for "number" is the string "1600". That’s a basic mistake, so try to coerce it into the int 1600.
  4. If that part of the bad data cannot be coerced, or if it’s still invalid after coercion, fall back to a default value.

1) Validation is fairly easy.

The Go standard library can do it with json.Valid.

    jsonString := `{ "number": "1600", "street_name": "Pennsylvania", "street_type": "Avenue" }`

    isValid := json.Valid([]byte(jsonString))
    if isValid {
        fmt.Println("JSON data is valid in most basic sense.")
    } else {
        fmt.Println("ERROR! String is not valid JSON.")
    }
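
json.Valid only answers yes or no. If it would help to know where the syntax broke, one option is to unmarshal into an empty interface and inspect the error. This is a minimal sketch using only the standard library; the helper name syntaxErrorOffset is made up for this example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// syntaxErrorOffset reports whether data is syntactically valid JSON.
// When it is not, it also returns the byte offset reported by the decoder,
// or -1 if the failure was not a syntax error.
func syntaxErrorOffset(data []byte) (offset int64, valid bool) {
	var v any
	err := json.Unmarshal(data, &v)
	if err == nil {
		return 0, true
	}
	var syntaxErr *json.SyntaxError
	if errors.As(err, &syntaxErr) {
		return syntaxErr.Offset, false
	}
	return -1, false
}

func main() {
	for _, s := range []string{
		`{ "number": "1600", "street_name": "Pennsylvania" }`,
		`{ "number": "1600", "street_name": }`,
	} {
		if offset, ok := syntaxErrorOffset([]byte(s)); ok {
			fmt.Println("valid JSON")
		} else {
			fmt.Printf("invalid JSON near byte %d\n", offset)
		}
	}
}
```

Decoding into an empty interface accepts any well-formed JSON, so the only thing being checked here is syntax, same as json.Valid.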

2) Validation against schema can be done with a lib.

I don’t know if this is the only option, but gojsonschema works.

    jsonString := `{ "number": "1600", "street_name": "Pennsylvania", "street_type": "Avenue" }`

    schema := gojsonschema.NewStringLoader(`{
        "type": "object",
        "properties": {
            "number": { "type": "number" },
            "street_name": { "type": "string" },
            "street_type": { "enum": ["Street", "Avenue", "Boulevard"] }
        }
    }`)
    // Named document rather than json so it doesn't shadow the encoding/json package.
    document := gojsonschema.NewStringLoader(jsonString)

    result, err := gojsonschema.Validate(schema, document)
    if err != nil {
        panic(err.Error())
    }

    if result.Valid() {
        fmt.Printf("The document is valid\n")
    } else {
        fmt.Printf("The document is not valid. see errors :\n")
        for _, desc := range result.Errors() {
            fmt.Printf("- %s\n", desc)
        }
    }

3) Fixing theoretically easy-to-fix data

Here’s that original JSON again:

jsonString := `{ "number": "1600", "street_name": "Pennsylvania", "street_type": "Avenue" }`

And note in the schema, we’re expecting a number: "number": { "type": "number" }. If we ran the validation in Step 2 above, we’d get:

> make run
go build -o main .
./main
The document is not valid. see errors :
- number: Invalid type. Expected: number, given: string

So that’s true for schema validation, but it would also be true if we tried to json.Unmarshal the data into a struct with strict types (which we definitely do). So in addition to the schema, we have a type which is also kind of a schema.

    type address struct {
        Number     int    `json:"number"`
        StreetName string `json:"street_name"`
        StreetType string `json:"street_type"`
    }

If we tried to parse the JSON now, we’d get a similar error to the schema checking:

    var add address
    err := json.Unmarshal([]byte(jsonString), &add)
    if err != nil {
        fmt.Println(err)
    }

> make run
go build -o main .
./main
json: cannot unmarshal string into Go struct field address.number of type int
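
That error is more useful than it looks. For a type mismatch, encoding/json returns a *json.UnmarshalTypeError, which names the offending field, and the package documentation says Unmarshal skips the bad field and finishes decoding the rest as best it can. Here is a small sketch of reading that information; the helper decodeAddress is a name invented for this example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

type address struct {
	Number     int    `json:"number"`
	StreetName string `json:"street_name"`
	StreetType string `json:"street_type"`
}

// decodeAddress unmarshals data and, if the failure was a type mismatch,
// also returns the JSON key that caused it.
func decodeAddress(data []byte) (add address, badField string, err error) {
	err = json.Unmarshal(data, &add)
	var typeErr *json.UnmarshalTypeError
	if errors.As(err, &typeErr) {
		badField = typeErr.Field
	}
	return add, badField, err
}

func main() {
	jsonString := `{ "number": "1600", "street_name": "Pennsylvania", "street_type": "Avenue" }`

	add, badField, err := decodeAddress([]byte(jsonString))
	if err != nil {
		fmt.Printf("problem field: %q (%v)\n", badField, err)
	}
	// The fields that decoded cleanly are still populated;
	// the bad one is left at its zero value.
	fmt.Printf("%+v\n", add)
}
```

Only the earliest type error is reported, so this identifies one bad field per pass. It shows which value needs fixing but does not fix anything by itself.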

The hope is that there is a way to run some kind of callback function to try to coerce the problematic bit of data into something that is valid. So "1600" is so obviously just incorrectly a string, the callback would force it into an int and all would be well.

This is where I’m kinda stuck, and will update this post when it’s figured out.

  • Can fastjson help? Its README says something about error handling but I don’t see how.
  • Can mapstructure help? It says “This library is most useful when decoding values from some data stream (JSON, Gob, etc.) where you don’t quite know the structure of the underlying data until you read a part of it.” which seems like a good lead.
  • Can validator help?

Update: You can provide Unmarshaling instructions for custom types

This article was very helpful.

Rather than an int like you want it, call it something else:

    type FlexInt int

    type FlexAddress struct {
        Number     FlexInt `json:"number"`
        StreetName string  `json:"street_name"`
        StreetType string  `json:"street_type"`
    }

Now as Marko Mikulicic says:

All you have to do is implement the json.Unmarshaler interface.

So:

    func (fi *FlexInt) UnmarshalJSON(b []byte) error {
        // Not a quoted value, so let encoding/json decode it as a plain int.
        if b[0] != '"' {
            return json.Unmarshal(b, (*int)(fi))
        }
        // Quoted value: decode the string first, then convert it.
        var s string
        if err := json.Unmarshal(b, &s); err != nil {
            return err
        }
        i, err := strconv.Atoi(s)
        if err != nil {
            return err
        }
        *fi = FlexInt(i)
        return nil
    }

This is awfully clever I think.

  1. It checks the first byte of the raw value. If it’s not a double-quote mark (which a JSON string must start with), it assumes the value is a number and unmarshals it straight into the int.
  2. Otherwise it unmarshals the value as a string, returning the error if that fails.
  3. Then it tries to convert that string into an int with strconv.Atoi, and if that works, great.
  4. If the conversion fails, the error is returned (could always try accounting for more situations).

So this handles trying to fix decently-easy-to-fix JSON type errors, and also gives an opportunity to just return some kind of default value if every attempt at fixing it fails.
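
To make the "return some kind of default value" part concrete, here is a hedged sketch of a variation on the same idea. The type name LenientInt, the constant defaultNumber, and the helper parseNumber are all assumptions made up for this example, not part of the article referenced above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// defaultNumber is a hypothetical fallback used only in this sketch.
const defaultNumber = 1000

// LenientInt accepts a JSON number or a numeric string. Anything else
// becomes defaultNumber instead of an error.
type LenientInt int

func (li *LenientInt) UnmarshalJSON(b []byte) error {
	// First attempt: the value is already a JSON integer.
	var n int
	if err := json.Unmarshal(b, &n); err == nil {
		*li = LenientInt(n)
		return nil
	}
	// Second attempt: the value is a string that holds an integer.
	var s string
	if err := json.Unmarshal(b, &s); err == nil {
		if n, err := strconv.Atoi(strings.TrimSpace(s)); err == nil {
			*li = LenientInt(n)
			return nil
		}
	}
	// Nothing worked, so fall back rather than fail the whole document.
	*li = defaultNumber
	return nil
}

// parseNumber decodes the "number" key of a small JSON document.
func parseNumber(doc string) (LenientInt, error) {
	var v struct {
		Number LenientInt `json:"number"`
	}
	err := json.Unmarshal([]byte(doc), &v)
	return v.Number, err
}

func main() {
	for _, doc := range []string{
		`{ "number": 1600 }`,
		`{ "number": "1600" }`,
		`{ "number": "16OO" }`,
	} {
		n, err := parseNumber(doc)
		fmt.Println(n, err)
	}
}
```

One caveat: UnmarshalJSON is only called when the key is present, so a document with no "number" at all leaves the field at zero rather than at the default. Swallowing the error also hides the fact that a correction happened, so a real version would probably want to log or collect those.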

4) Default / Fallback Values

One issue here is where to put the fallback values. If we know we’re exclusively dealing with JSON data, it seems like the JSON schema would be the place. That can look like:

    {
        "type": "object",
        "properties": {
            "number": { "type": "number", "default": 1000 },
            "street_name": { "type": "string" },
            "street_type": { "enum": ["Street", "Avenue", "Boulevard"] }
        }
    }

But the trouble here is that by the time we’re parsing/unmarshaling the data, we’re in Go, so we’d have to somehow go back to the schema, parse it, and pluck the default out to use. Just seems weird.

Maybe we’ll have to do Step 3, then if we find the data to be unfixable, remove it, then run it back through a JSON schema situation where it puts default values back into the parsed data. Again something I don’t really know how to do, but seems plausible. Plus it does double duty. I would think this machine would optionally be able to put in default values. Not always, sometimes missing fields are better, but it could put in defaults on command.
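
Plucking defaults out of the schema may be less awkward than it sounds, since the schema is itself just JSON. Below is a rough sketch using only the standard library. It reads the top-level "properties" and nothing else (no nesting, no $ref), and the names schemaDoc, addressSchema, and applyDefaults are invented for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// addressSchema is the schema from above, including the default for "number".
const addressSchema = `{
	"type": "object",
	"properties": {
		"number": { "type": "number", "default": 1000 },
		"street_name": { "type": "string" },
		"street_type": { "enum": ["Street", "Avenue", "Boulevard"] }
	}
}`

// schemaDoc models only the parts of a JSON Schema that this sketch reads.
type schemaDoc struct {
	Properties map[string]struct {
		Default any `json:"default"`
	} `json:"properties"`
}

// applyDefaults fills in every key that is missing from doc and has a
// non-null "default" in the schema. It does not validate anything.
func applyDefaults(schemaJSON string, doc map[string]any) error {
	var schema schemaDoc
	if err := json.Unmarshal([]byte(schemaJSON), &schema); err != nil {
		return err
	}
	for name, prop := range schema.Properties {
		if _, present := doc[name]; !present && prop.Default != nil {
			doc[name] = prop.Default
		}
	}
	return nil
}

func main() {
	// Pretend Step 3 already ran and "number" was removed as unfixable.
	doc := map[string]any{
		"street_name": "Pennsylvania",
		"street_type": "Avenue",
	}
	if err := applyDefaults(addressSchema, doc); err != nil {
		panic(err)
	}
	fmt.Println(doc)
}
```

Because the document is a generic map here, the default arrives as a float64, the way encoding/json decodes any JSON number into an interface. The map would still need to be converted into the address struct afterwards.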

Bonus: Not just JSON

I think JSON is the primary use case here, but not all data is passed around as JSON. Perhaps this machine could do the same kind of thing for data that is already in Go. Step 1 becomes irrelevant (Go code will just choke on invalid syntax), but the rest still matter. Can a struct have a schema with allowed values? So not just an int but an int with a min and max? Not just a string but a string with a valid set of enum values? Seems like that should be no huge problem. Can a struct with a value outside what the schema allows be fixed or reverted to a default value? Hopefully?
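
Go has no built-in way to attach ranges or enums to a struct field, but plain code can do the job. This is a minimal sketch, assuming made-up limits and defaults (minNumber, maxNumber, defaultNumber, defaultStreetType) and a hypothetical normalize method.

```go
package main

import "fmt"

type address struct {
	Number     int
	StreetName string
	StreetType string
}

// Hypothetical rules and fallbacks, chosen only for this sketch.
const (
	minNumber         = 1
	maxNumber         = 99999
	defaultNumber     = 1000
	defaultStreetType = "Street"
)

var streetTypes = map[string]bool{
	"Street":    true,
	"Avenue":    true,
	"Boulevard": true,
}

// normalize replaces out-of-range or unknown values with their defaults
// and returns the names of the fields it changed.
func (a *address) normalize() []string {
	var changed []string
	if a.Number < minNumber || a.Number > maxNumber {
		a.Number = defaultNumber
		changed = append(changed, "Number")
	}
	if !streetTypes[a.StreetType] {
		a.StreetType = defaultStreetType
		changed = append(changed, "StreetType")
	}
	return changed
}

func main() {
	a := address{Number: -5, StreetName: "Pennsylvania", StreetType: "Lane"}
	changed := a.normalize()
	fmt.Printf("%+v (changed: %v)\n", a, changed)
}
```

Returning the list of changed fields keeps the correction visible to the caller, which matters if defaults should only be applied on command.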

In this case, wouldn’t it make more sense to put the default values in the type definition rather than a JSON schema, so like this NOT REAL code:

    type address struct {
        // The default tag is made up: nothing in the standard library reads it.
        Number     int    `json:"number" default:"1000"`
        StreetName string `json:"street_name"`
        StreetType string `json:"street_type"`
    }

Then if you need JSON schema also, you could generate it from this type? I’m already out of my depth here and this is doubly so, but also seems possible.
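
Struct tags are just strings, so the standard library ignores a default tag, but the reflect package can read one. Here is a sketch of that idea, assuming the conventional key:"value" tag form and a hypothetical helper called applyTagDefaults that handles only int and string fields.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

type address struct {
	Number     int    `json:"number" default:"1000"`
	StreetName string `json:"street_name"`
	StreetType string `json:"street_type" default:"Street"`
}

// applyTagDefaults sets zero-valued int and string fields of the struct
// that ptr points to from their `default` struct tag. Other kinds are ignored.
func applyTagDefaults(ptr any) error {
	v := reflect.ValueOf(ptr)
	if v.Kind() != reflect.Pointer || v.Elem().Kind() != reflect.Struct {
		return fmt.Errorf("applyTagDefaults: want pointer to struct, got %T", ptr)
	}
	v = v.Elem()
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		def, ok := t.Field(i).Tag.Lookup("default")
		field := v.Field(i)
		if !ok || !field.CanSet() || !field.IsZero() {
			continue
		}
		switch field.Kind() {
		case reflect.Int:
			n, err := strconv.Atoi(def)
			if err != nil {
				return fmt.Errorf("field %s: bad default %q: %w", t.Field(i).Name, def, err)
			}
			field.SetInt(int64(n))
		case reflect.String:
			field.SetString(def)
		}
	}
	return nil
}

func main() {
	a := address{StreetName: "Pennsylvania"}
	if err := applyTagDefaults(&a); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", a)
}
```

The weak spot is that a zero value and a missing value look identical, so a legitimate 0 would be overwritten. Pointer fields, or a record of which keys were present, would be needed to tell them apart.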

2 responses to “Personal Coding Challenge: Data Validation, Correction, and Default Handling”

  1. Curtis Wilcox says:

    “1600” is so obviously just incorrectly a string

    To a human it is (though a human could mistake “16OO”). I guess if the expected type is Int, you could have a routine that checks each character in the string to see if it matches the ASCII digits then cast it to an int if they do.

    Regarding the bonus question, I don’t know Go at all but my hunch is structs are intentionally too primitive to contain something like a valid range for a typed value unless it’s as “NumberMin” and “NumberMax” ints and you write an .isValid() method that uses such values, if present.

    If this isn’t hypothetical, the address number should probably remain a string, otherwise you’re in “Falsehoods programmers believe about addresses” territory.

    • Chris Coyier says:

      I guess if the expected type is Int, you could have a routine that checks each character in the string to see if it matches the ASCII digits then cast it to an int if they do.

      Yep, looks like that’s part of the solution! This is helping me:
      https://docs.bitnami.com/tutorials/dealing-with-json-with-non-homogeneous-types-in-go/

      If this isn’t hypothetical, the address number should probably remain a string

      It is totally hypothetical. I know what you mean about addresses. My goal here is error correction and defaults for data generally and this was just an easy example.
