Setting the Earlier, Later, Blanks, Zeroes and Transition Parameters
There are four inherently ambiguous situations in the data where a feature value could be variously interpreted as unknown, absent, zero or perhaps both absent and zero. In these situations you need to tell OptiPath what to do by setting the feature parameters Earlier, Later, Blanks and Zeroes appropriately. In the first two cases you must let OptiPath know how artifacts preceding or following those in your sample should be treated. For each feature you can indicate whether earlier or later artifacts should be considered as unknown quantities or if their values should be assumed to be zero or absent or both.
Also inherently ambiguous in the data are blanks and zeroes. Some people use blanks as shorthand for values of 0, particularly in occurrence seriation, where the values would normally be 0 and 1: for simplicity's sake, only the 1's are entered in the data matrix. At other times a blank indicates that a feature is missing or absent for an artifact (for example, the feature might be the color of an ornamental band on pottery and the feature might be absent on some artifacts). Alternatively, a blank could signify the absence of data rather than the absence of the feature: we simply don't know anything about this feature for this artifact. Similarly, zeroes could be interpreted as absences, as values, as both absences and values, or as unknowns.
Whatever convention you are using, you need to let OptiPath know by setting the feature parameters Earlier, Later, Blanks and Zeroes.
The implications of treating earlier, later, blanks and zeroes as values, absences or unknowns are actually rather complicated. Consider the eleven features A, B, C, ..., K. In the following data matrix each row represents an artifact and each column represents a feature. For simplicity we will consider only data entries of 0 and 1, as is typical of occurrence seriation. Part of the complication is that we do not know a priori whether a 0 or 1 represents a measurement of a feature, an indication of the absence or presence of a feature, or simply a lack of knowledge of a feature.
Features      A  B  C  D  E  F  G  H  I  J  K
Artifact 1    1  0  0  1  1  0  0  1  1  0  0
Artifact 2    1  0  0  1  1  0  0  1  1  0  0
Artifact 3    1  0  0  1  1  0  0  1  1  0  0
Artifact 4    1  0  0  1  1  0  0  1  0  0  1
Artifact 5    1  0  0  1  1  0  1  0  0  1  1
Artifact 6    1  0  1  1  0  0  1  0  0  1  1
Artifact 7    1  0  1  1  0  1  1  0  0  1  0
Artifact 8    1  0  1  0  0  1  1  0  1  0  0
Artifact 9    1  0  1  0  0  1  0  1  1  0  0
Artifact 10   1  0  1  0  0  1  0  1  1  0  1
Artifact 11   0  0  1  0  1  1  0  1  0  0  1
Artifact 12   0  1  1  0  1  0  0  1  0  1  1
Artifact 13   0  1  1  0  1  0  1  0  0  1  0
Artifact 14   0  1  1  0  1  0  1  0  0  1  0
Artifact 15   0  1  1  1  1  0  1  0  1  0  0
Artifact 16   0  1  0  1  0  0  1  0  1  0  1
Artifact 17   0  1  0  1  0  1  0  1  1  0  1
Artifact 18   0  1  0  1  0  1  0  1  0  0  1
Artifact 19   0  1  0  1  0  1  0  1  0  1  0
Artifact 20   0  1  0  1  0  1  0  1  0  1  0
Artifact 21   0  1  0  1  0  1  0  1  0  1  0

Score I       1  1  1  2  2  2  2  3  3  3  3
Score II      1  1  2  2  3  3  4  4  5  5  6
Score III     2  2  3  4  5  5  6  7  8  8  9
Typically, in occurrence seriation, the objective is to order, or seriate, the artifacts so that the result is exactly one "string" of consecutive 1's (uninterrupted by 0's) in each column (feature). In the matrix above, features A, B and C have exactly one string of ones. Features D, E, F and G have exactly two strings of ones. Features H, I, J and K have exactly three strings of ones.
The question is, in terms of our objective to create exactly one string of 1's in each column, how would we rank the columns? Is A better or worse than C? Each has exactly one string of ones. There are two complications. The first is whether a 0 indicates a measurement of a feature or an absence of a feature or simply a lack of knowledge of a feature. The second is what is happening before the earliest artifact and after the latest.
If 1 represents the presence of a feature and 0 represents the absence, then (ignoring what's going on earlier and later) there is no reason to prefer the arrangement in column A to column C. In either case the feature appears in the archaeological record, persists uninterruptedly for some time and then disappears. However, if 0 and 1 represent two possible feature values (for example, "red" and "blue"), then presumably column A would be preferable to column C. In this case, feature A is blue for all artifacts for an uninterrupted period of time followed by an extended period where artifacts are red without exception; but feature C is blue for a while, then red, and then blue again. For feature C, the blues are interrupted. In other words, column C has an interrupted string of 0's. If 0's and 1's both represent feature values (rather than absences and presences) then uninterrupted strings of 0's are just as important as uninterrupted strings of 1's.
If 0's represent unknowns, or the absence of data rather than the absence of a feature, then they are irrelevant. We cannot draw any conclusion in the absence of data and all eleven column patterns are equally desirable.
The situation is further complicated by considering artifacts earlier and later than those in the data. If earlier artifacts are assumed to have a feature value of 0, then column A (0's, then 1's, then 0's) would have the same pattern as column C and the two would be indistinguishable, even when 0 and 1 are treated as feature values, the case in which A was preferable above.
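The effect of an assumed earlier value can be checked with a few lines of Python. This is only an illustrative sketch: the helper names count_transitions and pad are invented here and are not part of OptiPath, and the two columns are copied from the data matrix above.

```python
# Columns A and C from the data matrix, artifact 1 first, artifact 21 last.
COLUMN_A = [1] * 10 + [0] * 11
COLUMN_C = [0] * 5 + [1] * 10 + [0] * 6


def count_transitions(values):
    """Count the places where adjacent entries differ (0 to 1 or 1 to 0)."""
    return sum(1 for a, b in zip(values, values[1:]) if a != b)


def pad(values, earlier=None, later=None):
    """Add an assumed value before and/or after the sample; None adds nothing."""
    head = [] if earlier is None else [earlier]
    tail = [] if later is None else [later]
    return head + list(values) + tail


# Sample only: A changes once, C changes twice.
print(count_transitions(COLUMN_A), count_transitions(COLUMN_C))

# Earlier artifacts assumed to be 0: A now changes twice, the same as C.
print(count_transitions(pad(COLUMN_A, earlier=0)),
      count_transitions(pad(COLUMN_C, earlier=0)))
```

With nothing assumed about earlier artifacts the two columns differ; once a 0 is assumed ahead of artifact 1, both columns show the same number of changes.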
Ideally, in different situations you would like to be able to differentiate between columns A, B and C in the table above, each of which has one string of 1's. Similarly you might want to differentiate among D, E, F and G which each have two strings of ones, and among H, I, J and K which each have three strings of ones. The last three rows in the table introduce scoring systems which help you to discriminate between different types of columns.
The first scoring mechanism (Score I) is simply a count of the number of strings of ones, so it helps in minimizing the number of strings created, but does not help in differentiating among strings with different patterns.
The second scoring mechanism (Score II) goes a little further. It differentiates A and B from C. Unfortunately, it fails to differentiate between C and D. This is because Score II is simply a count of how many times in a column there is a transition from 0's to 1's or from 1's to 0's, ignoring what comes earlier or later. This count is always one less than the total number of strings (of both 0's and 1's) in the column.
The third scoring mechanism (Score III) is even better. It differentiates A and B from C, and C from D. It still fails to differentiate between A and B, but this is to be expected: columns A and B are symmetric (one is simply the reverse of the other), and seriation creates an ordering without telling you whether that ordering runs forward or backward in time (the Earlier and Later feature parameters in OptiPath actually give you some control over this). Score III is simply the sum of Score I and Score II.
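The three scoring mechanisms described above can be recomputed directly from the data matrix. The following Python sketch is for illustration only; the function names score_i, score_ii and score_iii are invented here, and the code works on the raw 0/1 entries rather than on anything OptiPath produces.

```python
from itertools import groupby

FEATURES = "ABCDEFGHIJK"

# One string per artifact (1 to 21), one character per feature (A to K).
ROWS = [
    "10011001100", "10011001100", "10011001100", "10011001001",
    "10011010011", "10110010011", "10110110010", "10100110100",
    "10100101100", "10100101101", "00101101001", "01101001011",
    "01101010010", "01101010010", "01111010100", "01010010101",
    "01010101101", "01010101001", "01010101010", "01010101010",
    "01010101010",
]


def column(index):
    """Return one feature column as a list of ints, artifact 1 first."""
    return [int(row[index]) for row in ROWS]


def score_i(values):
    """Number of strings of consecutive 1's."""
    return sum(1 for value, _ in groupby(values) if value == 1)


def score_ii(values):
    """Number of 0/1 changes: total strings (of 0's and 1's) minus one."""
    return sum(1 for _ in groupby(values)) - 1


def score_iii(values):
    """Sum of Score I and Score II."""
    return score_i(values) + score_ii(values)


scores = {
    name: (score_i(column(i)), score_ii(column(i)), score_iii(column(i)))
    for i, name in enumerate(FEATURES)
}

for name, (one, two, three) in scores.items():
    print(f"{name}: Score I = {one}, Score II = {two}, Score III = {three}")
```

The printed values match the three Score rows at the bottom of the data matrix.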
To accomplish Score I (there may be more than one way to do it) you can set the following feature parameters: Earlier = Absent, Later = Absent, Zeroes = Absent, Transition penalty = 0.5. It does not matter which distance function (metric) you use.
To accomplish Score II you can set Earlier = Unknown, Later = Unknown, Zeroes = Absent, Transition penalty = 1. It does not matter which distance function (metric) you use. Alternatively, you could set Earlier = Unknown, Later = Unknown, Zeroes = Value, Transition penalty = 0 and use either the Manhattan or Hamming distance function.
To accomplish Score III you can set Earlier = Absent, Later = Absent, Zeroes = Value & Absent, Transition penalty = 0.5 and use either the Manhattan or Hamming distance function.
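To see why these settings could produce the three scores, it may help to work through a simplified cost model. The Python sketch below is a guess at one model that is consistent with the settings above; it is not OptiPath's actual calculation. It assumes that each pair of adjacent artifacts is charged the transition penalty when one is known to be present and the other known to be absent, plus the Manhattan distance when both carry a value, and that unknowns cost nothing. All names in the code are hypothetical. Only columns A, C, D and K are used, written as run lengths taken from the data matrix.

```python
# Hypothetical stand-ins for the parameter choices named in the text.
UNKNOWN, ABSENT = "Unknown", "Absent"
VALUE, VALUE_AND_ABSENT = "Value", "Value & Absent"

# Assumed meaning of a 0 (or of an earlier/later artifact): (present, value).
# None means "not known" for presence and "no value" for value.
INTERPRETATION = {
    UNKNOWN: (None, None),
    ABSENT: (False, None),
    VALUE: (True, 0),
    VALUE_AND_ABSENT: (False, 0),
}


def runs(first, lengths):
    """Build a 0/1 column from alternating run lengths, starting with first."""
    column, bit = [], first
    for length in lengths:
        column += [bit] * length
        bit = 1 - bit
    return column


COLUMNS = {
    "A": runs(1, [10, 11]),
    "C": runs(0, [5, 10, 6]),
    "D": runs(1, [7, 7, 7]),
    "K": runs(0, [3] * 7),
}


def pair_cost(left, right, penalty):
    """Assumed cost between two adjacent artifacts for one feature."""
    (present_l, value_l), (present_r, value_r) = left, right
    cost = 0.0
    if present_l is not None and present_r is not None and present_l != present_r:
        cost += penalty  # known change between present and absent
    if value_l is not None and value_r is not None:
        cost += abs(value_l - value_r)  # Manhattan distance on the values
    return cost


def column_cost(column, earlier, later, zeroes, penalty):
    """Total assumed cost of one column, including the earlier/later ends."""
    cells = [INTERPRETATION[earlier]]
    cells += [(True, 1) if entry == 1 else INTERPRETATION[zeroes] for entry in column]
    cells.append(INTERPRETATION[later])
    return sum(pair_cost(a, b, penalty) for a, b in zip(cells, cells[1:]))


SETTINGS = {
    "Score I": (ABSENT, ABSENT, ABSENT, 0.5),
    "Score II": (UNKNOWN, UNKNOWN, ABSENT, 1.0),
    "Score II (alternative)": (UNKNOWN, UNKNOWN, VALUE, 0.0),
    "Score III": (ABSENT, ABSENT, VALUE_AND_ABSENT, 0.5),
}

for label, (earlier, later, zeroes, penalty) in SETTINGS.items():
    costs = {name: column_cost(col, earlier, later, zeroes, penalty)
             for name, col in COLUMNS.items()}
    print(label, costs)
```

Under these assumptions the costs for columns A, C, D and K equal the corresponding entries in the Score rows of the data matrix. In this model the penalty of 0.5 works for Score I because each string of 1's is entered once and left once, so the two half-point charges add up to one per string.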
It is up to you to decide which of these scoring mechanisms (or some other) is best suited for your data. It is worth thinking about because even though the issues can be quite subtle the effects can be quite significant.
4/14/08