A critical view on the FDA and PMDA SDTM validation rules

About a year ago, we (some CDISC volunteers and I) started rewriting all the FDA and PMDA validation rules and some of the CDISC ADaM Validation Checks v.1.3 in XQuery, a W3C-standardized language that is both human-readable and machine-executable, and that allows one to write validation rules that are completely transparent. As the rule is at the same time the implementation, there is no room for different interpretations. The use of XQuery also makes the rules themselves completely independent of the software with which the validation is performed. Once the rules are there, anyone can easily write their own software, and all implementations will get the same outcomes when applied to the same set of data.

You can follow the progress of our work here.

We started this work out of frustration with the way the rules were published and with the way they were implemented by OpenCDISC, as:

some of the rules were not clear
some of the rules were just terribly wrong
some of the rules were such that in order to obey them, you need to break another rule
the implementation is not transparent - you can't see how the rules were implemented in the software
the implementation is not open - the rules are hidden in source code, and it takes hours to reconstruct how a rule is really implemented (even for a Java specialist)
the software implementation leads to massive amounts of false positives
the software implementation does other things than what the rule states

As a consultant, I regularly get questions and complaints from my customers who are generating submission datasets concerning error messages originating from the OpenCDISC (now Pinnacle21) Validator software. Many of these are about different interpretations of the rules and about (what my customers consider to be) false positives.
So I regularly post questions on the forum, asking for more clarity, for rules to be corrected, for incorrect rules to be removed, and for improved transparency.
I ask these questions on this forum because I suspect that these rules have been written by this company on behalf of the FDA, and not by the appropriate people at the FDA itself. Several times I have sent detailed questions and requests for major corrections to the FDA, but I never got any answer.

I have been criticized by some people for doing all this - they cannot understand why I have problems with these rules and with how they have been implemented. So they have asked me to describe exactly what problems I have with these rules, and to explain to them how it can be done better (if at all).

And that is exactly what I will do in this article.

This article is "work in progress" and will remain so. I will refine it as I continue the work on the rules, and as people comment on this article - please do so!

General comments on all the rules

There are some fundamental problems with the wording of the rules. Obviously, they have not been written by "rule specialists". Here are some of them:

A good rule has a precondition and a postcondition. These are completely lacking here.
Wording such as "like" or "e.g." does not belong in rules. Rules must be precise, not fuzzy.
Wording such as "is expected to ..." or even "should" must be avoided in rule definitions. Both describe an expectation. It is much better to use "must" and "may not".
Error and warning messages should be descriptive, i.e. they should allow the user to find out exactly where and why a rule was violated. Ideally, they should contain the variable name and the variable value that violated the rule and, when applicable, the correct value.
We even found cases where the error/warning message did not fit the rule at all - this is of course fatal for the user.
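
To make the idea of a precondition, a postcondition and a descriptive message concrete, here is a minimal sketch in Python (not XQuery, and not the implementation of our actual rules). The `Rule` class, the `validate` function and the rule ID "EXAMPLE-ARM" are hypothetical names made up for this illustration. The example rule covers only one direction of the ARMCD/ARM rule from the table below, and the case-sensitive comparison is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass
class Rule:
    """Hypothetical rule structure: the rule only applies when the precondition holds."""
    rule_id: str
    precondition: Callable[[Record], bool]
    postcondition: Callable[[Record], bool]
    message: Callable[[Record], str]


def validate(rule: Rule, records: List[Record]) -> List[str]:
    """Return one descriptive finding per record that violates the rule."""
    findings = []
    for row, record in enumerate(records, start=1):
        # A violation means: the precondition holds, but the postcondition does not.
        if rule.precondition(record) and not rule.postcondition(record):
            findings.append(f"{rule.rule_id}, row {row}: {rule.message(record)}")
    return findings


# Assumption: exact, case-sensitive match on 'Screen Failure'.
arm_rule = Rule(
    rule_id="EXAMPLE-ARM",
    precondition=lambda r: r.get("ARMCD") == "SCRNFAIL",
    postcondition=lambda r: r.get("ARM") == "Screen Failure",
    message=lambda r: (
        f"ARM='{r.get('ARM')}' but must be 'Screen Failure' because ARMCD='SCRNFAIL'"
    ),
)

records = [
    {"ARMCD": "SCRNFAIL", "ARM": "Screen Failure"},
    {"ARMCD": "SCRNFAIL", "ARM": "SCREEN FAILURE"},
    {"ARMCD": "A", "ARM": "Drug A"},
]
findings = validate(arm_rule, records)
```

The message names the variable, the offending value and the correct value, so the user does not have to guess which record triggered the finding, or why.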

Furthermore, there are some serious implementation issues:

"Define.xml is leading", meaning that the define.xml is "the sponsor's truth". The software implementation seems to ignore the define.xml almost completely when SDTM datasets are being validated. Ideally, each rule implementation should first look into the define.xml, pick up some metadata, and then use that together with the rule itself to perform the validation.
In some cases, we found that the software implementation of a rule does not correspond with the text of the rule. A famous example is rule FDAC154 "Missing value for -ORRESU, when -ORRES is provided", which, if implemented as written, would lead to millions of errors (as the rule is incorrect - see below). It doesn't, because exceptions have been hardcoded in the software, but we just don't know how. Many other examples of cases where the software seems to do other things than what is described in the rule can be found on the forum.
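
The "define.xml first" approach can be sketched as follows, again in Python and only as an illustration. The XML fragment is a heavily simplified, made-up stand-in for a real define.xml (it only contains two `ItemDef` elements in the ODM 1.3 namespace), and the function names `declared_datatypes` and `check_integer_variables` are hypothetical. The sketch first picks up the declared data types from the metadata and only then looks at the data.

```python
import xml.etree.ElementTree as ET

ODM_NS = "http://www.cdisc.org/ns/odm/v1.3"

# Heavily simplified, hypothetical define.xml fragment (illustration only).
DEFINE_XML = f"""<ODM xmlns="{ODM_NS}">
  <Study OID="S1">
    <MetaDataVersion OID="MDV1" Name="Example">
      <ItemDef OID="IT.VS.VSSEQ" Name="VSSEQ" DataType="integer"/>
      <ItemDef OID="IT.VS.VSDTC" Name="VSDTC" DataType="date"/>
    </MetaDataVersion>
  </Study>
</ODM>"""


def declared_datatypes(define_xml):
    """Step 1: pick up the metadata, here a mapping of variable name to declared data type."""
    root = ET.fromstring(define_xml)
    return {
        item.get("Name"): item.get("DataType")
        for item in root.iter(f"{{{ODM_NS}}}ItemDef")
    }


def check_integer_variables(records, datatypes):
    """Step 2: use the metadata to check every non-empty value declared as 'integer'."""
    messages = []
    for row_number, record in enumerate(records, start=1):
        for name, value in record.items():
            if datatypes.get(name) != "integer" or value == "":
                continue
            try:
                int(value)
            except ValueError:
                messages.append(
                    f"Row {row_number}: {name}='{value}' is not an integer, "
                    "but define.xml declares DataType='integer'"
                )
    return messages


records = [
    {"VSSEQ": "1", "VSDTC": "2016-01-02"},
    {"VSSEQ": "2.5", "VSDTC": "2016-01-03"},
]
messages = check_integer_variables(records, declared_datatypes(DEFINE_XML))
```

The point of the design is that the check is driven by what the sponsor declared in the define.xml, not by a list of variables hardcoded in the software.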

Comments on the FDA/PMDA SDTM rules in detail

Here is a list of the rules. It contains the rule ID, the rule description as copied from the FDA/PMDA Excel worksheet, and my own comments (third column).
Rule IDs starting with "FDA" were published by the ... FDA, rules starting with "SD" were published by PMDA.

Rule ID	Description	Comments Jozef
FDAC013 / SD1030	Domain table must have a valid format (e.g., SAS transport (XPORT) v.5 or text-delimited)	Unclear rule: first of all, the wording "e.g." does not belong in a rule. Also, what is "text delimited"? Would e.g. HL7-v2 ("|" delimited with "^" for components within fields) be accepted by the FDA? Validation rules need to be clear and precise. This rule isn't at all.
FDAC017 / SD0023	Variables described in SDTM as Required must be included in the dataset	Unclear wording: the rule can only be understood when one first reads the next rule (FDAC018) stating that the variable value for "required" records cannot be null. After reading rule FDAC018, one can understand that rule FDAC017 means that the table column for each "required" variable must be present in the (SAS-XPT) dataset. Nothing is however said about the definition of these variables in the define.xml - which is even more important.
FDAC020 / SD0057	Variables described in SDTM as Expected should be included in the dataset	Same problem as with FDAC017.
FDAC021	Variables requested by FDA in policy documents should be included in the dataset. E.g., EPOCH and ELEMENT	Unimplementable rule: This is a rule that cannot be implemented in software, as the "FDA policy documents" are not machine-readable. This kind of rule opens the door for different interpretations of these documents. For example, what does "requested" mean? Give 5 people the same FDA policy document, and they will come up with 6 different lists of "variables that are requested". Bad wording: the wording "e.g." does not belong in rules.
FDAC022	According to FDA expectations, a treatment-emergent flag should be included in SUPPAE according to SDTM IG v3.1.2 #8.4.3	Unclear rule: what is described in #8.4.3 is an example, and not normative. The better way would have been to state that a SUPPAE variable with name "AETRTEM" and label "Treatment Emergent Flag" with allowed values "Y" and "N" must be described in the define.xml, and that the variable must be populated for each corresponding record in the AE dataset. I was also surprised that I could not find anywhere in the SDTM-IGs what the exact definition of "Treatment Emergent Flag" is, or what the criteria are to set it to "Y" or "N" (it is marked as "derived").
FDAC027 / SD0058	Only variables listed in SDTM model should appear in a dataset. New sponsor defined variables must not be added, and existing variables must not be renamed or modified	Second part: Unimplementable rule: it is unclear what exactly is meant by "existing variables must not be renamed or modified". Is it meant that the label should be exactly as in the SDTM-IG? Or that another datatype is not allowed to be used? Rules must be precise. The second part of this rule isn't at all.
FDAC029 / SD1074	Variables designed only for SEND pre-clinical studies must be not included in the SDTM dataset	Unimplementable rule: How can an application know whether a variable is "designed only for SEND"? The only way I can imagine is that a human goes through the SEND-IG document and makes a list of variables that are marked as such. Again, give 5 people the SEND-IG and they will probably come up with 6 different lists of "Variables designed only for SEND pre-clinical studies". The better way would have been that the FDA publishes lists (for each SDTM-IG version) of variables it defines as "SEND only". I found that the SDTM-IG 3.2 has a section about this (#2.7 SDTM variable allowed in SDTMIG). The rule could at least have pointed to this section.
FDAC030 / SD1075	Variables described in IG as not recommended for usage should be not included in the dataset	Unclear rule: I searched in the SDTM-IGs for the wording "not recommended" and had ... 0 hits. The SDTM-IGs have "assumptions" for each domain stating things like "The following ... would not generally be used in ...". Is that meant by "not recommended"? I asked our English professor, and she said that "not recommended" and "not generally be used" mean two completely different things. As one can understand, implementation of this rule will "usually" ;-) lead to a large amount of false positives, as "generally" also means that there are many exceptions that are completely legal.
FDAC031 / SD1076	SDTM model variable may be added into standard domains according its domain general class, if there are no restrictions on their usage specified in IG.	Unclear rule / unimplementable rule: What does this rule mean? I did not find the wording "restriction" in relation to the usage of variables in the SDTM-IG. Or is the rule identical/similar to rule FDAC030?
FDAC032 / SD0055	Variable Data Types in the dataset should match the variable data types described in SDTM	Improvable: this rule can be written much more clearly. The reason is that it only applies to SAS Transport 5 (remember rule FDAC013 stating that text files are allowed). For SAS Transport 5, it should state that the variable data type as defined in the header of the SAS Transport 5 file (either "char" or "numeric") must match the data type provided in the SDTM-IG in the domain variable tables.
FDAC033 / SD0063	Variable Label in the dataset should match the variable label described in SDTM. When creating a new domain Variable Labels could be adjusted as appropriate to properly convey the meaning in the context of the data being submitted.	Improvable / hard to implement: This is probably the most contested rule, and the one causing the most (often false) positives in the classic implementation. The problem is: "what is the label in SDTM"? Look in the SDTM-IG, choose a domain chapter. What is the label? There is no sentence there that says "Label:". Instead, we more or less need to guess that e.g. "Inclusion/Exclusion Criteria Not Met" is the label. Essentially, the SDTM-IG should be more explicit and say: "Label: Inclusion/Exclusion Criteria Not Met". How can we otherwise ever come to a machine-readable SDTM-IG? Also problematic is that we often see that within a domain description in the SDTM-IG, there is misleading content, e.g. that a label on an example table is written differently than at the top of the domain description. Another thing that often causes false positives is labels for variables that are in the model, but not in the specific domain in the SDTM-IG. Yet another problem is that CDISC has published variable labels (especially on the value level) that are longer than 40 characters. Strict implementation of the rule will in such a case always lead to an error when SAS Transport 5 is used. Essentially, all this is ridiculous: isn't the label there to BETTER explain to the reviewer what the variable is about? I have seen cases where the sponsor slightly changed the label to IMPROVE THE QUALITY, i.e. to better explain what the variable means in the context of his submission. *Rule FDAC033 punishes you for improving the quality of your submission!* Also see my blog at: http://cdiscguru.blogspot.com/2015/12/sdtm-labels-freedom-or-slavery.html.
FDAC034 / SD0059	Variable Data Types in the dataset must match the variable data types described in the data definition document (define.xml)	Unimplementable rule: SAS Transport 5 only knows 2 datatypes: "char" and "numeric". These datatypes do not exist in define.xml. The latter has > 10 datatypes. Although a partial mapping is possible, this isn't sufficient for implementing in a rule (for example "date" in define.xml is "char" in SAS Transport 5).
FDAC036 / SD1082	Variable length should be assigned based on actual stored data to minimize file size. Datasets should be re-sized to the maximum length of actual data used prior to splitting.	Unclear wording: to make such a rule implementable, it must be defined more exactly. E.g. The variable length of a variable as defined in the header of the SAS Transport 5 file may not exceed the length of the longest variable value for that variable in that SAS Transport file. This rule has caused a lot of problems as its implementation seems to deviate from what the rule says. See here for examples.
FDAC037 / SD0037	Variable values should be populated with terms found in the user-defined codelist associated with the variable in define.xml	Non-precise wording: what is meant by "user-defined codelist"? Is an extended CDISC codelist "user-defined"? And what is then a "non-user-defined codelist"?
FDAC038 / SD0003	Value of Dates/Time variables (*DTC) must conform to the ISO 8601 international standard	Non-precise wording: ISO-8601 contains more than just dates and datetimes. It also contains standard formats for time and for duration. If we take the rule literally, then "-P21D" should also be a valid value for a *DTC (date/time of collection) variable. Obviously, it isn't. The rule should at least state "... must conform to the ISO 8601 standard for dates and datetimes".
FDAC039 / SD1011	Value of Duration, Elapsed Time, and Interval variables (--DUR, --ELTM, --EVLINT) must conform to the ISO 8601 international standard	Non-precise wording: same problem as in rule FDAC038. If we take the rule literally, then "2016-01-02" should also be a valid value for e.g. --DUR variables. Obviously, it isn't. The rule should at least state "... must conform to the ISO 8601 standard for durations".
FDAC047 / SD0086	All SUPPQUAL Domains records are associated with a single parent record and must have a unique combination of Study Identifier (STUDYID), Unique Subject Identifier (USUBJID), Identifying Variable (IDVAR), Identifying Variable Value (IDVARVAL) and Qualifier Variable Name (QNAM) variable value. For all SUPPQUAL domain record all The QNAM variable should be unique for each parent record.	Incorrect rule: This rule is not correct at all! The first part states that each SUPPxx record can only be associated with a single parent record. This is simply not true! A quick inspection of the SDTM-IG (section 8.4.3 in the SDTM-IG 3.2) shows us that there are different cases where a SUPPxx record is associated with a group of records. Examples are when IDVAR is e.g. --GRPID or --CAT. The second part of the rule "For all SUPPQUAL domain record all The QNAM variable should be unique for each parent record" obviously has a typo: I do not have any idea what is meant here.
FDAC049 / SD0079	Subjects that have withdrawn from a trial before assignment to an Arm (ARMCD='NOTASSGN') should not have any Exposure records	Incomplete rule? What about subjects that were not treated (ARMCD='NOTTRT') or were screen failures (ARMCD='SCRNFAIL')? Are they allowed to have exposure records in the EX domain?
FDAC051 / SD1044	Baseline Flag (--BLFL) should be present in all custom Findings domains	Improvable: "present" (probably) means that the variable should exist (but that is nowhere stated in the FDA rules). A better description would have been: "A Baseline Flag variable (--BLFL) must be present as a column in all datasets representing custom findings domains. The variable must also be defined for such a custom domain dataset in the define.xml".
FDAC054	Date/Time of Collection (--DTC) should be less than or equal to the Start Date/Time of the latest Disposition Event (DSSTDTC)	Implementation issue: this rule often causes problems in the implementation, but that is more due to a quality problem: we often see that DSSTDTC is given as a date, whereas e.g. LBDTC is given as a datetime (or the other way around). Comparing dates with datetimes is difficult ...
FDAC055 / SD1078	Premissible variable whould not be present in domain, when the variable has missing value for all records in the dataset	Implementation issue: this rule can conflict with other rules. For example, the FDA requires that some permissible variables must be present (as a column) anyway. When there is no data for such a column, this will violate this rule, although the variable has been added in order to not violate another rule.
FDAC056	Domain Abbreviation (DOMAIN) variable should be consistent with the name of the dataset	Unclear rule: what is meant here? What is meant by "consistent"? Does it mean that the first two characters of the dataset name (case-sensitive or case-insensitive?) must correspond to the two characters of the domain name, with the exception for SUPPxx datasets where there is no "DOMAIN" variable? Or is something else meant? Like in "QSCG" (dataset name) and "QS" (domain name). If that is meant, please state so precisely!
FDAC057 / SD0017	The value of Name of Measurement, Test or Examination (--TEST) should be no more than 40 characters in length	Implementation issue: unfortunately, CDISC has published test names in its controlled terminology that are longer than 40 characters. As a sponsor, what should I do in such a case? Cut after 40 characters (violating other rules about codelists) or violate this rule? Nobody knows ...
FDAC061 / SD0066	Planned Arm Code (ARMCD) values should match entries in the Trial Arms (TA) dataset, except for subjects who failed screening (ARMCD = 'SCRNFAIL') or were not fully assigned to an Arm (ARMCD = 'NOTASSGN')	Incomplete rule? What about subjects that were randomized but not treated (ARMCD=NOTTRT). Should "NOTTRT" appear as a planned arm in the TA dataset? I have doubts...
FDAC063 / SD0068	A value for Inclusion/Exclusion Criterion Short Name (IETESTCD) must be present the Trial Inclusion/Exclusion Criteria (TI) domain	Improvable: the wording of this rule can be considerably improved to make it easier to understand and more exact. E.g.: The value of IETESTCD in the IE dataset must be identical to one of the values of IETESTCD in the TI dataset.
FDAC068 / SD1053	Records for subjects who failed a screening or were not assigned to study treatment (ARMCD is 'SCRNFAIL' or 'NOTASSGN') should not be included in the Trial Arms (TA) or Trial Visits (TV) datasets	Confusing rule: The TA and TV domains do not contain records about subjects! So what is meant here? Probably it is meant that "SCRNFAIL" and "NOTASSGN" may not appear in TA as a value for ARMCD. If so, why isn't it written so?
FDAC071 / SD1049	Qualifier Variable Label (QLABEL) value may have up to 40 characters	Improvable: "may have" is not a good statement in a rule. Suppose that there is the rule "you may drive through green traffic lights"... Better: "Qualifier Variable Label (QLABEL) value may not have more than 40 characters".
FDAC072 / SD1095	Split datasets must not have name of original domain. E.g., lbhm.xpt is a valid name, when lb.xpt is not a valid dataset name for split domain.	Improvable: How do we know that a dataset is "split"? We know because the dataset name is more than 2 characters long, of which the first 2 characters are identical to the two characters of the domain name (all this from the define.xml). The only thing we can do is to count the number of datasets for a single domain (in the define.xml) and, when there is more than one of them for the same domain, check whether the dataset name is more than two characters long, of which the first two are from the domain name. Also note that the rule speaks about "dataset name". This is not entirely correct, as what is given in the example "lbhm.xpt" is the file name. The dataset name is LBHM.
FDAC074 / SD0077	Reference record defined by Related Domain Abbreviation (RDOMAIN), Unique Subject Identifier (USUBJID), Identifying Variable (IDVAR) and Identifying Variable Value (IDVARVAL) must exist in the target Domain	Improvable: better "referenced record or records defined by ...", this as a single record e.g. in SUPPxx can point to a group of records in the target domain. Not a bad idea also to state some way or another that the target domain is represented by the RDOMAIN variable value.
FDAC076 / SD1014	Order of Element within Arm (TAETORD) values should match entries in the Trial Arms (TA) dataset	Unclear rule: what is meant here? TAETORD is a number. Is it really sufficient that this number appears in TA-TAETORD? This works well when all arms have the same number of elements. Suppose for example that arm A has 3 elements but arm B has 5 elements. Would it then be OK that when there is an SE record for a subject in Arm A, the value of TAETORD in SE is "5"? "5" is a TAETORD value in TA isn't it? Or is it meant that the combination of ARMCD, ETCD and TAETORD must be found as a record in TA?
FDAC078 / SD1128	Relationship Type (RELTYPE) variable values should be populated with terms 'ONE', 'MANY', when Related Records (RELREC) dataset is used to identify relationships between datasets.	Improvable: How can an application see whether "RELREC dataset is used to identify relationships between datasets"? Is it by the absence of a value for USUBJID? If so, this could be stated as a precondition, as is good practice in rule writing.
FDAC079 / SD1015	Epoch (EPOCH) values should match entries in the Trial Arms (TA) dataset	Incomplete rule? Is this sufficient? If two arms have different epochs, ... Shouldn't it be that the combination of ARMCD and EPOCH must be found as a record in TA?
FDAC082 / SD0015	Non-missing Duration of Event, Exposure or Observation (--DUR) value must be greater than or equal to 0	Improvable: From the SDTM-IG (not machine readable) we know that durations MUST be expressed in the ISO 8601 "duration" format (e.g. P10D for "10 days"). The rule should state this and add that negative durations are characterized by a minus sign at the start (e.g. "-P10D"). Looks obvious to humans, but not to machines ...
FDAC084 / SD0007	Standard Units (--STRESU) must be consistent for all records with same Short Name of Measurement, Test or Examination (--TESTCD), Category (--CAT), Specimen Type (--SPEC) and Method of Test or Examination (--METHOD)	Incorrect rule: Also see the article "Rule FDAC084 is just damned wrong". Just suppose that glucose is measured by test strip in urine. You get the results from different sites and labs. One will report it as a concentration (so with a unit), another will report it as presence (so "positive"/"negative" - no units), and yet another as an ordinal value ("1+", "2+", ...). Can you standardize all these to a single unit?
FDAC097 / SD0011	Description of Arm (ARM) must equal 'Screen Failure', when Arm Code (ARMCD) is 'SCRNFAIL', and vice versa	Improvable: it should be stated whether this 1:1 matching must be case sensitive or not, especially as section 4.1.2.4 of the SDTM-IG states "It is recommended that text data be submitted in upper case text. Exceptions may include ... certain controlled terminology" - continued in section 4.1.3.2 (Controlled Terminology Text Case): "It is recommended that controlled terminology be submitted in upper case text for all cases other than those ... prescribed in external reference (e.g. LOINC, MedDRA, ...) ... and Units". So, my interpretation is that these sections state that it should be "SCREEN FAILURE" (as it is under controlled terminology but neither a unit nor external), but the rule text suggests otherwise. Also note that the SDTM-IG is completely wrong about the LOINC codes: these are essentially numbers (e.g. "50555-2").
FDAC105 / SD1023	Combination of Visit Name (VISIT) and Visit Number (VISITNUM) in subject-level domains should match that in the TV domain with the exception of Unscheduled and Unplanned visits	Improvable: it should be stated how software can detect that a visit is unscheduled or unplanned, i.e. that the VISITNUM is not an integer, but a floating point number (e.g. "3.1").
FDAC108 / SD0025	Date/Time of Specimen Collection (--DTC) must be less or equal to End Date/Time of Specimen Collection (--ENDTC)	Non-precise wording: what is the word "specimen" doing here? It is only applicable to the LB domain whereas the Excel worksheet applies to ALL Findings domains.
FDAC117 / SD0021	One of End Time-Point variables values is expected to be populated when an event or an intervention occurred. (E.g., one of End Date/Time of Event or Intervention (--ENDTC), End Relative to Reference Period (--ENRF), and End Relative to Reference Period (--ENRTPT) variables values should not be missing, or Occurrence (--OCCUR) variable value should be 'N')	Improvable: wording "is expected to be populated" does not belong in a rule definition. Much better is "must be populated".
FDAC124 / SD1083	Collection Study Day (--DY) variable should be included into dataset, when Collection Study Date/Time (--DTC) variable is present	Improvable: it is not entirely clear (until one reads rule FDAC125) that this is about the presence of the columns --DTC and --DY.
FDAC126 / SD1085	Collection Study Day (--DY) variable value should be not be imputed. It may be only populated, when both Collection Study Date/Time (--DTC) and Subject Reference Start Date/Time (RFSTDTC) variables values are provided and both of them include complete date part.	Improvable: how can a system know whether a value has been imputed? Can it read our minds? I would propose to remove the wording about "imputed", so the rule would be clearer when it states: "Collection Study Day (--DY) variable value may only be populated, when both Collection Study Date/Time (--DTC) and Subject Reference Start Date/Time (RFSTDTC) variables values are provided and both of them include complete date part".
FDAC130 / SD1089	Study Day of Start (--STDY) variable value should be not be imputed. It may be only populated, when both Start Study Date/Time (--STDTC) and Subject Reference Start Date/Time (RFSTDTC) variables values are provided, and both of them include complete date part.	Improvable: see comment on rule FDAC126.
FDAC154 / SD0026	Original Units (--ORRESU) should not be NULL, when Result or Finding in Original Units (--ORRES) is provided	Incorrect rule: this rule is completely wrong! There are so many tests in the applicable domains (EG, LB, VS, ...) that do not have a unit! See: https://www.pinnacle21.net/forum/rules-fdac154-and-fdac0169.
FDAC156 / SD1120	Comments should be stored in Comments (CO) domain, rather than be put into Supplemental Qualifier (SUPP--) domains	Not implementable: how can a computer know that something is a comment? It can't. This rule is not implementable.
FDAC159 / SD1116	Split datasets should have matching variable lengths for future merges. Datasets should be resized to the maximum length used prior to splitting.	Improvable: what is meant by matching variable lengths? Probably it means that the same variable in datasets for the same domain must have identical length. "Split datasets" ignores the reality that in most cases, datasets were never split. Smart people generate such "split" datasets separately from the start, e.g. by category (--CAT). Also note that complying with this rule may mean that you need to violate rule FDAC036, depending on how the latter is implemented.
FDAC163 / SD0095	Supplemental Qualifiers special purpose dataset (SUPP--) can only be used to capture non-standard variables and their association to parent records in general-observation-class datasets (Events, Findings, Interventions) and Demographics	Improvable: I am not 100% sure, but I think that this rule means that SUPP-- records may never point (value of RDOMAIN) to records in other domains than FINDINGS, EVENTS, INTERVENTIONS and DM.
FDAC165 / SD2006	MedDRA coding info should be populated using variables in the Events General Observation Class, but not in SUPPQUAL domains	Unimplementable? How can a computer know that a value in a SUPPQUAL domain is a MedDRA coding? I think such a test is extremely hard to implement.
FDAC169 / SD0029	Standard Units (--STRESU) should not be NULL, when Character Result/Finding in Std Units (--STRESC) is provided	Incorrect rule: Just like rule FDAC154, this rule is completely wrong. Suppose that the value of --STRESC is either "POSITIVE" or "NEGATIVE", e.g. for a lab test in which the presence of a specific bacterium is tested. What is then the "standard unit"? There isn't one of course!
FDAC174 / SD0045	Character Result/Finding in Std Units (--STRESC) should be provided, when Result Category (--RESCAT) is specified	Improvable: what is the wording "in Std Units" doing here? Many --STRESC values do not have units at all. When looking into the SDTM-IG, it usually says "in Std format" for the label for Findings domains.
FDAC178 / SD0047	Status (--STAT) should be set to 'NOT DONE' or Derived Flag (--DRVFL) should have a value of 'Y', when Result or Finding in Original Units (--ORRES) is NULL	Unclear rule: there is a mismatch between the rule description and the error message. The latter is: "Missing value for --ORRES, when --STAT or --DRVFL is not populated".
FDAC179 / SD0048	Status (--STAT) should be NULL, when Result or Finding in Original Units (--ORRES) is provided	Unclear rule: there is a mismatch between the rule description and the error message. The latter is: "Value for --ORRES is populated, when --STAT is 'NOT DONE'".
FDAC180 / SD1123	Value of Original Result (--ORRES) variable is expected to be missing, when observation was not performed (Status (--STAT) is 'NOT DONE')	Improvable: the wording "is expected" should not appear in a rule. Much better is to use "must", in this case: "must be absent".
FDAC192 / SD2019	Age Range (AGETXT) variable values should be populated with 'number-number' format	Improvable: what is meant by "number-number format"? I guess it means "a numeric value followed by a dash followed by a numeric value". Probably this is not sufficient: can the number be a negative number? Can it be a floating point number?
FDAC197 / SD2236	A value for an Actual Arm Code (ACTARMCD) variable is expected to be equal to a value of an Arm Code (ARMCD) variable.	Unclear rule: "is expected" is not a good wording in a rule... I also do not understand what the goal of this rule is. If the values of ARMCD and ACTARMCD must be equal, why do we need ACTARMCD at all? Just to fulfill this rule, sponsors will simply copy the value of ARMCD into ACTARMCD. Or is this rule a study quality measure, to see how many subjects ended up in the wrong arm? In that case, this should not be a rule. Or is it maybe meant that the value of ACTARMCD must be one from ARMCD in TA?
FDAC213 / SD1201	The structure of Events class domains should be one records per Event per subject. No Events with the same Collected Term (--TERM), Decoded Term (--DECOD), Category (--CAT), Subcategory (--SCAT), Severity (--SEV), and Toxicity Grade (--TOXGR) values for the same Subject (USUBJID) and the same Start Date (--STDTC) are expected.	Improvable: replace "and the same Start Date" by "and the same Start Date/Time". The former gives the impression that the date alone is sufficient. This will often not be the case.
FDAC214 / SD1029	Variables value must not include non-ASCII or non-printable characters (outside of 32-126 ASCII code range), limited to variables which values may be converted into new variable name or label (--TEST, --TESTCD, --PARM, --PARMCD, QLABEL, QNAM)	Improvable: What are "non-printable characters"? That depends on the printer, doesn't it? I know where this comes from. In the code tables like this one, the ASCII characters 32-126 are named "print characters", which is something else than "printable" of course.
FDAC217	SAS v5 export format has a limitation on variables length up to 200 characters. If collected value was more than 200 characters, then SUPPQUAL dataset should be used to store additional 200+ characters. Variable QNAM should have --TERM1, ---TERM2, etc. values for those records. Value splitting should be performed between words or numbers. See SDTM IG #4.1.5.3.2 for details. This risk-assessment check is triggered based on assumption that if original value was truncated, then the actual data value length is exactly 200 characters.	Incorrect rule: this should not be a rule. The message further states "High risk of truncated value for --TERM variable". How can a risk be a rule violation? The suggested algorithm will also lead to a large number of false positives, as it IS allowed to use the full 200 characters in the case of a "split". There is nothing in the SDTM-IG Section #4.1.5.3.2 that forbids this. Or is the software of the FDA not smart enough to recognize such a situation (it would take me < minutes to program this)?
FDAC234 / SD2201 and following	'Added on to Existing Treatments' (ADDON) record must be populated in Trial Summay (TS) domain. It is expected for SDTM IG v3.1.2 data and required for data in all more recent SDTM versions.	Improvable: all the following rules about TS-parameters can be considerably improved. They speak about 'XXX' record (like 'ADDON record') where it is not stated what that means. Much clearer would be something like: " 'Added on to Existing Treatments' (records for which TSPARMCD has the value 'ADDON') ..."
FDAC237 / SD1215	TSVAL value must be either ISO 8601 format for time period (e.g. P80Y) or null, when TSPARMCD='AGEMAX'	Non-precise wording: the formal wording is "for durations" instead of "for time period". A hyperlink to the specification of what an "ISO 8601 duration" is wouldn't hurt either.
FDAC243 / SD1219	TSVAL value must be either ISO 8601 format for time period (e.g. P80Y) or null, when TSPARMCD='LENGTH'	Confusing rule: giving "P80Y" as the example here is a very bad idea. Do you know of any clinical trial that took 80 years?
FDAC246 / SD1221	TSVAL for PLANSUB record must be numeric	Non-precise rule: This would mean that the following values for "planned number of subjects" are valid: "-999", "567.23". The better wording would be "must be a positive integer".
FDAC252 / SD2208	'Study Stop Rules' (STOPRULE) record may be populated in Trial Summay (TS) domain. It is permissible for SDTM IG v3.1.2 data and in all more recent SDTM versions.	Unimplementable rule: what does this mean? Is this a rule? The error message that comes with it is: "Missing STOPRULE Trial Summary Parameter", but it doesn't match with the rule description at all.
FDAC281	'Registry Identifier' (REGID) record may be populated in Trial Summay (TS) domain. It is expected for SDTM IG v3.1.2 data and all more recent SDTM versions.	Non-precise rule: I presume "may" must be replaced by "must", otherwise the rule is meaningless.
FDAC283 / SD2223	'Pharmacological Class of Investigational Therapy' (PCLAS) record must be populated in Trial Summay (TS) domain, when study type is 'INTERVENTIONAL' and if Intervention Type (INTTYPE) is one for which pharmacological class is applicable. It is expected for SDTM IG v3.1.2 data and required for data in all more recent SDTM versions.	Unclear rule: what does "if Intervention Type (INTTYPE) is one for which pharmacological class is applicable" mean? How can a software program know this? I looked into the CDISC Controlled Terminology, but it doesn't tell me for which of the INTTYPE values a "pharmacological class" applies either.
FDAC292 / SD2245	TSVAL variable value must be in ISO 8601 format, when TSPARMCD='DCUTDTC'	Incomplete rule: 'DCUTDTC' means "Data Cutoff Date", so a date is meant. However, the rule would also allow an ISO 8601 duration (like P80Y), wouldn't it? More exact is "must be in ISO 8601 date format".
FDAC296 / SD2246	TSVAL variable value must be numeric, when TSPARMCD='NARMS'	Incorrect wording: according to this rule, the values "-3" and "3.1415927" would be acceptable. Are they? More exact is "must be a positive integer".
FDAC301 / SD2247	TSVAL variable value must be in ISO 8601 format, when TSPARMCD='SSTDTC'	Incomplete rule: same problem as with rule FDAC292: 'SSTDTC' means "Study Start Date", so a date is meant. However, the rule would also allow an ISO 8601 duration (like P80Y), wouldn't it? More exact is "must be in ISO 8601 date format".
FDAC303 / SD2248	TSVAL variable value must be in ISO 8601 format, when TSPARMCD='SETDTC'	Incomplete rule: same problem as with rule FDAC292: 'SETDTC' means "Study End Date", so a date is meant. However, the rule would also allow an ISO 8601 duration (like P80Y), wouldn't it? More exact is "must be in ISO 8601 date format".
FDAC305	TSVAL variable value must be numeric, when TSPARMCD='ACTSUB'	Incorrect wording: according to this rule, the values "-300" and "300.56" would be acceptable. Are they? "ACTSUB" means "Actual Number of Subjects". More exact is "must be a positive integer".
FDAC340	<Variable Label> (<Variable Name>) variable values should be populated with terms found in '<Codelist Name>' (<NCI Code>) CDISC controlled terminology codelist. New values cannot be added into the CDISC CT non-extensible codelist	Incomplete / incomprehensible rule: What does this mean? What is the precondition? Should all variable values appear in codelists? What is the last sentence doing here?
FDAC341	<Variable Label> (<Variable Name>) variable values should be populated with terms found in '<Codelist Name>' (<NCI Code>) CDISC controlled terminology codelist. New terms can be added as long as they are not duplicates, synonyms or subsets of existing standard terms.	Incomplete / incomprehensible rule: Same problem as for rule FDAC340. Also: how can a software system determine whether a newly added term in a codelist is a synonym of an already present term?
FDAC342	In <Domain Description> (<Domain Name>) domain values for <Variable Label "Short Name"> (--TESTCD) and <Variable Label "Test Name"> (--TEST) variables must be populated using terms with the same Codelist Code value in CDISC control terminology. There is one-to-one relationship between --TESTCD and --TEST values defined in CDISC control terminology by Codelist Code value	Unclear rule: what is the "Codelist Code"? Is the NCI code meant? If so, please state so. The rule could be interpreted as requiring both codelists to have the same OID in the define.xml.
FDAC343	<Variable Label> (<Variable Name>) variable values must be populated with terms found in '<Codelist Name>' (<NCI Code>) CDISC controlled terminology codelist, when <Variable Label (from WhereClause)> (<Variable Name (from WhereClause)>) <Comparator> '<Value (from WhereClause)>'. New values cannot be added into the CDISC CT non-extensible codelist.	Unclear rule: As one of the developers of the define.xml 2.0, even I do not have any idea what is meant here.
FDAC344	<Variable Label> (<Variable Name>) variable values must be populated with terms found in '<Codelist Name>' (<NCI Code>) CDISC controlled terminology codelist, when <Variable Label (from WhereClause)> (<Variable Name (from WhereClause)>) <Comparator> '<Value (from WhereClause)>'. New terms can be added as long as they are not duplicates, synonyms or subsets of existing standard terms.	Unclear rule: Same problem as with rule FDAC343 - even I (as a define.xml expert) do not have any idea what is meant here.
FDAC345	In <Domain Description> (<Domain Name>) domain values for <Variable Label "Short Name"> (--TESTCD) and <Variable Label "Test Name"> (--TEST) variables must be populated using terms with the same Codelist Code value in CDISC control terminology. There is one-to-one relationship between --TESTCD and --TEST values defined in CDISC control terminology by Codelist Code value within the same <Variable Name (from WhereClause)> <Comparator> '<Value (from WhereClause)>'.	Unclear rule: No idea what this means either.
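
Several of the comments above ask for tighter wording ("positive integer" instead of "numeric", "ISO 8601 date" instead of "ISO 8601 format", a precise definition of "number-number"). To show how little it takes to make such wording unambiguous, here is a minimal sketch in Python. It is not one of our XQuery rule implementations, and it is not how any validator actually implements these checks. The function names are hypothetical, and each pattern encodes an assumption of mine that the published rules do not state: that an age range consists of two non-negative integers, that "positive integer" excludes zero, and that only complete calendar dates (no partial dates, no times) count as a date. The duration pattern is simplified and ignores fractional values.

```python
import re
from datetime import date

# Assumption: "number-number" means two non-negative integers, e.g. "18-65".
AGE_RANGE = re.compile(r"[0-9]+-[0-9]+")

# Assumption: "positive integer" excludes zero and leading zeros.
POSITIVE_INTEGER = re.compile(r"[1-9][0-9]*")

# Assumption: only complete calendar dates (YYYY-MM-DD) are accepted.
FULL_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")

# Simplified ISO 8601 duration: either weeks (PnW) or a combination of
# date and time components; fractional values are not handled.
DURATION = re.compile(
    r"P(?:[0-9]+W"
    r"|(?=[0-9T])(?:[0-9]+Y)?(?:[0-9]+M)?(?:[0-9]+D)?"
    r"(?:T(?=[0-9])(?:[0-9]+H)?(?:[0-9]+M)?(?:[0-9]+S)?)?)"
)


def is_age_range(value: str) -> bool:
    """True for values such as '18-65' (see the comment on SD2019)."""
    return AGE_RANGE.fullmatch(value) is not None


def is_positive_integer(value: str) -> bool:
    """True for '300', False for '-300', '300.56' and '0'."""
    return POSITIVE_INTEGER.fullmatch(value) is not None


def is_iso8601_date(value: str) -> bool:
    """True only for a complete, existing calendar date."""
    match = FULL_DATE.fullmatch(value)
    if match is None:
        return False
    try:
        date(int(match[1]), int(match[2]), int(match[3]))
    except ValueError:
        # Pattern matched, but the date does not exist (e.g. February 30).
        return False
    return True


def is_iso8601_duration(value: str) -> bool:
    """True for durations such as 'P80Y', 'P1Y6M' or 'PT36H'."""
    return DURATION.fullmatch(value) is not None
```

With these definitions a value like "P80Y" is a duration but not a date, and "300.56" is numeric but not a positive integer, which is exactly the distinction the rules for DCUTDTC, SSTDTC, SETDTC, PLANSUB, NARMS and ACTSUB fail to make.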

TODO: add the rules that are PMDA specific