Remove formatting restrictions on <instanceID> (submission UUID) strings #861

Closed
jnm opened this issue Jan 25, 2023 · 2 comments

jnm commented Jan 25, 2023

The OpenRosa spec allows people to implement a "custom ID scheme" for <instanceID>, although it "must be a universally unique string identifying this specific submission".

We don't currently support this:

    def get_uuid_from_xml(xml):
        def _uuid_only(uuid, regex):
            matches = regex.match(uuid)
            if matches and len(matches.groups()) > 0:
                return matches.groups()[0]
            return None

        uuid = get_meta_from_xml(xml, "instanceID")
        regex = re.compile(r"uuid:(.*)")
        if uuid:
            return _uuid_only(uuid, regex)
        # check in survey_node attributes
        xml = clean_and_parse_xml(xml)
        children = xml.childNodes
        # children ideally contains a single element
        # that is the parent of all survey elements
        if children.length == 0:
            raise ValueError(t("XML string must have a survey element."))
        survey_node = children[0]
        uuid = survey_node.getAttribute('instanceID')
        if uuid != '':
            return _uuid_only(uuid, regex)
        return None

Some people have uploaded submissions containing the likes of

<meta><instanceID>RYAPHKJBDOZJWQ2W5BDXQ0MLU</instanceID>…

This results in the logger_instance.uuid column diverging from the <instanceID> in the XML.
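
To see why the values diverge, the prefix check can be reproduced in isolation. The snippet below is a minimal sketch that reuses only the regex from `get_uuid_from_xml`; `extract_uuid` is a hypothetical stand-in for the inner `_uuid_only` helper, and the first sample ID is made up for illustration.

```python
import re

# Same pattern as in get_uuid_from_xml
UUID_REGEX = re.compile(r"uuid:(.*)")


def extract_uuid(instance_id):
    """Hypothetical stand-in for _uuid_only: return the part after 'uuid:'."""
    matches = UUID_REGEX.match(instance_id)
    if matches and len(matches.groups()) > 0:
        return matches.groups()[0]
    # No 'uuid:' prefix at the start of the string, so nothing is extracted
    return None


# A conventionally formatted ID (made-up value) passes the check...
standard = extract_uuid("uuid:3f2a9c1e-5b7d-4e8a-9c0f-1a2b3c4d5e6f")
# ...while the custom ID from the example submission does not.
custom = extract_uuid("RYAPHKJBDOZJWQ2W5BDXQ0MLU")
```

Since nothing is extracted from the custom ID, the value that ends up in `logger_instance.uuid` cannot be the string found in the XML.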

We should:

  1. Enforce UUID uniqueness across the entire logger_instance table;
  2. After that, remove our formatting restrictions on the <instanceID> string.
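
For the second step, one possible shape of a relaxed extractor is sketched below. This is only an illustration: `extract_instance_id` is a hypothetical name, and whether the `uuid:` prefix should still be stripped once custom schemes are accepted is an assumption, not something decided in this issue.

```python
UUID_PREFIX = "uuid:"


def extract_instance_id(raw):
    """Hypothetical relaxed extractor.

    Accepts any non-empty <instanceID> string. A leading 'uuid:' prefix is
    stripped (assumption) so that conventionally formatted IDs keep
    producing the same value as before.
    """
    if raw is None:
        return None
    value = raw.strip()
    if value.startswith(UUID_PREFIX):
        value = value[len(UUID_PREFIX):]
    # Treat an empty result as "no ID present"
    return value or None
```

With this sketch, `extract_instance_id("RYAPHKJBDOZJWQ2W5BDXQ0MLU")` returns the custom string unchanged instead of `None`.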

jnm commented Aug 5, 2024

We could enforce xml_hash uniqueness before removing the formatting restrictions, but we could not (I was mistaken) enforce UUID uniqueness while the formatting restrictions are in place. That's because we need to identify cases where UUIDs collide but XML content is not identical, and then rewrite their UUIDs to append something like "DUPLICATE n" (which would violate the existing formatting restrictions).
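
A rough sketch of that rewrite step, on plain tuples rather than the real table: `plan_uuid_rewrites` and the `(pk, uuid, xml_hash)` row shape are hypothetical, and the exact suffix format (a space followed by "DUPLICATE n") is an assumption. Rows that share both UUID and XML hash are treated as true duplicates and receive the same value.

```python
from collections import defaultdict


def plan_uuid_rewrites(rows):
    """Return {pk: new_uuid} for rows whose UUID collides with an earlier
    row that has different XML content.

    rows: iterable of (pk, uuid, xml_hash) tuples (hypothetical shape).
    """
    by_uuid = defaultdict(list)
    for pk, uuid, xml_hash in rows:
        by_uuid[uuid].append((pk, xml_hash))

    rewrites = {}
    for uuid, entries in by_uuid.items():
        # Maps each distinct xml_hash to its rank among the hashes seen for
        # this UUID; rank 0 (the earliest pk) keeps the original UUID.
        rank_for_hash = {}
        for pk, xml_hash in sorted(entries):
            if xml_hash not in rank_for_hash:
                rank_for_hash[xml_hash] = len(rank_for_hash)
            n = rank_for_hash[xml_hash]
            if n > 0:
                rewrites[pk] = f"{uuid} DUPLICATE {n}"
    return rewrites


rows = [
    (1, "u", "a"),
    (2, "u", "b"),  # same UUID as row 1, different content
    (3, "u", "a"),  # identical to row 1: left alone
    (4, "v", "c"),
    (5, "u", "b"),  # identical to row 2: gets the same new UUID
    (6, "u", "d"),
]
rewrites = plan_uuid_rewrites(rows)
```

The rewritten values contain a space and no `uuid:` prefix, which is why this cannot be done while the formatting restrictions are still in place.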

Some internal discussion at https://chat.kobotoolbox.org/#narrow/stream/4-Kobo-Dev/topic/Duplicated.20submission.20UUIDs/near/192516

@noliveleger

Closed because this is going to be handled by kobotoolbox/kpi#5047.
