Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: don't duplicate nested compound fields in metadataBlocks returned by search API #11172

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

vera
Copy link
Contributor

@vera vera commented Jan 21, 2025

What this PR does / why we need it:

The following bug affects metadata blocks which contain deeply nested compound fields (compound fields within compound fields).

Example: My metadata block exampleBlock contains the field "A", which is a compound field and contains the field "A.B", which is also a compound field and contains the fields "A.B.C1" and "A.B.C2".

"A.B.C1" and "A.B.C2" are not compound fields (e.g. could be text or numbers).

Then, if I use the search API to search for some query and request the deeply nested metadata field with metadata_fields=exampleBlock:*, the returned JSON duplicates the nested compound field A.B (once as child of A (correct), once as its own field without a parent (not correct)):

{
  "status": "OK",
  "data": {
    "q": "*:*",
    "total_count": 194,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "Test dataset",
        ...
        "metadataBlocks": {
          "resource": {
            "displayName": "exampleBlock",
            "fields": [
              {
                "typeName": "A",
                "multiple": true,
                "typeClass": "compound",
                "value": [
                  {
                    "A.B": {
                      "typeName": "A.B",
                      "multiple": false,
                      "typeClass": "compound",
                      "value": {
                        "A.B.C1": {
                          "typeName": "A.B.C1",
                          "multiple": false,
                          "typeClass": "primitive",
                          "value": "example"
                        },
                        "A.B.C2": {
                          "typeName": "A.B.C2",
                          "multiple": false,
                          "typeClass": "primitive",
                          "value": "example"
                        }
                      }
                    }
                  }
                ]
              },
              {
                "typeName": "A.B",
                "multiple": false,
                "typeClass": "compound",
                "value": {
                  "A.B.C1": {
                    "typeName": "A.B.C1",
                    "multiple": false,
                    "typeClass": "primitive",
                    "value": "example"
                  },
                  "A.B.C2": {
                    "typeName": "A.B.C2",
                    "multiple": false,
                    "typeClass": "primitive",
                    "value": "example"
                  }
                }
              }
            ]
          }
        }
      }
    ],
    "count_in_response": 1
  }
}

The code change ensures only fields on the top level of the metadata block are added into the fields array of the returned JSON (which includes their children, if there are any). This eliminates the duplication of the child compound field.

I believe this makes sense and doesn't break anything because non-top level fields aren't meant to be requested according to the documentation:

The value must be in the form “{metadata_block_name}:{field_name}” to include a specific field from a metadata block (see example) or “{metadata_field_set_name}:*” to include all the fields for a metadata block (see example). “{field_name}” cannot be a subfield of a compound field.

https://guides.dataverse.org/en/latest/api/search.html

Which issue(s) this PR closes:

/

Special notes for your reviewer:

/

Suggestions on how to test this:

I would have added a test for this bug, but it wasn't straighforward, because none of the default metadata blocks contain deeply nested compound fields.

The existing tests for metadata_fields=... can be run to confirm that they weren't broken: mvn test -Dtest="SearchIT#testSearchDynamicMetadataFields"

To test the bug described above, you need to:

  1. Create a metadata block containing nested compound fields, e.g. use one in here: nested_compound_test.tar.gz

    curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file nested_compound_test2.tsv

  2. Enable metadata block

  3. Update Solr schema, e.g. using this prepared schema.xml
    schema.txt (had to rename to .txt to upload here)

    docker cp schema.xml solr-1:/var/solr/data/collection1/conf/schema.xml (or copy wherever your Solr is running)

    curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"

  4. Create a dataset using the metadata block, e.g. using a JSON file in the archive linked above

    curl -H "X-Dataverse-key: $API_TOKEN" -X POST "http://localhost:8080/api/dataverses/root/datasets" --upload-file nested_compound_test2.json -H 'Content-type:application/json'

  5. Go to http://localhost:8080/dataset.xhtml?persistentId=... (using PID returned by the API in 4.)

  6. Publish dataset

  7. Perform a search that will find the dataset and request the nested metadata block

    curl http://localhost:8080/api/search?q=finches&type=dataset&metadata_fields=nested_compound_test2:*

  8. Check whether the nested compound fields are returned correctly

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

/

Is there a release notes update needed for this change?:

/

Additional documentation:

/

@coveralls
Copy link

Coverage Status

coverage: 22.751%. remained the same
when pulling a3d5410 on vera:fix-search-api-nested-metadata-fields
into 69ebed2 on IQSS:develop.

@cmbz
Copy link

cmbz commented Jan 27, 2025

2025/01/27: Moving to Needs Sizing, possibly a size 3

@ffritze
Copy link
Contributor

ffritze commented Jan 29, 2025

Maybe this issue is already solved by this #10764 ? The fix should be available in the next release, I suppose.

@vera
Copy link
Contributor Author

vera commented Jan 29, 2025

@ffritze Thanks for the pointer! I just tested the search API on develop, which includes your changes. (I've also just added more detail to the "how to test" steps above)

The problem doesn't seem to be fixed. Here's what the search API returns on develop:

...
        "metadataBlocks": {
          "nested_compound_test2": {
            "displayName": "Nested Compound Test 2",
            "fields": [
              {
                "typeName": "person2",
                "multiple": true,
                "typeClass": "compound",
                "value": [
                  {
                    "identifier2": {
                      "typeName": "identifier2",
                      "multiple": false,
                      "typeClass": "compound",
                      "value": {
                        "identifier_type2": {
                          "typeName": "identifier_type2",
                          "multiple": false,
                          "typeClass": "primitive",
                          "value": "ORCID"
                        },
                        "identifier_text2": {
                          "typeName": "identifier_text2",
                          "multiple": false,
                          "typeClass": "primitive",
                          "value": "abcdef123456"
                        }
                      }
                    }
                  },
                  {
                    "identifier2": {
                      "typeName": "identifier2",
                      "multiple": false,
                      "typeClass": "compound",
                      "value": {
                        "identifier_type2": {
                          "typeName": "identifier_type2",
                          "multiple": false,
                          "typeClass": "primitive",
                          "value": "ORCID"
                        },
                        "identifier_text2": {
                          "typeName": "identifier_text2",
                          "multiple": false,
                          "typeClass": "primitive",
                          "value": "ghijkl123456"
                        }
                      }
                    }
                  }
                ]
              },
              {
                "typeName": "identifier2",
                "multiple": false,
                "typeClass": "compound",
                "value": {
                  "identifier_type2": {
                    "typeName": "identifier_type2",
                    "multiple": false,
                    "typeClass": "primitive",
                    "value": "ORCID"
                  },
                  "identifier_text2": {
                    "typeName": "identifier_text2",
                    "multiple": false,
                    "typeClass": "primitive",
                    "value": "abcdef123456"
                  }
                }
              },
              {
                "typeName": "identifier2",
                "multiple": false,
                "typeClass": "compound",
                "value": {
                  "identifier_type2": {
                    "typeName": "identifier_type2",
                    "multiple": false,
                    "typeClass": "primitive",
                    "value": "ORCID"
                  },
                  "identifier_text2": {
                    "typeName": "identifier_text2",
                    "multiple": false,
                    "typeClass": "primitive",
                    "value": "ghijkl123456"
                  }
                }
              }
            ]
          }
        },
...

(Nested field identifier2 is still duplicated)

Here's what the search API returns with this PR:

{
...
        "metadataBlocks": {
          "nested_compound_test2": {
            "displayName": "Nested Compound Test 2",
            "fields": [
              {
                "typeName": "person2",
                "multiple": true,
                "typeClass": "compound",
                "value": [
                  {
                    "identifier2": {
                      "typeName": "identifier2",
                      "multiple": false,
                      "typeClass": "compound",
                      "value": {
                        "identifier_type2": {
                          "typeName": "identifier_type2",
                          "multiple": false,
                          "typeClass": "primitive",
                          "value": "ORCID"
                        },
                        "identifier_text2": {
                          "typeName": "identifier_text2",
                          "multiple": false,
                          "typeClass": "primitive",
                          "value": "abcdef123456"
                        }
                      }
                    }
                  },
                  {
                    "identifier2": {
                      "typeName": "identifier2",
                      "multiple": false,
                      "typeClass": "compound",
                      "value": {
                        "identifier_type2": {
                          "typeName": "identifier_type2",
                          "multiple": false,
                          "typeClass": "primitive",
                          "value": "ORCID"
                        },
                        "identifier_text2": {
                          "typeName": "identifier_text2",
                          "multiple": false,
                          "typeClass": "primitive",
                          "value": "ghijkl123456"
                        }
                      }
                    }
                  }
                ]
              }
            ]
          }
        },
...

@qqmyers qqmyers added the Size: 10 A percentage of a sprint. 7 hours. label Jan 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Size: 10 A percentage of a sprint. 7 hours.
Projects
Status: SPRINT READY
Development

Successfully merging this pull request may close these issues.

5 participants