Inconsistent Usage in Rucio #569

Closed
dynamic-entropy opened this issue Jul 18, 2023 · 8 comments

@dynamic-entropy
Contributor

Taking the following as the correct definitions of the various space categories in Rucio:

From here, #212

Expect the twiki soon, but in short:
Expired: no rule locks the data, it may be deleted lazily as space is needed
Obsolete: no rule locks the data and it should be greedily deleted (this is a subset of expired)
Rucio: total space (locked + expired + unavailable)
        per account: each account’s rules added up, may be > rucio (overlap of rule locks)
Static: this is the eventual watermark after which expired data starts to get cleaned up (to be activated very soon, set to DDM quota + phedex local usage at first, there will be a call to sites soon if they want to change this)
Unavailable: data volume pending transfer to the site

The reported usage for almost all Rucio RSEs is wrong. This causes frequent problems and unexpected emergencies when managing and defining usage.


Steps to reproduce:
rucio-admin rse info T1_UK_RAL_Disk_Test

  rucio
    files: 1
    free: None
    rse: T1_UK_RAL_Disk_Test
    rse_id: d6896b5cf7574b21b24b111662833cd9
    source: rucio
    total: 89456809
    updated_at: 2022-09-14 17:33:37
    used: 89456809
  unavailable
    files: 15
    free: None
    rse: T1_UK_RAL_Disk_Test
    rse_id: d6896b5cf7574b21b24b111662833cd9
    source: unavailable
    total: 41354994537
    updated_at: 2022-07-26 18:00:01
    used: 41354994537

However, there are no locks on the rse in the locks table.
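
The mismatch can be stated as two simple checks against the definitions quoted above. This is only a minimal sketch: the `UsageRecord` class and the `check_usage` function are hypothetical names for illustration, not part of Rucio, and the byte counts are copied from the `rucio-admin rse info` output above. It assumes that "rucio" is the total (locked + expired + unavailable) and that unavailable data must be backed by a lock.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UsageRecord:
    """One usage entry as printed by `rucio-admin rse info` (hypothetical container)."""
    source: str
    files: int
    used: int  # bytes


def check_usage(rucio: UsageRecord, unavailable: UsageRecord, locked_bytes: int) -> List[str]:
    """Return the ways the two records contradict the quoted definitions."""
    problems = []
    # "rucio" is defined as locked + expired + unavailable, so it cannot be smaller
    # than the unavailable part alone.
    if unavailable.used > rucio.used:
        problems.append("unavailable exceeds the rucio total")
    # Unavailable means data pending transfer, which requires a rule and hence a lock.
    if locked_bytes == 0 and unavailable.used > 0:
        problems.append("unavailable is non-zero but nothing is locked")
    return problems


# Values for T1_UK_RAL_Disk_Test taken from the output above; no locks were found.
rucio = UsageRecord(source="rucio", files=1, used=89456809)
unavailable = UsageRecord(source="unavailable", files=15, used=41354994537)
problems = check_usage(rucio, unavailable, locked_bytes=0)
for problem in problems:
    print(problem)
```

Both checks fail for this RSE, which is what makes the reported usage inconsistent with the definitions.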

This could be related to #471
It would be nice if @KatyEllis can also give us the actual usage of the directory

Triggered by: https://its.cern.ch/jira/browse/CMSDM-81

FYI @dciangot @KatyEllis

@yuyiguo
Member

yuyiguo commented Aug 4, 2023

There was only one file on rse_id: d6896b5cf7574b21b24b111662833cd9. The file is name=/store/mc/RunIIFall17NanoAODv7/GluGluSpin0ToGammaGamma_W_5p6_M_1250_TuneCP2_13TeV_pythia8/NANOAODSIM/PU2017_12Apr2018_Nano02Apr2020_102X_mc2017_realistic_v8-v1/100000/64B8DA1A-EBA4-004D-87EA-9CA1BCF3F1B9.root
with bytes=89456809. However, no rules were associated with this file, which is why there is no lock on the RSE.
In addition, there was no rule for the replica in the rule_history table.
@dynamic-entropy Do you know how the file was uploaded into Rucio? Can we reproduce the problem? Do other RSEs have this problem?
Can someone check what other files are on the RSE? These files, which are unavailable to Rucio, account for 41354994537 bytes.

@dynamic-entropy
Contributor Author

dynamic-entropy commented Aug 6, 2023

Hi Yuyi
I am not aware of how the file was uploaded.
All RSEs have the same problem. I just took this as an example for simplicity.
There are no files on the RSE.

Unavailable: data volume pending transfer to the site

The reason we are checking for locks and rules is the definition of unavailable quoted above: without a lock, there should be no unavailable usage.

@yuyiguo
Member

yuyiguo commented Aug 10, 2023

Hi @dynamic-entropy,
I tried querying multiple tables to figure out what was going on, but without much success. The queries took a long time to run, so debugging was very slow. After talking with my FNAL colleague, we opened a question on Rucio support. Please see https://mattermost.web.cern.ch/rucio/pl/jktrzoenotggpfhptahr8wx4jo .
What is our goal here? To reclaim the "used" space?

@yuyiguo
Member

yuyiguo commented Aug 11, 2023

There is an Oracle procedure that updates the unavailable usage in the rse_usage table. The unavailable value is calculated from the replicas table. However, the replicas table currently holds no data for this RSE. It seems that the operations on the different tables did not run in order.
The discussion with the Rucio dev is continuing; more later.

@yuyiguo
Member

yuyiguo commented Aug 11, 2023

Here is an example the Rucio dev gave of how this can happen:

An RSE is completely empty.
You add one rule on a single file DID. One replica is created in state COPYING.
The procedure runs and sets the unavailable RSE usage counter to 1.
The transfer completes, the replica state becomes AVAILABLE and the rule state becomes OK.
The procedure runs again. Since no replicas are in state UNAVAILABLE or COPYING, the value of the RSE usage counter remains 1.
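
The sequence above can be reproduced with a toy model. This is a sketch only, not the actual Oracle procedure: the function names, the `RSE_A` name, and the in-memory data structures are made up for illustration. It assumes the procedure writes a counter only for RSEs that currently have replicas in COPYING or UNAVAILABLE, as described. The second function shows one possible alternative (resetting idle RSEs to zero), which is an assumption and not something the Rucio dev proposed.

```python
from collections import Counter

PENDING_STATES = {"COPYING", "UNAVAILABLE"}


def refresh_skipping_idle(counters, replicas):
    """Model of the described behaviour: only RSEs with pending replicas are written."""
    pending = Counter(rse for rse, state in replicas if state in PENDING_STATES)
    for rse, count in pending.items():
        counters[rse] = count
    return counters


def refresh_with_reset(counters, replicas):
    """Hypothetical variant: every known RSE is rewritten, idle ones with zero."""
    pending = Counter(rse for rse, state in replicas if state in PENDING_STATES)
    for rse in set(counters) | set(pending):
        counters[rse] = pending.get(rse, 0)
    return counters


# The RSE starts empty; a rule creates one replica in state COPYING.
counters = {}
replicas = [("RSE_A", "COPYING")]
refresh_skipping_idle(counters, replicas)      # counter becomes 1

# The transfer completes and the replica becomes AVAILABLE.
replicas = [("RSE_A", "AVAILABLE")]
refresh_skipping_idle(counters, replicas)      # nothing pending, counter untouched
stale = counters["RSE_A"]

# The same data run through the resetting variant.
fixed = refresh_with_reset(dict(counters), replicas)["RSE_A"]
print(stale, fixed)
```

In this model the counter keeps its old value until another replica on the same RSE enters a pending state, which matches the behaviour reported for rarely used RSEs.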

There is no procedure that updates the UNAVAILABLE or COPYING usage until the next COPYING replica appears, right?
The state of a replica is updated by different daemons, depending on what happens.

Doesn't ATLAS care about the difference in RSE usage between Rucio and storage?
The RSEs that matter are always active (i.e. there is at least one active transfer). So the unavailable RSE usage counter is updated regularly.

Based on the above discussion with the Rucio dev, I understand that these inconsistencies only occur on infrequently used RSEs in ATLAS, so they simply ignore them.

Which RSEs are affected in CMS? @dynamic-entropy

@yuyiguo
Member

yuyiguo commented Sep 12, 2023

@dynamic-entropy please review the issue.

@yuyiguo
Member

yuyiguo commented Sep 13, 2023

Hi @dynamic-entropy, if you have no further comments or input, I will close this ticket soon. As we discussed in the Rucio meeting, we need to keep ticket statuses up to date. If you need to discuss this in the future, you can open a new ticket.

@yuyiguo
Member

yuyiguo commented Sep 20, 2023

Hi @dynamic-entropy,
We ran out of time on this issue as we are closing the quarter. If you still need help, please open a new issue.

@yuyiguo yuyiguo closed this as completed Sep 20, 2023