
[spark] Clean empty directory after removing orphan files #4824

Open
askwang wants to merge 7 commits into base: master from clean_empty_directory

Conversation

askwang
Contributor

@askwang askwang commented Jan 3, 2025

Purpose

Empty directories are not cleaned up after executing remove_orphan_files.

Tests

API and Format

Documentation

@@ -137,6 +138,7 @@ case class SparkOrphanFilesClean(
it =>
var deletedFilesCount = 0L
var deletedFilesLenInBytes = 0L
val involvedDirectories = new ArrayBuffer[String]()
Contributor

Should this be a Set, so the directories are deduplicated?
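The reviewer's point is that a `Set` collapses duplicates automatically, while an `ArrayBuffer` would record the same bucket directory once per orphan file deleted from it. A minimal sketch of the idea (the class and method names here are illustrative, not Paimon's actual code):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class InvolvedDirs {
    // Collect the parent directory of every deleted file. Using a HashSet
    // means a bucket directory holding many orphan files is recorded once.
    public static Set<String> parentDirs(List<String> deletedFiles) {
        Set<String> dirs = new HashSet<>();
        for (String file : deletedFiles) {
            int slash = file.lastIndexOf('/');
            if (slash > 0) {
                dirs.add(file.substring(0, slash)); // duplicates collapse here
            }
        }
        return dirs;
    }
}
```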

deletedFilesCount += 1
}
logInfo(
s"Total cleaned files: $deletedFilesCount, Total cleaned files len : $deletedFilesLenInBytes")
Iterator.single((deletedFilesCount, deletedFilesLenInBytes))
Iterator.single((deletedFilesCount, deletedFilesLenInBytes, involvedDirectories))
Contributor

Could we collect the distinct directories and clean them concurrently?
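One way to read this suggestion: deduplicate the candidate directories first, then attempt the deletions from a small thread pool. A hedged sketch, assuming `java.nio.file` semantics where deleting a non-empty directory throws and can simply be ignored (this is not Paimon's `FileIO`, just an illustration of the pattern):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentDirClean {
    // Deduplicate the candidates, then try to delete each one in parallel.
    // Files.delete only succeeds on an empty directory; failures mean the
    // directory still has content (or is already gone), so we keep it.
    public static void cleanEmptyDirs(Collection<Path> candidates, int threads)
            throws InterruptedException {
        Set<Path> distinct = new HashSet<>(candidates); // distinct directories
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Path dir : distinct) {
            pool.submit(() -> {
                try {
                    Files.delete(dir); // succeeds only if the directory is empty
                } catch (IOException ignored) {
                    // non-empty or already deleted: nothing to do
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```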

@@ -91,6 +95,10 @@ public abstract class OrphanFilesClean implements Serializable {
protected final int partitionKeysNum;
protected final Path location;

private static final String THREAD_NAME = "ORPHAN-FILES-CLEAN-THREAD-POOL";
private static final ThreadPoolExecutor executorService =
Contributor

Do not introduce another thread pool; for LocalOrphanFilesClean, you should just reuse the executor it already has.

// clean empty directories
val deletedPaths =
deleted.flatMap { case (_, _, paths) => paths }.collect().map(new Path(_)).toSet
cleanEmptyDirectory(deletedPaths.asJava)
Contributor

You should use Spark distributed RDD or DataFrame to clean them.


randomlyOnlyExecute(executorService, this::tryDeleteEmptyDirectory, deletedPaths);

for (int level = 0; level < partitionKeysNum; level++) {
Contributor

What is the level used for?

Contributor Author

When the empty bucket directories are cleaned, the parent partition directories should also be cleaned if they become empty.
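The idea behind the level loop can be sketched as a bottom-up climb: after each pass of deletions, try the parents of whatever was just removed, and stop after `partitionKeysNum` levels so the table root is never touched. This is a simplified illustration using `java.nio.file`, not Paimon's actual implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class ParentDirClean {
    // Walk upward one level per iteration: a parent partition directory is
    // deleted only once it has become empty. partitionKeysNum bounds how far
    // up we climb, so the table's location itself is never removed.
    public static void cleanUpward(Set<Path> deletedBucketDirs, int partitionKeysNum) {
        Set<Path> current = new HashSet<>(deletedBucketDirs);
        for (int level = 0; level < partitionKeysNum; level++) {
            Set<Path> deletedParents = new HashSet<>();
            for (Path dir : current) {
                Path parent = dir.getParent();
                if (parent == null) {
                    continue;
                }
                try {
                    Files.delete(parent); // only succeeds when now empty
                    deletedParents.add(parent);
                } catch (IOException ignored) {
                    // parent still has siblings: stop climbing this branch
                }
            }
            current = deletedParents;
        }
    }
}
```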

@askwang askwang closed this Jan 6, 2025
@askwang askwang reopened this Jan 6, 2025
@JingsongLi
Contributor

cc @Zouxxyy

}
.cache()
Contributor

Can we avoid caching here?

Contributor Author

This dataset needs to be used twice, so the cache currently cannot be removed. Do you have a better way?

@askwang
Contributor Author

askwang commented Jan 9, 2025

@Zouxxyy Thanks very much for your review; I have adjusted the code. Please take another look when you are free.

@askwang askwang force-pushed the clean_empty_directory branch from c19c457 to 28e50eb Compare January 9, 2025 12:24
Contributor

@Zouxxyy Zouxxyy left a comment


The overall implementation idea is already quite clear: in each task, clean the empty directories of the involved bucket set.

And remember to clean up any unused code, such as "ORPHAN-FILES-CLEAN-THREAD-POOL," etc.

return new CleanOrphanFilesResult(
deleteFiles.size(), deletedFilesLenInBytes.get(), deleteFiles);
}

private void cleanEmptyDirectory(List<Path> deleteFiles) {
Contributor

  1. Extract a common method that can be used by core, Spark, and so on. Note: the bucket and partition logic can be processed together, as long as we set level = partitionKeysNum + 1.
  2. Let the entire method be executed through randomlyOnlyExecute.
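For context on the second point, a randomlyOnlyExecute-style helper lets every worker thread poll items from one shared queue until it drains, spreading the work across the pool without pre-partitioning it. The sketch below is a generic illustration of that pattern; the names and signature are hypothetical, not Paimon's `ThreadPoolUtils` API:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Consumer;

public class RandomExecute {
    // All workers poll from a single concurrent queue, so each item is
    // processed exactly once and idle threads naturally pick up more work.
    public static <T> void onlyExecute(ExecutorService pool, Consumer<T> action, Collection<T> work)
            throws InterruptedException, ExecutionException {
        Queue<T> queue = new ConcurrentLinkedQueue<>(work);
        int workers = 4; // fixed for this sketch
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            futures.add(pool.submit(() -> {
                T item;
                while ((item = queue.poll()) != null) {
                    action.accept(item);
                }
            }));
        }
        for (Future<?> f : futures) {
            f.get(); // wait for completion and propagate any failure
        }
    }
}
```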
