
How to get started? #76

Open
pankajkumar229 opened this issue Nov 6, 2021 · 11 comments

@pankajkumar229

Is there an easier way to get started?

I tried to set up a machine and install all the requirements. I will try to go further tomorrow, but maybe I am doing something wrong.

The error I am currently stuck on is:
"""
2021-11-05 22:23:59.523515: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "run_clm_apps.py", line 800, in
main()
File "run_clm_apps.py", line 342, in main
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/home/pankaj/.local/lib/python3.8/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 14, in init
File "run_clm_apps.py", line 174, in post_init
raise ValueError("Need either a dataset name or a training/validation file.")
ValueError: Need either a dataset name or a training/validation file.
"""
Also, getting the requirements to work was quite difficult on my machine. Wondering if I am doing something wrong.
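
From the traceback, the check at run_clm_apps.py line 174 wants either a dataset name or a training file, so presumably one of these two invocation styles works (--train_file is my guess based on the standard HF script interface and may be named differently here):

python3 run_clm_apps.py --output_dir ./output --dataset_name CodedotAI/code_clippy
python3 run_clm_apps.py --output_dir ./output --train_file ./train.json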

@ncoop57
Collaborator

ncoop57 commented Nov 13, 2021

What OS are you trying to run this on? Also, it looks like you do not have CUDA installed properly, which will make it difficult to train quickly:

2021-11-05 22:23:59.523515: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
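
You can sanity-check what the script will actually see with something like this (a minimal sketch; the training scripts here run on JAX/Flax):

import jax

# With a working CUDA install this should report "gpu" and list GPU
# devices; with a broken install it falls back to "cpu".
print(jax.default_backend())
print(jax.devices())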

@pankajkumar229
Author

Hi ncoop57, can you help me a little more? I am trying to load the data. I downloaded the dataset from the-eye.eu, but I am not able to correctly pass it to training. Please help.

pankaj@lc-tower1:~/source/gpt-code-clippy/training$ python3 run_clm_apps.py --output_dir /data/opengpt/output/ --dataset_name /data/opengpt/the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data
11/15/2021 11:34:58 - INFO - absl - Starting the local TPU driver.
11/15/2021 11:34:58 - INFO - absl - Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
11/15/2021 11:34:58 - INFO - absl - Unable to initialize backend 'gpu': Not found: Could not find registered platform with name: "cuda". Available platform names are: Host Interpreter
11/15/2021 11:34:58 - INFO - absl - Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
11/15/2021 11:34:58 - WARNING - absl - No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
11/15/2021 11:34:58 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=-1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/data/opengpt/output/runs/Nov15_11-34-58_lc-tower1,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
output_dir=/data/opengpt/output/,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=output,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=/data/opengpt/output/,
save_on_each_node=False,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
Traceback (most recent call last):
  File "run_clm_apps.py", line 800, in <module>
    main()
  File "run_clm_apps.py", line 384, in main
    whole_dataset = load_dataset(
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/load.py", line 811, in load_dataset
    module_path, hash, resolved_file_path = prepare_module(
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/load.py", line 365, in prepare_module
    raise FileNotFoundError(
FileNotFoundError: Couldn't find file locally at /data/opengpt/the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data/code_clippy_dedup_data.py. Please provide a valid dataset name

@ncoop57
Collaborator

ncoop57 commented Nov 17, 2021

Do you have this file stored here?

/data/opengpt/the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data/code_clippy_dedup_data.py

If not, I believe this file is the same one: https://github.com/CodedotAl/gpt-code-clippy/blob/camera-ready/data_processing/code_clippy_filter.py
So if you copy that one and put it at the above path, I believe you should be good to go.
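
For context, when load_dataset is given a local directory it looks for a loader script named after that directory, which is why the error mentions code_clippy_dedup_data.py. A minimal sketch, using the path from your run:

from datasets import load_dataset

# datasets resolves the directory /.../code_clippy_dedup_data to the script
# /.../code_clippy_dedup_data/code_clippy_dedup_data.py, so the copied file
# must be renamed to match the directory name.
dataset = load_dataset(
    "/data/opengpt/the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data"
)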

@pankajkumar229
Author

pankajkumar229 commented Nov 21, 2021

I tried the other way. Is it possible that the Hugging Face method no longer works because the data page's format has changed? I will try the download method now.

Error details:
Command used

python3 run_clm_apps.py --output_dir /data/opengpt/output/ --dataset_name CodedotAI/code_clippy

Output

[truncated dump of HTML/JavaScript from the the-eye.eu directory listing page, printed before the traceback]
Traceback (most recent call last):
  File "run_clm_apps.py", line 800, in <module>
    main()
  File "run_clm_apps.py", line 384, in main
    whole_dataset = load_dataset(
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/load.py", line 827, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/load.py", line 687, in load_dataset_builder
    builder_cls = import_main_class(module_path, dataset=True)
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/load.py", line 91, in import_main_class
    module = importlib.import_module(module_path)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/pankaj/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/d332f69d036e8c80f47bc9a96d676c3fa30cb50af7bb81e2d4d12e80b83efc4d/code_clippy.py", line 68, in <module>
    url_elements = results.find_all("a")
AttributeError: 'NoneType' object has no attribute 'find_all'


@pankajkumar229
Author

If there is a specific command you use, could you share it? @ncoop57

@pankajkumar229
Author

After some "don't give up" self-talk while watching the last season of TBBT, I am finally able to at least get the data download to start.

Here are the steps:

  1. Make sure to install requirements.txt properly, and also install any packages that raise warnings during runtime.
  2. While running, it will download a file code_clippy.py into the home directory and run it. It has a few issues that need to be fixed (a sketch of the patched section follows the command below):
    - Line 62: needs another slash at the end
    - Line 65: needs tbody instead of pre
    - Lines 66, 69, 70, 71: need to skip the first element of the table

Here is the command that I ran: """python3 run_clm_apps.py --output_dir /data/opengpt/output/ --cache_dir /data/opengpt/cache --dataset_name CodedotAI/code_clippy"""
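
For reference, here is roughly what the patched section of code_clippy.py looks like after those fixes (results and url_elements come from the traceback above; the other names and the surrounding code are my reconstruction):

import requests
from bs4 import BeautifulSoup

# Fix for line 62: the listing URL needs a trailing slash.
url = "https://the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data/"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Fix for line 65: the page now puts the file table in a tbody, not a pre.
results = soup.find("tbody")

# Fix for lines 66, 69, 70, 71: skip the first element of the table
# (the parent-directory link) before collecting file URLs.
url_elements = results.find_all("a")[1:]
data_files = [url + element.get("href") for element in url_elements]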

@reshinthadithyan
Collaborator

Hello Pankaj. Are you trying to fine-tune a model with the dataset? If so, my suggestion would be the following:

  • Use this script to download and use the dataset. You can load it into a datasets.Dataset object with datasets.load_dataset("PATH_TO_DATASETS_SCRIPT", split="train"); see the sketch at the end of this comment.
  • The scripts you're trying to use were built as early as July/August 2021, and there has been a good amount of change w.r.t. the Flax APIs in transformers since then.
  • My suggestion would be to feed this datasets.Dataset object to the standard, occasionally updated run_clm.py script provided by HF, since this would give you a much quicker hack.

Let me know if I am understanding your problem right and whether the solution is catered to your need. Feel free to make a PR if you find a fix for the existing bugs. Thanks.
- Reshinth
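
Concretely, the first bullet amounts to something like this sketch (PATH_TO_DATASETS_SCRIPT is a placeholder for wherever you saved the script):

from datasets import load_dataset

# Load the Code Clippy dataset through its loader script into a
# datasets.Dataset object.
dataset = load_dataset("PATH_TO_DATASETS_SCRIPT", split="train")
print(dataset)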

@pankajkumar229
Author

Hi Reshinth,

I am trying to run it on my new GPU and see how good it can get, if possible. I am new to using transformers. So you are suggesting that I download the run_clm.py file, run it, and pass the code_clippy Python file as a parameter. Let me try that.

@pankajkumar229
Author

Hi @reshinthadithyan and @ncoop57, the data download often breaks with read timeouts. Is there a way to handle it?
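
Would something like this crude retry loop be a reasonable workaround (my own sketch, not from the repo)?

import time
from datasets import load_dataset

def load_with_retries(script_path, attempts=5, wait_seconds=30):
    # Naive retry around load_dataset; files that finished downloading
    # are cached, so a retry should not start completely from scratch.
    for attempt in range(attempts):
        try:
            return load_dataset(script_path, split="train")
        except Exception as err:
            if attempt == attempts - 1:
                raise
            print(f"Download failed ({err!r}); retrying in {wait_seconds}s")
            time.sleep(wait_seconds)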

@pankajkumar229
Author

I think I could download the data, but the command gives an error at the end before loading the data (would 64 GB of memory suffice?):

/usr/bin/python3.8 /home/pankaj/tools/pycharm-community-2020.2/plugins/python-ce/helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 127.0.0.1 --port 45211 --file /home/pankaj/PycharmProjects/CodeClippy/CodeClippyMain.py
pydev debugger: process 3990648 is connecting
Connected to pydev debugger (build 202.6397.98)
Using the latest cached version of the module from /home/pankaj/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/d332f69d036e8c80f47bc9a96d676c3fa30cb50af7bb81e2d4d12e80b83efc4d (last modified on Sun Nov 21 18:46:34 2021) since it couldn't be found locally at ~/source/datasets/datasets/code_clippy/code_clippy.py/code_clippy.py or remotely (FileNotFoundError).
Using the latest cached version of the module from /home/pankaj/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/d332f69d036e8c80f47bc9a96d676c3fa30cb50af7bb81e2d4d12e80b83efc4d (last modified on Sun Nov 21 18:46:34 2021) since it couldn't be found locally at ~/source/datasets/datasets/code_clippy/code_clippy.py/code_clippy.py or remotely (FileNotFoundError).
No config specified, defaulting to: code_clippy/code_clippy_dedup_data
Downloading and preparing dataset code_clippy/code_clippy_dedup_data (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /data/opengpt/code_clippy/code_clippy_dedup_data/0.1.0/d332f69d036e8c80f47bc9a96d676c3fa30cb50af7bb81e2d4d12e80b83efc4d...
Traceback (most recent call last):
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/builder.py", line 1075, in _prepare_split
    writer.write(example, key)
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 347, in write
    self.write_examples_on_file()
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 292, in write_examples_on_file
    pa_array = pa.array(typed_sequence)
  File "pyarrow/array.pxi", line 223, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 99, in __arrow_array__
    out = pa.array(self.data, type=type)
  File "pyarrow/array.pxi", line 306, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/builder.py", line 583, in download_and_prepare
    self._download_and_prepare(
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/builder.py", line 661, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/builder.py", line 1077, in _prepare_split
    num_examples, num_bytes = writer.finalize()
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 417, in finalize
    self.write_examples_on_file()
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 292, in write_examples_on_file
    pa_array = pa.array(typed_sequence)
  File "pyarrow/array.pxi", line 223, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/home/pankaj/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 99, in __arrow_array__
    out = pa.array(self.data, type=type)
  File "pyarrow/array.pxi", line 306, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 141, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
python-BaseException
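
If I read the traceback right, the Arrow writer expected a string field but some example contains an int. Maybe a coercion inside the loader script's _generate_examples would work around it? A guess, not a verified fix (the field handling here is hypothetical):

def _sanitize(example):
    # Coerce int values to str so pyarrow never sees a type that
    # disagrees with the declared string schema.
    return {key: str(value) if isinstance(value, int) else value
            for key, value in example.items()}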

@ncoop57
Collaborator

ncoop57 commented Nov 29, 2021

Hey @pankajkumar229, that should be enough as long as you read it with streaming mode enabled; otherwise it will not work. However, the error you are showing seems to be a different one, and I'm unsure why it is happening. Could you share the CodeClippyMain.py file you are running?
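
Something like this (a minimal sketch; the script path is whatever loader you are pointing at):

from datasets import load_dataset

# streaming=True returns an IterableDataset that yields examples lazily
# instead of materializing the whole dataset in memory/on disk first.
dataset = load_dataset("PATH_TO_DATASETS_SCRIPT", split="train", streaming=True)
for example in dataset:
    ...  # consume one example at a time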
