When I launched distributed training without GPUs (tree method hist) to make sure CPU-based training works, I noticed that it hung. Zero iterations had completed, and I saw the following in the log:
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - Traceback (most recent call last):
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - self.run()
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - File "/usr/lib64/python2.7/threading.py", line 765, in run
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - self.__target(*self.__args, **self.__kwargs)
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - File "/hdfs/uuid/15b919c9-64e8-43cc-a842-bc62d81ea28d/yarn/data/usercache/o.pryimak/appcache/application_1569890150796_1745812/container_e139_1569890150796_1745812_01_000001/tmp/tracker2210854510286838443.py", line 324, in run
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - self.accept_slaves(nslave)
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - File "/hdfs/uuid/15b919c9-64e8-43cc-a842-bc62d81ea28d/yarn/data/usercache/o.pryimak/appcache/application_1569890150796_1745812/container_e139_1569890150796_1745812_01_000001/tmp/tracker2210854510286838443.py", line 268, in accept_slaves
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - s = SlaveEntry(fd, s_addr)
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - File "/hdfs/uuid/15b919c9-64e8-43cc-a842-bc62d81ea28d/yarn/data/usercache/o.pryimak/appcache/application_1569890150796_1745812/container_e139_1569890150796_1745812_01_000001/tmp/tracker2210854510286838443.py", line 64, in __init__
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - assert magic == kMagic, 'invalid magic number=%d from %s' % (magic, self.host)
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - AssertionError: invalid magic number=542393671 from 172.28.42.144
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger -
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker$TrackerProcessLogger - Tracker Process ends with exit code 0
2019-10-17 00:22:29 INFO ml.dmlc.xgboost4j.java.RabitTracker - Tracker Process ends with exit code 0
2019-10-17 00:22:29 INFO XGBoostSpark - Rabit returns with exit code 0
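A possibly useful observation about the AssertionError above: the reported magic number can be decoded as four raw bytes. The sketch below assumes the tracker read the value as a 4-byte little-endian integer (typical on x86 hosts); the helper name `decode_magic` is hypothetical and not part of the tracker script.

```python
import struct


def decode_magic(value: int) -> bytes:
    """Return the 4 bytes that a little-endian signed 32-bit read would turn into `value`.

    Assumption: the tracker host is little-endian, so the first byte on the
    wire is the least significant byte of the integer.
    """
    return struct.pack("<i", value)


if __name__ == "__main__":
    magic = 542393671  # value taken from the AssertionError in the log
    print(hex(magic))           # 0x20544547
    print(decode_magic(magic))  # b'GET '
```

Under that assumption the bytes spell `GET ` followed by a space, which looks like the start of an HTTP request. This may mean that something other than a training worker (for example a health check or port scanner on 172.28.42.144) connected to the tracker port, but that is only a guess from the log.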
Could you point me to your source code repo and tell me which version (git SHA-1) you used to build 1.0.0-Beta, so I can try to troubleshoot?
Any pointers on how to work around this are also welcome.
Can I enable the Scala-based tracker? Do you know how?
Hello nice people,
I came across this article https://medium.com/rapids-ai/nvidia-gpus-and-apache-spark-one-step-closer-2d99e37ac8fd (which is why I am creating this issue).
I am very excited to start using it. It took some time to learn which versions are available: 1.0.0-Beta and 1.0.0-Beta2. I picked the latter.