I am starting this file in the middle of the project to keep track of why I am making changes, since I am working alone on this. This file is also more of a friend for me to brainstorm with. Let's start:
a) The current problem with the project is that I can see deep_trader improving: the average reward is increasing on test data. But that means little to me right now, because it is very hard to see what decisions deep_trader is making from the standardized data. I am thinking of implementing code that actually shows the decision per episode, and with real prices. This will help me analyze each episode much better. Let's get started on it; I don't think version 0 of this will be difficult to implement. A rough sketch of what I have in mind is below.
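Something like this, where env, agent and the action encoding are hypothetical stand-ins for whatever deep_trader actually uses:

ACTION_NAMES = {0: "hold", 1: "buy", 2: "sell"}  # assumed action encoding

def log_episode(env, agent, real_prices):
    # Run one episode and print step, chosen action and the real (un-standardized) price.
    state = env.reset()
    step, done, total_reward = 0, False, 0.0
    while not done:
        action = agent.act(state)                     # greedy action from the model
        state, reward, done = env.step(action)
        total_reward += reward
        print("step=%3d  action=%-4s  price=%.2f  reward=%+.4f"
              % (step, ACTION_NAMES[action], real_prices[step], reward))
        step += 1
    print("episode done, total reward = %.4f" % total_reward)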
b) The above idea actually helped and showed me that the way I have implemented the reward system is not correct: I am taking the highest price of the time interval while buying and the lowest while selling. In a small interval like a minute it is bound to make a loss because of this, and even on a big interval it will be bad. Two solutions come to mind:
1) Take the average of the opening, closing, highest and lowest prices and use that as the price for that particular minute. If the time interval is longer than a minute, first take each minute's average, then average those minute averages over the interval (see the sketch after this list).
2) Make a second network whose main task is just to select the time at which to execute the trading action, based on the continuous stream of time-price data the trader is receiving. This would be a good system to build on top of deep_trader.
I will go with the first one for now and will most likely implement 2 on top of it later. I think implementing both is a good idea.
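A minimal sketch of solution 1, assuming each bar is a plain (open, high, low, close) tuple for one minute; the function names and example numbers are mine, not from the codebase:

def minute_price(bar):
    # Average of open, high, low and close for a single one-minute bar.
    o, h, l, c = bar
    return (o + h + l + c) / 4.0

def interval_price(bars):
    # Average of the per-minute averages over a multi-minute interval.
    return sum(minute_price(b) for b in bars) / float(len(bars))

# Example with a 3-minute interval:
bars = [(100.0, 101.5, 99.5, 100.5),
        (100.5, 102.0, 100.0, 101.0),
        (101.0, 101.8, 100.2, 101.5)]
print(interval_price(bars))  # ~100.79, used instead of the highest/lowest price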
c) I also need to update the standardization of the data and the reward system to use actual prices rather than normalized prices (small sketch below).
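For the reward side, the change is simply to carry the raw price series alongside whatever the network sees and compute rewards from it; a minimal sketch with illustrative names and numbers:

# Raw prices stay available for the reward calculation, independent of whatever
# standardized representation the network is fed.
raw_prices = [100.0, 100.5, 101.0, 100.2]   # actual prices in INR (illustrative)

def trade_reward(buy_step, sell_step, prices=raw_prices):
    # Profit of buying one quantity at buy_step and selling it at sell_step.
    return prices[sell_step] - prices[buy_step]

print(trade_reward(1, 2))   # 0.5 INR, not a normalized delta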
d) Finally, the DQN algo is printing some results which I can observe, but they are not something I like. The average reward comes out negative on test data, and if I ask the algorithm to train itself for multiple iterations it starts just holding; I guess it is not seeing a lot of positive reinforcement. I have a few ideas:
1) Change the reward function so that only the terminal reward is used (minimal sketch after this list).
2) Use PG instead of DQN, since it is online learning, which looks more applicable here.
3) Change the input data vector so it gives a better understanding of the market at a given point in time.
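For idea 1, the change is small if the environment already tracks the episode's running profit; a sketch under that assumption:

def terminal_only_reward(episode_profit, done):
    # Zero reward during the episode; pay out the accumulated profit only at the end.
    return episode_profit if done else 0.0

# inside the (hypothetical) env.step():
#   reward = terminal_only_reward(self.episode_profit, done)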
I can't help but think how much a project teaches me. In my personal life I don't like uncertainty; as humans we love to live in a loop of repetition with certainty. For example, I would rush to ask a girl I like very early whether she likes me or not, because living in limbo is not something my brain allows. But in a project like this you are always in limbo until you see the first ray of the algo improving. I am waiting for that first signal here to kill the anxiety.
e) Good news: dqn_model has shown signs of improvement. When I run dqn_model on the test data 20 times, it makes a profit all 20 times. The average is about 1 INR per episode, but even that is good because I allow the agent to trade only one quantity. When I run a random algo which just takes actions randomly, it actually makes a loss 19 times out of 20. This dqn_model has learnt something. The comparison loop is roughly the sketch below.
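Roughly like this; run_episode, env and the two policies are stand-ins for the project's actual code:

import random

def run_episode(env, policy):
    # Total profit of one episode under the given policy.
    state, done, profit = env.reset(), False, 0.0
    while not done:
        state, reward, done = env.step(policy(state))
        profit += reward
    return profit

def evaluate(env, policy, runs=20):
    profits = [run_episode(env, policy) for _ in range(runs)]
    wins = sum(1 for p in profits if p > 0)
    return wins, sum(profits) / float(runs)   # (profitable runs, average profit)

# trained_policy = lambda s: dqn_model.act(s)          # greedy DQN action
# random_policy  = lambda s: random.choice([0, 1, 2])  # hold / buy / sell
# print(evaluate(env, trained_policy), evaluate(env, random_policy))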
f) I am trying the PGN model now on the stock. The major problem here is that the random algorithm is not getting much positive reward early in training. This makes the algo collapse to just holding, i.e. doing nothing, because zero reward beats the negative rewards it keeps getting. I have a few ideas to solve this:
a) Change the algo to take a lot of random actions for a long time during training (a sketch of such an exploration schedule is after this list).
b) Find out which actions lead to positive reinforcement in the test data and first train a supervised network on those. Then use that network here to train the RL network.
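A minimal sketch of idea a), keeping the random-action probability high for a long time with a slow linear decay; the constants here are placeholders, not tuned values:

import random

EPS_START, EPS_END, DECAY_STEPS = 1.0, 0.1, 200000   # assumed schedule

def exploration_prob(step):
    # Linearly anneal the exploration probability from EPS_START to EPS_END.
    frac = min(1.0, step / float(DECAY_STEPS))
    return EPS_START + frac * (EPS_END - EPS_START)

def choose_action(policy_action, step, n_actions=3):
    # With probability eps take a random action, otherwise the network's action.
    if random.random() < exploration_prob(step):
        return random.randrange(n_actions)    # forced exploration
    return policy_action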