Hi guys, I've been asked to fit an ML model to a huge mobility dataset that I have. I tried some models but failed to get a decent [performance metric], so I’d love a fresh set of eyes on this!
Features of the dataset
- Each row holds the data for one “Origin-Destination” pair. For example, the pair “31 to 493” covers trips from place 31 to place 493. The first feature is therefore called pair.
- For each mode of transport (drive, bike, walk, transit) there are 3 “cost” features: [mode]_time, [mode]_cost and [mode]_convenience. That gives 12 features in total (4 modes x 3 costs).
- Some extra features: average_income, cars_per_household, jobs_at_destination (describing the people who travel in this pair).
- 4 observed features, one for each mode. These are the targets to predict. Each is a value between 0 and 1, representing the share of people in this pair who use that mode of transport.
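To make the layout concrete, here is a minimal sketch of the columns in pandas. The target names `obsrv_[mode]` are my shorthand and may not match the real column names, the row values are made up purely for illustration, and the check that the four shares sum to 1 is an assumption I still need to verify on the real data.

```python
import numpy as np
import pandas as pd

MODES = ["drive", "bike", "walk", "transit"]
COSTS = ["time", "cost", "convenience"]

# 12 cost features: drive_time, drive_cost, ..., transit_convenience
cost_cols = [f"{mode}_{cost}" for mode in MODES for cost in COSTS]
extra_cols = ["average_income", "cars_per_household", "jobs_at_destination"]
# Hypothetical target names, one observed share per mode
target_cols = [f"obsrv_{mode}" for mode in MODES]

# One made-up row, only to show the shape of the data
row = {"pair": "31 to 493"}
row.update({col: 10.0 for col in cost_cols})
row.update({col: 1.0 for col in extra_cols})
row.update(dict(zip(target_cols, [0.6, 0.1, 0.1, 0.2])))
df = pd.DataFrame([row])

# Assumption to verify: the four shares of a pair add up to 1
share_sum = df[target_cols].sum(axis=1)
shares_sum_to_one = np.isclose(share_sum, 1.0)
```

If the shares really do sum to 1 per pair, the four targets are not independent, which probably matters for the choice of model.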
Additional information
- Sometimes the 3 costs for “transit” are 999, meaning that no transit option (train, tram, …) is available for this pair. The usual costs lie between 0 and 100.
- I deleted the walk_cost feature because every entry was 0.
- Here are the distributions of all the features:
- And the correlation matrix:
So the goal is to predict those 4 obsrv features. Which ML model would you use for this, and why?
If you have any other suggestions, e.g. pre-processing techniques on the data/features, do share!
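As an example of the kind of pre-processing I mean, here is a minimal sketch of one way the 999 transit values could be handled: replace them with NaN and add an availability flag. It assumes pandas and the column names transit_time, transit_cost and transit_convenience; the function name and the `transit_available` column are hypothetical, and I have not validated that this actually helps.

```python
import numpy as np
import pandas as pd

TRANSIT_COLS = ["transit_time", "transit_cost", "transit_convenience"]
SENTINEL = 999  # value that marks "no transit option for this pair"

def flag_missing_transit(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the 999 sentinel replaced by NaN plus a 0/1 flag."""
    out = df.copy()
    # Cast to float first so the columns can hold NaN
    out[TRANSIT_COLS] = out[TRANSIT_COLS].astype(float)
    # A pair has no transit option when all three costs are the sentinel
    no_transit = (out[TRANSIT_COLS] == SENTINEL).all(axis=1)
    out["transit_available"] = (~no_transit).astype(int)
    out.loc[no_transit, TRANSIT_COLS] = np.nan
    return out

# Tiny made-up example: the second pair has no transit option
example = pd.DataFrame({
    "transit_time": [12, 999],
    "transit_cost": [3, 999],
    "transit_convenience": [40, 999],
})
clean = flag_missing_transit(example)
```

Whether NaN plus a flag is better than keeping 999 as-is probably depends on the model, so I'm open to other ideas here.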
Thank you guys!