Zijdehoen

Hi guys, I've been asked to try and fit an ML model to a huge mobility dataset that I have. I tried some models but failed to get a decent [performance metric], so I'd love a fresh set of eyes on this!

Features of the dataset

Each row represents the data for one “Origin-Destination” pair. For example, pair “31 to 493” means travel from place 31 to place 493. The first feature is therefore called pair.
For each mode of transport (drive, bike, walk, transit) there are 3 “cost” features, namely: [mode]_time, [mode]_cost, [mode]_convenience. So there are 12 cost features in total (4 modes x 3 costs).
Some extra features: average_income, cars_per_household, jobs_at_destination (representing the people who travel in this pair).
4 observed features, one for each mode. These are the features to predict. Each is a value between 0 and 1, representing the fraction of people in this pair who use this mode of transport.
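
To make the column layout described above concrete, here is a minimal sketch that just builds the feature names from the naming scheme. The target names (`obsrv_drive`, etc.) are an assumption for illustration; the real columns may be named differently.

```python
MODES = ["drive", "bike", "walk", "transit"]
COSTS = ["time", "cost", "convenience"]

# 12 cost features: drive_time, drive_cost, ..., transit_convenience
cost_cols = [f"{mode}_{cost}" for mode in MODES for cost in COSTS]

# Extra features, named as in the post
extra_cols = ["average_income", "cars_per_household", "jobs_at_destination"]

# ASSUMPTION: hypothetical names for the 4 targets, one share per mode
target_cols = [f"obsrv_{mode}" for mode in MODES]

# "pair" is an identifier for the row, not a model input
feature_cols = cost_cols + extra_cols

print(len(cost_cols), len(extra_cols), len(target_cols))
```

So a model for this table would take `feature_cols` as inputs and predict the 4 columns in `target_cols` together.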

Additional information

Sometimes the 3 costs for “transit” are 999, meaning that there is no transit option (train, tram, …) available for this pair. The usual costs lie between 0 and 100.
I deleted the walk_cost feature because every entry was 0.
Here are the distributions of all the features:

https://preview.redd.it/8lt001so6k0c1.png?width=2002&format=png&auto=webp&s=adfe645606746c008941e36fbb35261d0600a8bb

https://preview.redd.it/yeqx4yro6k0c1.png?width=2046&format=png&auto=webp&s=011d05795c203d57d9402805377b0e8741c2378c

https://preview.redd.it/7dtk0gso6k0c1.png?width=2042&format=png&auto=webp&s=6ef5ed2d5f9044dfede43a765ae19afb5c487486

And the correlation matrix:

https://preview.redd.it/61uwu9fx6k0c1.png?width=1718&format=png&auto=webp&s=d299ca5ffa97dadd45804a62ed2b17628f51d0f9
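
For the 999 transit placeholder mentioned under "Additional information", one possible pre-processing step is to turn it into an explicit availability flag, so the model does not read 999 as a very high cost. This is only a dependency-free sketch, not something already applied to the data: the `transit_available` column and the second example pair are made up for illustration.

```python
TRANSIT_COLS = ["transit_time", "transit_cost", "transit_convenience"]
NO_TRANSIT = 999  # placeholder meaning "no transit option for this pair"


def flag_missing_transit(row):
    """Return a copy of row with a transit_available flag; 999s become None."""
    out = dict(row)
    available = not all(row[col] == NO_TRANSIT for col in TRANSIT_COLS)
    out["transit_available"] = int(available)
    if not available:
        for col in TRANSIT_COLS:
            out[col] = None  # leave imputation to a later step or the model
    return out


# Two toy rows (values invented, only to show the behaviour)
rows = [
    {"pair": "31 to 493", "transit_time": 999, "transit_cost": 999, "transit_convenience": 999},
    {"pair": "31 to 12", "transit_time": 25.0, "transit_cost": 3.0, "transit_convenience": 40.0},
]
cleaned = [flag_missing_transit(r) for r in rows]
print(cleaned[0]["transit_available"], cleaned[1]["transit_available"])
```

With the flag in place, the 999 values no longer stretch the 0 to 100 range of the real transit costs.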

So the goal is to predict those 4 observed (obsrv) features. I'm very curious which ML model you would use for this, and why.

If you have any other suggestions, e.g. pre-processing techniques on the data/features, do share!

Thank you guys!

[D] What ML model to use for this mobility problem?
